Week 9b - Recommender Systems

1. Introduction

  • Two motivations for talking about recommender systems
    • Important application of ML systems
    • Share some big ideas in machine learning
      • e.g. learning what features to use.

2. Problem Formulation

  • Example: Predicting movie ratings
    • Users have already rated some of the movies, and we want to predict how they would rate the movies they haven't rated yet.
    • To simplify the example, users rate movies on a scale from zero to five stars.
    • Notations:
      • n_u = no. of users
      • n_m = no. of movies
        • In this case, n_u = 4, n_m = 5
      • r(i,j) = 1 if user j has rated movie i (0 otherwise).
      • y^{ (i,j) } = rating given by user j to movie i (defined only if r(i,j) = 1).
    • For each user j, learn a parameter vector \theta^{ (j) } \in \mathbb{ R }^3. Predict that user j rates movie i with (\theta^{ (j) })^T x^{ (i) } stars.
      • More generally, \theta^{ (j) } \in \mathbb{ R }^{ n+1 }, including the bias term \theta_0^{ (j) }, where n is the number of features.
      • In this example, every movie has two features: romance and action.
    • Extra notations:
      • \theta^{ (j) } = parameter vector for user j
        • \theta^{ (1) } is for Alice
      • x^{ (i) } = feature vector for movie i
        • For the movie "Love at Last": x^{ (1) } = \begin{bmatrix} 1 \\ 0.9 \\ 0 \end{bmatrix}
      • m^{ (j) } = no. of movies rated by user j.
    • For user j and movie i, the predicted rating is (\theta^{ (j) })^T x^{ (i) }.
    • So if \theta^{ (1) } = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, check the movie "Cute puppies of love": x^{ (3) } = \begin{bmatrix} 1 \\ 0.99 \\ 0 \end{bmatrix}; then (\theta^{ (1) })^T x^{ (3) } = 5 \times 0.99 = 4.95, which is a pretty good estimate for Alice (see the quick check below).
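
A quick numeric check of this prediction in NumPy, using the vectors from the example above:

```python
import numpy as np

theta_1 = np.array([0, 5, 0])       # Alice's parameter vector theta^(1)
x_3 = np.array([1, 0.99, 0])        # "Cute puppies of love": bias, romance, action

print(theta_1 @ x_3)                # (theta^(1))^T x^(3) ≈ 4.95 predicted stars
```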

2.1. Optimization Objective

  • To learn \theta^{ (j) } (the parameter vector for user j): \underset{ \theta^{ (j) } }{ \text{min} } \frac{ 1 }{ 2 } \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ k=1 }^{ n } \big( \theta_k^{ (j) } \big)^2
    • Compared to linear regression, we simply drop the \frac{ 1 }{ m } constant factor.
  • To learn all user parameters \theta^{ (1) }, \theta^{ (2) }, \ldots, \theta^{ (n_u) }: \underset{ \theta^{ (1) }, \ldots, \theta^{ (n_u) } }{ \text{min} } \frac{ 1 }{ 2 } \sum_{ j=1 }^{ n_u } \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ j=1 }^{ n_u } \sum_{ k=1 }^{ n } \big( \theta_k^{ (j) } \big)^2
  • Gradient descent update: \begin{aligned} \theta_k^{ (j) } &:= \theta_k^{ (j) } - \alpha \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big) x_k^{ (i) } && (\text{for } k = 0) \\ \theta_k^{ (j) } &:= \theta_k^{ (j) } - \alpha \Big( \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big) x_k^{ (i) } + \lambda \theta_k^{ (j) } \Big) && (\text{for } k \ne 0) \end{aligned}
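
A minimal sketch of this per-user update in NumPy. The helper name `gradient_step_user` and the data layout (a ratings vector plus a boolean "rated" mask) are illustrative assumptions, not from the lecture; the step regularizes every component except the bias term \theta_0^{ (j) }, matching the two cases above.

```python
import numpy as np

def gradient_step_user(theta_j, X, y_j, rated, alpha, lam):
    """One gradient descent step on a single user's parameter vector theta_j.

    theta_j : (n+1,) parameters; theta_j[0] is the bias term
    X       : (n_m, n+1) movie features; first column is all ones
    y_j     : (n_m,) ratings given by user j (only used where rated is True)
    rated   : (n_m,) boolean mask, True where r(i, j) = 1
    """
    X_r, y_r = X[rated], y_j[rated]
    errors = X_r @ theta_j - y_r                        # (theta^(j))^T x^(i) - y^(i,j) for rated movies
    grad = X_r.T @ errors                               # sum of error * x_k^(i) over rated movies
    reg = np.concatenate(([0.0], lam * theta_j[1:]))    # no regularization for the bias (k = 0)
    return theta_j - alpha * (grad + reg)
```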

3. Collaborative Filtering

  • One property of collaborative filtering is feature learning: the algorithm can learn by itself what features to use.

  • Let's make a different assumption:

    • We don't know the movies' categories (romance or action).
    • But we know our users' preferences: Alice and Bob like romance movies, and Carol and Dave like action movies, so we can write down the \theta^{ (j) } vectors as before.
    • So, for the movie "Love at last" (x^{ (1) }) and the user Alice (\theta^{ (1) }), we should get (\theta^{ (1) })^T x^{ (1) } = 5, and for the rest of the users: (\theta^{ (2) })^T x^{ (1) } = 5, (\theta^{ (3) })^T x^{ (1) } = 0, (\theta^{ (4) })^T x^{ (1) } = 0; from these we can infer x^{ (1) } = \begin{bmatrix} 1 \\ 1.0 \\ 0.0 \end{bmatrix}
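
As a small sketch of this "reverse" inference, we can solve for x^{ (1) } by least squares given the user parameter vectors. Only \theta^{ (1) } appears explicitly above; the other three \theta vectors below are illustrative assumptions based on the idea that Alice/Bob like romance and Carol/Dave like action.

```python
import numpy as np

# Assumed user parameter vectors (bias, romance, action) -- illustrative values only.
Theta = np.array([
    [0, 5, 0],   # Alice
    [0, 5, 0],   # Bob
    [0, 0, 5],   # Carol
    [0, 0, 5],   # Dave
])
y_1 = np.array([5, 5, 0, 0])   # the ratings all four users gave "Love at last"

# Solve Theta @ x_1 ≈ y_1 in the least-squares sense.
x_1, *_ = np.linalg.lstsq(Theta, y_1, rcond=None)
print(x_1)   # ≈ [0, 1.0, 0.0]; the bias component is unconstrained here (every theta_0 = 0),
             # the lecture simply fixes it to 1, giving x^(1) = [1, 1.0, 0.0]
```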

3.1. Optimization Algorithm

  • Given \theta^{ (1) }, \ldots, \theta^{ (n_u) }, to learn the features x^{ (i) } of a single movie: \underset{ x^{ (i) } }{ \text{min} } \frac{ 1 }{ 2 } \sum_{ j: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ k=1 }^{ n } \big( x_k^{ (i) } \big)^2
  • Given \theta^{ (1) }, \ldots, \theta^{ (n_u) }, to learn all movie features x^{ (1) }, x^{ (2) }, \ldots, x^{ (n_m) }: \underset{ x^{ (1) }, \ldots, x^{ (n_m) } }{ \text{min} } \frac{ 1 }{ 2 } \sum_{ i=1 }^{ n_m } \sum_{ j: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ i=1 }^{ n_m } \sum_{ k=1 }^{ n } \big( x_k^{ (i) } \big)^2

3.2. Combine the Algorithms Together

  • From the previous two parts we know:
    • Given \theta^{ (1) }, \ldots, \theta^{ (n_u) }, we can estimate x^{ (1) }, x^{ (2) }, \ldots, x^{ (n_m) };
    • Given x^{ (1) }, x^{ (2) }, \ldots, x^{ (n_m) }, we can estimate \theta^{ (1) }, \ldots, \theta^{ (n_u) }.
  • But which goes first?
  • One idea: randomly initialize the \Theta's, use them to estimate X, then use X to get a better \Theta, and keep going back and forth between estimating \theta and x.

3.3. Collaborative Filtering Algorithm

  • To avoid this back-and-forth process, we use a single objective that minimizes over x^{ (1) }, \ldots, x^{ (n_m) } and \theta^{ (1) }, \ldots, \theta^{ (n_u) } simultaneously: J(x^{ (1) }, \ldots, x^{ (n_m) }, \theta^{ (1) }, \ldots, \theta^{ (n_u) }) = \frac{ 1 }{ 2 } \sum_{ (i,j): r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ i=1 }^{ n_m } \sum_{ k=1 }^{ n } \big( x_k^{ (i) } \big)^2 + \frac{ \lambda }{ 2 } \sum_{ j=1 }^{ n_u } \sum_{ k=1 }^{ n } \big( \theta_k^{ (j) } \big)^2
  • \underset{ x^{ (1) }, \ldots, x^{ (n_m) }, \theta^{ (1) }, \ldots, \theta^{ (n_u) } }{ \text{min} }\ J(x^{ (1) }, \ldots, x^{ (n_m) }, \theta^{ (1) }, \ldots, \theta^{ (n_u) })

    • Notice that \displaystyle\sum_{ (i,j): r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2
      • = \displaystyle\sum_{ i=1 }^{ n_m } \sum_{ j: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2
      • = \displaystyle\sum_{ j=1 }^{ n_u } \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big)^2
  • For convenience, we get rid of \theta_0^{ (j) } and x_0^{ (i) }, so x^{ (i) } \in \mathbb{ R }^n and \theta^{ (j) } \in \mathbb{ R }^n.

    • The reason we can do this is that we are now learning all the features, so there is no need to hard-code a feature that always equals one.

3.4. Steps of Collaborative Filtering Algorithm

  1. Initialize x^{ (1) }, \ldots, x^{ (n_m) }, \theta^{ (1) }, \ldots, \theta^{ (n_u) } to small random values.
  2. Minimize J(x^{ (1) }, \ldots, x^{ (n_m) }, \theta^{ (1) }, \ldots, \theta^{ (n_u) }) using gradient descent (or an advanced optimization algorithm). E.g. for every j = 1, \ldots, n_u and i = 1, \ldots, n_m (see the sketch after these steps): \begin{aligned} x_k^{ (i) } &:= x_k^{ (i) } - \alpha \Big( \sum_{ j: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big) \theta_k^{ (j) } + \lambda x_k^{ (i) } \Big) \\ \theta_k^{ (j) } &:= \theta_k^{ (j) } - \alpha \Big( \sum_{ i: r(i,j) = 1 } \big( (\theta^{ (j) })^T x^{ (i) } - y^{ (i,j) } \big) x_k^{ (i) } + \lambda \theta_k^{ (j) } \Big) \end{aligned}
    • We have already gotten rid of \theta_0^{ (j) } and x_0^{ (i) }, so there are no special cases for k = 0.
  3. For a user with parameters \theta and a movie with (learned) features x, predict a star rating of \theta^T x.
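
A minimal NumPy sketch of these three steps, implemented as plain batch gradient descent. The function names, matrix layout (Y and the indicator R as n_m × n_u matrices, X as n_m × n, Theta as n_u × n) and hyperparameter values are illustrative assumptions, not from the lecture.

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost J and its gradients w.r.t. X and Theta.

    X     : (n_m, n)   movie feature vectors, one per row
    Theta : (n_u, n)   user parameter vectors, one per row
    Y     : (n_m, n_u) ratings, meaningful only where R == 1
    R     : (n_m, n_u) indicator matrix, R[i, j] = 1 if user j rated movie i
    """
    E = (X @ Theta.T - Y) * R                       # prediction errors, zeroed where r(i,j) = 0
    J = 0.5 * np.sum(E ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = E @ Theta + lam * X                    # partial derivatives w.r.t. x_k^(i)
    Theta_grad = E.T @ X + lam * Theta              # partial derivatives w.r.t. theta_k^(j)
    return J, X_grad, Theta_grad

def train(Y, R, n=10, alpha=0.005, lam=0.1, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    X = 0.1 * rng.standard_normal((n_m, n))         # step 1: small random initialization
    Theta = 0.1 * rng.standard_normal((n_u, n))
    for _ in range(iters):                          # step 2: gradient descent on J
        _, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y, R, lam)
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta                                 # step 3: predict with X @ Theta.T
```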

4. Low Rank Matrix Factorization

4.1. Vectorization: Low Rank Matrix Factorization

  • Grouping the ratings into a matrix, we get: Y = \begin{bmatrix} 5 & 5 & 0 & 0 \\ 5 & ? & ? & 0 \\ ? & 4 & 0 & ? \\ 0 & 0 & 5 & 4 \\ 0 & 0 & 5 & 0 \end{bmatrix}
  • And the predicted ratings can be written as the matrix: \begin{bmatrix} (\theta^{ (1) })^T x^{ (1) } & (\theta^{ (2) })^T x^{ (1) } & \ldots & (\theta^{ (n_u) })^T x^{ (1) } \\ (\theta^{ (1) })^T x^{ (2) } & (\theta^{ (2) })^T x^{ (2) } & \ldots & (\theta^{ (n_u) })^T x^{ (2) } \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{ (1) })^T x^{ (n_m) } & (\theta^{ (2) })^T x^{ (n_m) } & \ldots & (\theta^{ (n_u) })^T x^{ (n_m) } \end{bmatrix}
    • Entry (i, j) is the rating we predict user j gives to movie i, namely (\theta^{ (j) })^T x^{ (i) }.
  • There is a vectorized way to write this:
    • In particular, if we define the matrices: X = \begin{bmatrix} - & (x^{ (1) })^T & - \\ & \vdots & \\ - & (x^{ (n_m) })^T & - \end{bmatrix},\ \Theta = \begin{bmatrix} - & (\theta^{ (1) })^T & - \\ & \vdots & \\ - & (\theta^{ (n_u) })^T & - \end{bmatrix}
    • Then we can write all the predicted ratings as X\Theta^T.
    • This algorithm is called Low Rank Matrix Factorization.
      • It factors the (high-dimensional) rating matrix into the product of two low-dimensional matrices.
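
In code, the vectorized prediction is a single matrix product (a short sketch; the random X and Theta below are placeholders for learned values):

```python
import numpy as np

n_m, n_u, n = 5, 4, 2
X = np.random.rand(n_m, n)        # row i is (x^(i))^T
Theta = np.random.rand(n_u, n)    # row j is (theta^(j))^T

predictions = X @ Theta.T         # (n_m, n_u); entry (i, j) = (theta^(j))^T x^(i)
```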

4.2. Finding Related Movies

  • For each product i, we learn a feature vector x^{ (i) } \in \mathbb{ R }^n.
    • E.g. x_1 = romance, x_2 = action, ...
  • How do we find movies j related to movie i?
    • Movie j is related if \lVert x^{ (i) } - x^{ (j) } \rVert is small.
  • So, if we want the 5 movies most similar to movie i:
    • Just find the 5 movies j with the smallest \lVert x^{ (i) } - x^{ (j) } \rVert.
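
A small sketch of this nearest-neighbour lookup over the learned feature vectors (the helper name `most_similar` is illustrative; X is the movie-feature matrix with one movie per row, as above):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Indices of the k movies whose feature vectors are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every movie j
    dists[i] = np.inf                          # exclude movie i itself
    return np.argsort(dists)[:k]
```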

4.3. Implementational Detail: Mean Normalization

  • Say we have a user, Eve, who hasn't rated any movies yet.
  • Suppose we simply run the algorithm and minimize the cost J to learn Eve's parameter vector \theta^{ (5) }.
    • Since Eve has rated no movies, there is no term with r(i,5) = 1, so the squared-error part of J does not involve \theta^{ (5) } at all. For Eve we just need to minimize \frac{ \lambda }{ 2 } \sum_{ j=1 }^{ n_u } \sum_{ k=1 }^{ n } \big( \theta_k^{ (j) } \big)^2, whose only part involving her is \frac{ \lambda }{ 2 } \Big[ (\theta^{ (5) }_1)^2 + (\theta^{ (5) }_2)^2 \Big]. So we get \theta^{ (5) } = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, and every rating we predict for Eve is 0 stars. This conclusion doesn't help; we need a better approach.

Mean Normalization

  • Compute each movie's average rating, then subtract it from that movie's ratings so every movie's mean rating is zero.
  • We know that Y = \begin{bmatrix} 5 & 5 & 0 & 0 & ? \\ 5 & ? & ? & 0 & ? \\ ? & 4 & 0 & ? & ? \\ 0 & 0 & 5 & 4 & ? \\ 0 & 0 & 5 & 0 & ? \end{bmatrix}
  • Then \mu = \begin{bmatrix} 2.5 \\ 2.5 \\ 2 \\ 2.25 \\ 1.25 \end{bmatrix} \to Y = \begin{bmatrix} 2.5 & 2.5 & -2.5 & -2.5 & ? \\ 2.5 & ? & ? & -2.5 & ? \\ ? & 2 & -2 & ? & ? \\ -2.25 & -2.25 & 2.75 & 1.75 & ? \\ -1.25 & -1.25 & 3.75 & -1.25 & ? \end{bmatrix}
  • Learning on the mean-normalized data, Eve's parameters still come out as \theta^{ (5) } = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
  • This makes sense: if Eve hasn't rated any movies, predicting each movie's average rating is the most sensible default.
  • Don't forget that when predicting user j's rating for movie i, we must add the mean back: (\theta^{ (j) })^T x^{ (i) } + \mu_i
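
A short sketch of mean normalization with a rating-indicator matrix (illustrative helper name `normalize_ratings`; Y and R laid out as in the earlier sketch): compute each movie's mean over its rated entries only, subtract it before learning, and add it back when predicting.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = R.sum(axis=1)                                        # number of ratings per movie
    sums = (Y * R).sum(axis=1)
    mu = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)  # mean rating per movie
    Y_norm = (Y - mu[:, None]) * R                                # unrated entries stay at 0
    return Y_norm, mu

# Train on Y_norm instead of Y; then predict user j's rating of movie i as
#   (theta^(j))^T x^(i) + mu[i]
```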
