
Part 1: Recommender Systems

Problem Formulation

Example: predicting movie ratings

Users rate movies on a scale of one to five stars:

| Movie | Alice | Bob | Carol | Dave |
| --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 |
| Cute puppies of love | ? | 4 | 0 | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Sword vs. karate | 0 | 0 | 5 | ? |

Notation:

\(n_u\): the number of users

\(n_m\): the number of movies

\(r(i,j)\): equals 1 if user \(j\) has rated movie \(i\)

\(y^{(i,j)}\): the rating given by user \(j\) to movie \(i\), defined only where \(r(i,j)=1\)

The recommender-system problem, then, is: given the data \(r(i,j)\) and \(y^{(i,j)}\), look at the movies that have not yet been rated and try to predict their star ratings.

Content-Based Recommendations

| Movie | Alice | Bob | Carol | Dave | \(x_1\) (romance) | \(x_2\) (action) |
| --- | --- | --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
| Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
| Sword vs. karate | 0 | 0 | 5 | ? | 0 | 0.9 |

As the table shows, suppose each movie has two features \(x_1, x_2\), where \(x_1\) measures the degree to which a movie is a romance and \(x_2\) the degree to which it is an action movie.

For each user \(j\), learn a parameter vector \(\theta^{(j)}\in \mathbb R^3\), and predict user \(j\)'s rating of movie \(i\) as \((\theta^{(j)})^Tx^{(i)}\), where

\(x^{(i)} = \left[ \begin{matrix} 1\\ x_1^{(i)} \\ x_2^{(i)} \end{matrix} \right]\)
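As a quick numeric sketch of this prediction (NumPy; the parameter values for Alice are hypothetical, chosen to match a user who likes romance movies):

```python
import numpy as np

# Feature vector for "Love at last": intercept x_0 = 1, romance degree, action degree
x = np.array([1.0, 0.9, 0.0])

# Hypothetical learned parameters for Alice (not from the course)
theta_alice = np.array([0.0, 5.0, 0.0])

# Predicted rating: (theta^(j))^T x^(i)
prediction = theta_alice @ x
print(prediction)  # 4.5
```

The dot product weights each feature by how much the user cares about it, so a strongly romantic movie gets a high predicted rating from a romance-loving user.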

Notation Summary

\(r(i,j)\): equals 1 if user \(j\) has rated movie \(i\)

\(y^{(i,j)}\): the rating given by user \(j\) to movie \(i\), defined only where \(r(i,j)=1\)

\(\theta^{(j)}\): the parameter vector for user \(j\)

\(x^{(i)}\): the feature vector for movie \(i\)

For user \(j\) and movie \(i\), the predicted rating is \((\theta^{(j)})^T(x^{(i)})\)

\(m^{(j)}\): the number of movies rated by user \(j\)

To learn \(\theta^{(j)}\):

\(\min_{\theta^{(j)}}\frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2m^{(j)}}\sum_{k=1}^n (\theta_k^{(j)})^2\)

Since \(m^{(j)}\) is a constant, this is equivalent to:

\(\min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n (\theta_k^{(j)})^2\)

Optimization Objective

To learn \(\theta^{(j)}\) (the parameters for user \(j\)):

\(\min_{\theta^{(j)}}\frac{1}{2}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n (\theta_k^{(j)})^2\)

To learn all of \(\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}\):

\(\min_{(\theta^{(1)},\cdots,\theta^{(n_u)})} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2\)

Optimization Algorithm

\(\min_{(\theta^{(1)},\cdots,\theta^{(n_u)})} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2\)

Gradient descent: \[ \theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)} \quad (\text{for } k=0) \\ \theta_k^{(j)}:=\theta_k^{(j)}-\alpha\Big(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda\theta_k^{(j)}\Big) \quad (\text{for } k \ne 0) \]
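A minimal NumPy sketch of one such update for a single user (the function name and the boolean `rated` mask are my own conventions, not from the course):

```python
import numpy as np

def gradient_step_theta(theta, X, y, rated, alpha, lam):
    """One gradient-descent step for a single user's parameter vector theta.

    X: (n_m, n+1) movie features with an intercept column; y: (n_m,) ratings;
    rated: boolean mask, True where r(i, j) = 1.
    """
    errors = X[rated] @ theta - y[rated]   # (theta^(j))^T x^(i) - y^(i,j)
    grad = X[rated].T @ errors             # sum over the movies user j rated
    reg = lam * theta
    reg[0] = 0.0                           # do not regularize theta_0 (the k = 0 case)
    return theta - alpha * (grad + reg)
```

Zeroing `reg[0]` implements the split between the \(k=0\) and \(k\ne 0\) updates in a single vectorized step.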

Collaborative Filtering

Motivation

In the setting above, the features \(x_1\) and \(x_2\) cannot be read off directly; someone has to determine them by hand, which is time-consuming and labor-intensive. How can we obtain these features more easily?

Now change the assumption slightly: suppose we interview each user and learn how much they like romance movies and how much they like action movies, obtaining:

\(\theta^{(1)} = \left[ \begin{matrix}0 \\ 5 \\ 0 \end{matrix} \right],\theta^{(2)}=\left[ \begin{matrix}0 \\ 5 \\ 0 \end{matrix} \right],\theta^{(3)}=\left[ \begin{matrix}0 \\ 0 \\ 5 \end{matrix} \right], \theta^{(4)} = \left[ \begin{matrix}0 \\ 0 \\ 5 \end{matrix} \right]\), where the second component measures how much the user likes romance movies and the third how much they like action movies.

We can then use the observed ratings of the first movie, \[ (\theta^{(1)})^Tx^{(1)} \approx 5 \\ (\theta^{(2)})^Tx^{(1)} \approx 5 \\ (\theta^{(3)})^Tx^{(1)} \approx 0 \\ (\theta^{(4)})^Tx^{(1)} \approx 0 \] to solve for \(x^{(1)} = \left[ \begin{matrix}1 \\ 1.0 \\ 0.0 \end{matrix} \right]\)
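This recovery can be done numerically: with the four interviewed \(\theta\) vectors fixed, the ratings of "Love at last" determine its features by least squares. A sketch, assuming NumPy:

```python
import numpy as np

# User parameters from the interviews (intercept, romance, action)
Theta = np.array([[0., 5., 0.],
                  [0., 5., 0.],
                  [0., 0., 5.],
                  [0., 0., 5.]])

# The four users' ratings of "Love at last"
y = np.array([5., 5., 0., 0.])

# Solve Theta @ x ≈ y for the movie's feature vector in the least-squares sense.
# Note: x_0 is not determined by the data (every theta_0 is 0), so lstsq's
# minimum-norm solution leaves it at 0; the text fixes x_0 = 1 by convention.
x, *_ = np.linalg.lstsq(Theta, y, rcond=None)
print(np.round(x, 2))  # x_1 = 1.0 (romance), x_2 = 0.0 (action)
```

The recovered \(x_1 = 1.0, x_2 = 0.0\) match the intuition that the movie is pure romance.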

Generalizing and Optimizing the Algorithm

Given \(\theta^{(1)},\cdots,\theta^{(n_u)}\), to learn \(x^{(i)}\):

\(\min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{k=1}^n(x_k^{(i)})^2\)

Given \(\theta^{(1)},\cdots,\theta^{(n_u)}\), to learn \(x^{(1)},\cdots,x^{(n_m)}\):

\(\min_{(x^{(1)},\cdots,x^{(n_m)})}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2\)

Collaborative Filtering

Given \(x^{(1)},\cdots,x^{(n_m)}\) (and the movie ratings),

we can estimate \(\theta^{(1)},\cdots,\theta^{(n_u)}\).

Given \(\theta^{(1)},\cdots,\theta^{(n_u)}\),

we can estimate \(x^{(1)},\cdots,x^{(n_m)}\).

That is: guess an initial \(\theta\), then iterate \(\theta \to x \to \theta \to x \to \theta \to x \to \cdots\)

Collaborative Filtering Algorithm

Collaborative Filtering Optimization Objective

Given \(x^{(1)},\cdots,x^{(n_m)}\), to estimate \(\theta^{(1)},\cdots,\theta^{(n_u)}\):

\(\min_{(\theta^{(1)},\cdots,\theta^{(n_u)})} \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2\)

Given \(\theta^{(1)},\cdots,\theta^{(n_u)}\), to estimate \(x^{(1)},\cdots,x^{(n_m)}\):

\(\min_{(x^{(1)},\cdots,x^{(n_m)})}\frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2\)

Combining the two into a single objective: \[ J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2 \] \(\min_{(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})}J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})\)
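This combined cost can be computed in a few vectorized lines. A sketch assuming NumPy, with \(X\) stacking the movie feature vectors row-wise and \(\Theta\) the user parameter vectors row-wise, and no intercept component in the features:

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative-filtering cost J over all movie features and user parameters.

    X: (n_m, n) movie features; Theta: (n_u, n) user parameters;
    Y: (n_m, n_u) ratings; R: (n_m, n_u) 0/1 indicator r(i, j).
    """
    errors = (X @ Theta.T - Y) * R   # zero out the unrated (i, j) entries
    sq_error = 0.5 * np.sum(errors ** 2)
    reg = 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return sq_error + reg
```

Multiplying by the indicator matrix `R` implements the restriction of the sum to pairs with \(r(i,j)=1\).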

Here we no longer fix \(x_0=1\); instead of \(x\in \mathbb R^{n+1}\), we have \(x \in \mathbb R^n\) and \(\theta \in \mathbb R^n\).

Collaborative Filtering Algorithm

  1. Initialize \(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)}\) to small random values (similar to the random initialization used when training neural networks)

  2. Use gradient descent to minimize \(J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})\) \[ x_k^{(i)} := x_k^{(i)}-\alpha\Big(\sum_{j:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})\theta_k^{(j)}+\lambda x_k^{(i)}\Big) \\ \theta_k^{(j)} := \theta_k^{(j)}-\alpha\Big(\sum_{i:r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})x_k^{(i)}+\lambda \theta_k^{(j)}\Big) \\ \]

  3. Given a user with parameters \(\theta\) and a movie with (learned) features \(x\), predict the user's rating of the movie as \(\theta^Tx\)
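The three steps above can be sketched as one training loop (a NumPy sketch; the function name and hyperparameter values are illustrative, not from the course):

```python
import numpy as np

def cofi_gradient_descent(Y, R, n_features, alpha=0.005, lam=0.1, iters=2000, seed=0):
    """Collaborative filtering by batch gradient descent on X and Theta jointly.

    Y: (n_m, n_u) ratings; R: 0/1 indicator of which entries are observed.
    Returns movie features X (n_m, n) and user parameters Theta (n_u, n).
    """
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    # Step 1: initialize to small random values (symmetry breaking)
    X = rng.normal(scale=0.1, size=(n_m, n_features))
    Theta = rng.normal(scale=0.1, size=(n_u, n_features))
    for _ in range(iters):
        # Step 2: simultaneous gradient step on both parameter sets
        errors = (X @ Theta.T - Y) * R
        X_grad = errors @ Theta + lam * X
        Theta_grad = errors.T @ X + lam * Theta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta
```

Step 3 is then just the matrix product `X @ Theta.T`, whose \((i,j)\) entry is the predicted rating \(\theta^{(j)T}x^{(i)}\).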

Vectorization: Low-Rank Matrix Factorization

| Movie | Alice | Bob | Carol | Dave |
| --- | --- | --- | --- | --- |
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 |
| Cute puppies of love | ? | 4 | 0 | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Sword vs. karate | 0 | 0 | 5 | ? |

\(Y=\left[ \begin{matrix} 5 & 5 & 0 & 0 \\ 5 & ? & ? & 0 \\ ? & 4 & 0 & ? \\ 0 & 0 & 5 & 4 \\ 0 & 0 & 5 & ? \end{matrix} \right]\)

Predicted ratings: \(\left[ \begin{matrix} (\theta^{(1)})^T(x^{(1)}) & (\theta^{(2)})^T(x^{(1)}) & \cdots & (\theta^{(n_u)})^T(x^{(1)}) \\ (\theta^{(1)})^T(x^{(2)}) & (\theta^{(2)})^T(x^{(2)}) & \cdots & (\theta^{(n_u)})^T(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(1)})^T(x^{(n_m)}) & (\theta^{(2)})^T(x^{(n_m)}) & \cdots & (\theta^{(n_u)})^T(x^{(n_m)}) \end{matrix} \right]\)

\(X = \left[\begin{matrix}-(x^{(1)})^T- \\ -(x^{(2)})^T- \\ \vdots \\ -(x^{(n_m)})^T- \end{matrix} \right]\)

\(\Theta = \left[\begin{matrix}-(\theta^{(1)})^T- \\ -(\theta^{(2)})^T- \\ \vdots \\ -(\theta^{(n_u)})^T- \end{matrix} \right]\)

\(y_{predict} = X\Theta^T\)
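In code the vectorized prediction is a single matrix product. A sketch with hypothetical learned factors (no intercept feature, following the collaborative-filtering convention above):

```python
import numpy as np

# Hypothetical learned factors (n = 2 features):
# X stacks the movie feature vectors row-wise, Theta the user parameter vectors
X = np.array([[0.9, 0.0],
              [1.0, 0.01],
              [0.99, 0.0],
              [0.1, 1.0],
              [0.0, 0.9]])
Theta = np.array([[5.0, 0.0],
                  [5.0, 0.0],
                  [0.0, 5.0],
                  [0.0, 5.0]])

# Entry (i, j) of the product is (theta^(j))^T x^(i): all predictions at once
Y_pred = X @ Theta.T
print(Y_pred.shape)  # (5, 4)
print(Y_pred[0])     # predicted ratings of "Love at last" for the four users
```

The product has rank at most \(n\), which is why this formulation is called low-rank.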

This collaborative filtering algorithm is also known as low-rank matrix factorization.

Finding Related Movies

For each product \(i\), we learn a feature vector \(x^{(i)} \in \mathbb R^n\).

To find movies \(j\) related to movie \(i\): if \(\|x^{(i)}-x^{(j)}\|\) is small, movie \(j\) is similar to movie \(i\).

The 5 movies most similar to movie \(i\): find the 5 movies \(j\) with the smallest distance \(\|x^{(i)}-x^{(j)}\|\).
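A sketch of this lookup, assuming NumPy (`most_similar` is my own helper name):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Return indices of the k movies whose features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for every j
    dists[i] = np.inf                         # exclude movie i itself
    return np.argsort(dists)[:k]
```

For example, `most_similar(X, 0, k=5)` returns the five movies nearest to movie 0 in feature space.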

Implementation Detail: Mean Normalization

Example: suppose that, in addition to the four users above, there is a fifth user who has not rated any movies.

What will the collaborative filtering algorithm do with this user?

\(\min_{(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})}\frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2\)

Suppose \(n=2\), i.e. we learn two feature variables, so

\(\theta^{(5)} \in \mathbb R^2\)

Since this user has rated nothing, no term of \(\frac{1}{2} \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)}-y^{(i,j)})^2\) involves \(\theta^{(5)}\).

Therefore \(\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2\) is the only term that affects \(\theta^{(5)}\); that is, we need to choose \(\theta^{(5)}\) to make this term as small as possible,

i.e. to minimize \(\frac{\lambda}{2}((\theta_1^{(5)})^2+(\theta_2^{(5)})^2)\).

The result is \(\theta^{(5)} = \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right]\), so every predicted rating for this user is 0, which is not a useful outcome.

Mean Normalization

Mean normalization solves this problem.

\(Y=\left[ \begin{matrix} 5 & 5 & 0 & 0 &? \\ 5 & ? & ? & 0 &? \\ ? & 4 & 0 & ? &? \\ 0 & 0 & 5 & 4 &? \\ 0 & 0 & 5 & ? &? \end{matrix} \right]\)

Compute each movie's mean rating over the users who rated it: \(\mu = \left[ \begin{matrix} 2.5 \\ 2.5 \\ 2 \\ 2.25 \\ 5/3 \end{matrix} \right]\)

Subtracting each movie's mean from its rated entries gives: \(Y=\left[ \begin{matrix} 2.5 & 2.5 & -2.5 & -2.5 &?\\ 2.5 & ? & ? & -2.5 &? \\ ? & 2 & -2 & ? &? \\ -2.25 & -2.25 & 2.75 & 1.75 &? \\ -5/3 & -5/3 & 10/3 & ? &? \end{matrix} \right]\)

When predicting movie ratings, the steps are as follows:

For user \(j\), the predicted rating on movie \(i\) is:

\((\theta^{(j)})^T(x^{(i)})+\mu_i\)

For user 5, \(\theta^{(5)} = \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right]\), so \(y_{predict}=(\theta^{(5)})^T(x^{(i)})+\mu_i = \mu_i\): the new user is simply predicted each movie's average rating.
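The whole mean-normalization recipe fits in a few NumPy lines (a sketch using NaN for the unrated "?" entries; `np.nanmean` averages over rated entries only, and the movie features are hypothetical):

```python
import numpy as np

# Ratings matrix including a fifth user who has rated nothing (NaN marks "?")
Y = np.array([[5, 5, 0, 0, np.nan],
              [5, np.nan, np.nan, 0, np.nan],
              [np.nan, 4, 0, np.nan, np.nan],
              [0, 0, 5, 4, np.nan],
              [0, 0, 5, np.nan, np.nan]])

# Per-movie mean over the rated entries only
mu = np.nanmean(Y, axis=1)

# Learn on the mean-subtracted matrix; unrated entries stay NaN
Y_norm = Y - mu[:, None]

# Prediction adds the mean back: (theta^(j))^T x^(i) + mu_i.
# For the new user, theta = 0, so every prediction falls back to the movie mean.
theta_new = np.zeros(2)
x_movie0 = np.array([0.9, 0.0])  # hypothetical features for "Love at last"
pred = theta_new @ x_movie0 + mu[0]
print(pred)  # 2.5
```

Adding \(\mu_i\) back at prediction time means a user with no data gets each movie's average rating instead of all zeros.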