On-policy vs. Off-policy
- On-policy: the agent being trained and the agent interacting with the environment are the same agent; it learns from games it plays itself.
- Off-policy: the agent being trained and the agent interacting with the environment are different agents; it learns by watching another agent play.
On-policy \(\rightarrow\) Off-policy
\(\nabla \bar R_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\)
- Problem with on-policy: the data are collected with \(\pi_\theta\), so once \(\theta\) is updated the old samples no longer follow the current policy and new training data must be sampled.
- Goal: train \(\theta\) with data sampled from a fixed \(\pi_{\theta'}\), so that the sampled data can be reused for many updates.
Importance Sampling
If \(x^i\) is sampled from \(p(x)\), then \(E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i)\).
But when we can only sample \(x^i\) from \(q(x)\), these samples cannot be plugged into the estimator above directly, so a correction is needed:
\(E_{x\sim p}[f(x)] = \int f(x)p(x)dx = \int f(x) \frac{p(x)}{q(x)}q(x)dx = E_{x \sim q}[f(x)\frac{p(x)}{q(x)}]\)
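A minimal numerical sketch of this correction (the choices of \(p\), \(q\), and \(f\) below are arbitrary and only for illustration): sample from \(q\), weight each sample by \(p(x)/q(x)\), and average.
```python
# Importance-sampling estimate of E_{x~p}[f(x)] using samples from q (illustrative p, q, f).
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):           # target p(x) = N(0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):           # proposal q(x) = N(0.5, 1)
    return np.exp(-0.5 * (x - 0.5)**2) / np.sqrt(2 * np.pi)

f = lambda x: x**2      # E_{x~p}[x^2] = 1 under N(0, 1)

N = 100_000
x_q = rng.normal(0.5, 1.0, N)          # samples come from q, not p
weights = p_pdf(x_q) / q_pdf(x_q)      # importance weights p(x)/q(x)

is_estimate = np.mean(f(x_q) * weights)              # E_{x~q}[f(x) p(x)/q(x)]
direct = np.mean(f(rng.normal(0.0, 1.0, N)))         # plain Monte Carlo under p, for comparison
print(is_estimate, direct)                           # both close to 1.0
```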
The problem with Importance Sampling
Although \(E_{x\sim p}[f(x)]=E_{x \sim q}[f(x)\frac{p(x)}{q(x)}]\), the variances differ:
\(Var_{x\sim p}[f(x)]=E_{x\sim p}[f(x)^2]-(E_{x\sim p}[f(x)])^2\)
\(Var_{x\sim q}[f(x)\frac{p(x)}{q(x)}]=E_{x\sim p}[f(x)^2\frac{p(x)}{q(x)}]-(E_{x\sim p}[f(x)])^2\)
The two variances are not equal, and the difference comes from the factor \(\frac{p(x)}{q(x)}\): if \(p(x)\) and \(q(x)\) differ a lot, the importance-sampling estimate has much larger variance, so \(\frac{p(x)}{q(x)}\) needs to stay close to 1.
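A quick sketch of why this matters: the same estimator gets much noisier as \(q\) moves away from \(p\) (the Gaussians and \(f\) below are arbitrary illustrative choices).
```python
# Variance of the importance-sampling estimator as q drifts away from p = N(0, 1).
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x**2
N = 10_000

def is_estimate(mu_q):
    """One importance-sampling estimate of E_{x~N(0,1)}[f(x)] using q = N(mu_q, 1)."""
    x = rng.normal(mu_q, 1.0, N)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, mu_q, 1.0)
    return np.mean(f(x) * w)

for mu_q in [0.0, 0.5, 3.0]:   # the further q is from p, the noisier the estimate
    estimates = [is_estimate(mu_q) for _ in range(20)]
    print(mu_q, np.mean(estimates), np.std(estimates))
```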
Therefore, applying importance sampling to the policy gradient:
\(\nabla \bar R_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\)
\(\nabla \bar R_\theta = E_{\tau \sim p_{\theta'}(\tau)}[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla \log p_\theta(\tau)]\)
Sample data with \(\theta'\) once, then use that data to update \(\theta\) many times.
How the gradient changes
Recall from the earlier Policy Gradient derivation (with the advantage function):
\(E_{(s_t,a_t)\sim\pi_\theta} [A^\theta(s_t,a_t)\nabla \log p_\theta(a_t|s_t)]\)
In the off-policy setting, sampling from \(\pi_{\theta'}\), this becomes \[ \begin{align} &E_{(s_t,a_t)\sim\pi_{\theta'}} [\frac{P_\theta(s_t,a_t)}{P_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla \log p_\theta(a_t|s_t)] \\ =&E_{(s_t,a_t)\sim\pi_{\theta'}} [\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\bcancel{\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}}A^{\theta'}(s_t,a_t)\nabla \log p_\theta(a_t|s_t)] \end{align} \] where \(A^\theta\) is approximated by \(A^{\theta'}\) (the advantage is estimated from the sampled data), and the state-distribution ratio \(\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\) is assumed to be close to 1 (and is hard to estimate anyway), so it is crossed out.
This gradient corresponds (via \(\nabla f(x)=f(x)\nabla \log f(x)\)) to the objective \(J^{\theta'}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)]\)
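A minimal PyTorch sketch of this surrogate objective, assuming the per-sample log-probabilities and advantages have already been computed from rollouts of \(\pi_{\theta'}\) (the function and tensor names here are my own, not from any particular library):
```python
# Importance-weighted surrogate objective J^{theta'}(theta).
import torch

def surrogate_objective(logp_theta, logp_theta_old, advantages):
    """logp_theta: log p_theta(a_t|s_t) under the current policy (requires grad).
    logp_theta_old: log p_theta'(a_t|s_t) recorded when the data was collected.
    advantages: A^{theta'}(s_t, a_t) estimated from the same rollouts."""
    ratio = torch.exp(logp_theta - logp_theta_old.detach())   # p_theta / p_theta'
    return (ratio * advantages.detach()).mean()

# Usage: maximize J by minimizing -J, reusing the same batch for several gradient steps.
# loss = -surrogate_objective(logp_theta, logp_theta_old, adv)
```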
The remaining problem is that \(p_\theta(a_t|s_t)\) and \(p_{\theta'}(a_t|s_t)\) must not differ too much, otherwise the importance-sampling estimate (and thus the result) is poor.
How do we enforce this?
PPO/TRPO
PPO: Proximal Policy Optimization (the easier one to use in practice)
To keep \(p_\theta(a_t|s_t)\) and \(p_{\theta'}(a_t|s_t)\) from drifting too far apart, a constraint is added: a KL-divergence term measures how similar \(\theta\) and \(\theta'\) are \[ J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta') \\ J^{\theta'}(\theta) = E_{(s_t,a_t)\sim \pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)] \]
Note: \(KL(\theta,\theta')\) here measures the difference between the action distributions output by the two sets of parameters, not the distance between the parameter values themselves.
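A sketch of \(J^{\theta'}_{PPO}\) for a discrete-action policy, with the KL term computed between the two action distributions exactly as the note above describes (the discrete-action assumption and all names are mine):
```python
# KL-penalized PPO objective (PPO-penalty variant) for a discrete-action policy.
import torch

def ppo_kl_penalty_objective(logits_theta, logits_theta_old, actions, advantages, beta):
    """logits_*: action logits of the current / old policy for the sampled states."""
    dist = torch.distributions.Categorical(logits=logits_theta)
    dist_old = torch.distributions.Categorical(logits=logits_theta_old.detach())

    # Importance-weighted surrogate term.
    ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
    surrogate = (ratio * advantages.detach()).mean()

    # KL between the output distributions (not between parameter values).
    kl = torch.distributions.kl_divergence(dist_old, dist).mean()
    return surrogate - beta * kl
```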
TRPO: Trust Region Policy Optimization puts the KL term into a constraint instead of the objective \[ J^{\theta'}_{TRPO}(\theta) = E_{(s_t,a_t)\sim \pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)] \\ KL(\theta,\theta')<\delta \]
PPO
Initialize the policy parameters \(\theta^0\)
In each iteration (a minimal code sketch follows this list):
Use the current \(\theta^k\) to interact with the environment, collect data \(\{s_t,a_t\}\), and compute the advantages \(A^{\theta^k}(s_t,a_t)\)
Find the \(\theta\) that optimizes \(J_{PPO}(\theta)\) \[ J^{\theta^k}_{PPO}(\theta) = J^{\theta^k}(\theta) - \beta KL(\theta, \theta^k) \] updating \(\theta\) several times on the same batch
Adapt \(\beta\) (adaptive KL penalty):
- If \(KL(\theta,\theta^k)>KL_{max}\), increase \(\beta\)
- If \(KL(\theta,\theta^k)<KL_{min}\), decrease \(\beta\)
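Putting the steps above together, a rough sketch of the outer loop with the adaptive KL penalty; `collect_rollouts` and the `policy(states) -> logits` interface are assumed helpers (not a real library API), and `ppo_kl_penalty_objective` is the sketch shown earlier:
```python
# Outer PPO loop with adaptive KL penalty (assumed helpers: collect_rollouts, policy -> logits).
import copy
import torch

def ppo_penalty_training(policy, optimizer, env, iterations,
                         beta=1.0, kl_min=0.01, kl_max=0.05, inner_epochs=10):
    for k in range(iterations):
        old_policy = copy.deepcopy(policy)                                # freeze theta^k
        states, actions, advantages = collect_rollouts(env, old_policy)   # data from theta^k

        # Update theta several times on the same batch of theta^k data.
        for _ in range(inner_epochs):
            objective = ppo_kl_penalty_objective(
                policy(states), old_policy(states), actions, advantages, beta)
            optimizer.zero_grad()
            (-objective).backward()          # maximize J_PPO
            optimizer.step()

        # Adapt beta from the realized KL between the new and old action distributions.
        with torch.no_grad():
            dist = torch.distributions.Categorical(logits=policy(states))
            dist_old = torch.distributions.Categorical(logits=old_policy(states))
            kl = torch.distributions.kl_divergence(dist_old, dist).mean()
        if kl > kl_max:
            beta *= 2.0
        elif kl < kl_min:
            beta *= 0.5
    return policy
```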
PPO2
\[ J^{\theta^k}_{PPO2}(\theta) \approx \sum_{(s_t,a_t)} \min\left(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}A^{\theta^k}(s_t,a_t),\ clip\left(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)},1-\epsilon,1+\epsilon\right)A^{\theta^k}(s_t,a_t)\right) \]
The clip term alone equals the ratio \(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}\) inside \([1-\epsilon, 1+\epsilon]\) and is held constant at \(1-\epsilon\) or \(1+\epsilon\) outside that interval.
Taking the min with the unclipped term: when \(A^{\theta^k}>0\) the objective is capped at \((1+\epsilon)A^{\theta^k}\), so there is no incentive to push the ratio above \(1+\epsilon\); when \(A^{\theta^k}<0\) the objective is flat at \((1-\epsilon)A^{\theta^k}\) once the ratio drops below \(1-\epsilon\), so there is no incentive to push it further down. Either way, \(p_\theta\) is kept close to \(p_{\theta^k}\).
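A sketch of the clipped objective, followed by a few single-sample values that show the behavior described above (function names are mine; \(\epsilon=0.2\) is just a common choice):
```python
# PPO2 clipped surrogate objective and a small demo of the min/clip behavior.
import torch

def ppo2_objective(logp_theta, logp_theta_old, advantages, eps=0.2):
    ratio = torch.exp(logp_theta - logp_theta_old.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return torch.min(unclipped, clipped).mean()

# Single-sample values (eps = 0.2): the objective is capped for A > 0 and flat for A < 0
# once the ratio leaves [1 - eps, 1 + eps] in the "beneficial" direction.
for ratio in [0.5, 1.0, 1.5]:
    for adv in [1.0, -1.0]:
        r = torch.tensor(ratio)
        obj = torch.min(r * adv, torch.clamp(r, 0.8, 1.2) * adv)
        print(f"ratio={ratio:.1f}, A={adv:+.0f} -> objective={obj.item():+.2f}")
```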