
From On-policy to Off-policy

On-policy vs. Off-policy

  • On-policy: the agent being trained and the agent interacting with the environment are the same agent; like learning by playing the game yourself
  • Off-policy: the agent being trained and the agent interacting with the environment are different agents; like learning by watching someone else play

On-policy \(\rightarrow\) Off-policy

\(\nabla \bar R_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\)

  • The problem with on-policy training: data are collected with \(\pi_\theta\), so once \(\theta\) is updated the training data must be resampled
  • Goal: train \(\theta\) with data sampled from a fixed \(\pi_{\theta'}\), so the sampled data can be reused

Importance Sampling

Suppose \(x^i\) is sampled from \(p(x)\); then \(E_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^N f(x^i)\).

But if we can only sample \(x^i\) from \(q(x)\), those samples cannot be plugged into the formula above, so we apply a correction:

\(E_{x\sim p}[f(x)] = \int f(x)p(x)dx = \int f(x) \frac{p(x)}{q(x)}q(x)dx = E_{x \sim q}[f(x)\frac{p(x)}{q(x)}]\)
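As a quick numerical illustration of this identity (the distributions and \(f\) below are made-up examples, not from the original), one can estimate \(E_{x\sim p}[f(x)]\) using samples drawn from \(q\) by reweighting each sample with \(\frac{p(x)}{q(x)}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice (not from the original): p = N(0, 1), q = N(0.5, 1), f(x) = x^2
def p_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-(x - 0.5) ** 2 / 2) / np.sqrt(2 * np.pi)

def f(x):
    return x ** 2

N = 100_000

# Direct Monte Carlo estimate: sample from p
x_p = rng.normal(0.0, 1.0, N)
direct = f(x_p).mean()

# Importance-sampling estimate: sample from q, reweight each sample by p(x)/q(x)
x_q = rng.normal(0.5, 1.0, N)
reweighted = (f(x_q) * p_pdf(x_q) / q_pdf(x_q)).mean()

print(direct, reweighted)  # both approximate E_{x~p}[x^2] = 1
```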

Problems with Importance Sampling

Although \(E_{x\sim p}[f(x)]=E_{x \sim q}[f(x)\frac{p(x)}{q(x)}]\), a short derivation of the variances gives:

\(Var_{x\sim p}[f(x)]=E_{x\sim p}[f(x)^2]-(E_{x\sim p}[f(x)])^2\)

\(Var_{x\sim q}[f(x)\frac{p(x)}{q(x)}]=E_{x\sim p}[f(x)^2\frac{p(x)}{q(x)}]-(E_{x\sim p}[f(x)])^2\)

The two variances are not equal, and the difference comes mainly from the factor \(\frac{p(x)}{q(x)}\): if \(p(x)\) and \(q(x)\) differ a lot, the two variances differ a lot as well, so \(\frac{p(x)}{q(x)}\) needs to stay close to 1.
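A rough empirical check of this point (the distributions below are illustrative assumptions): as the proposal \(q\) moves away from \(p\), the weights \(\frac{p(x)}{q(x)}\) become extreme and the importance-sampling estimates of \(E_{x\sim p}[x^2]=1\) scatter widely, even though the estimator is still unbiased in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

def is_estimates(q_mean, n_runs=200, n_samples=1_000):
    """Importance-sampling estimates of E_{x~p}[x^2] with proposal q = N(q_mean, 1)."""
    def q_pdf(x):
        return np.exp(-(x - q_mean) ** 2 / 2) / np.sqrt(2 * np.pi)
    estimates = []
    for _ in range(n_runs):
        x = rng.normal(q_mean, 1.0, n_samples)
        estimates.append(np.mean(x ** 2 * p_pdf(x) / q_pdf(x)))
    return np.array(estimates)

# As q drifts away from p, the weights p/q explode and the estimates get noisy
for q_mean in [0.0, 1.0, 3.0]:
    e = is_estimates(q_mean)
    print(f"q = N({q_mean}, 1): mean of estimates = {e.mean():.2f}, std = {e.std():.2f}")
```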

Hence:

\(\nabla \bar R_\theta = E_{\tau \sim p_\theta(\tau)}[R(\tau)\nabla \log p_\theta(\tau)]\)

\(\nabla \bar R_\theta = E_{\tau \sim p_{\theta'}(\tau)}[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\nabla \log p_\theta(\tau)]\)

Sample data with \(\theta'\), and use that data to train \(\theta\) multiple times.

Modifying the Gradient

The earlier policy gradient derivation gave

\(E_{(s_t,a_t)\sim\pi_\theta} [A^\theta(s_t,a_t)\nabla \log p_\theta(a_t^n|s_t^n)]\)

Therefore, in the off-policy setting we have \[ \begin{align} &E_{(s_t,a_t)\sim\pi_{\theta'}} [\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\nabla \log p_\theta(a_t^n|s_t^n)] \\ =&E_{(s_t,a_t)\sim\pi_{\theta'}} [\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\bcancel{\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}}A^{\theta'}(s_t,a_t)\nabla \log p_\theta(a_t^n|s_t^n)] \end{align} \] where the state-distribution ratio \(\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\) is assumed to be close to 1 and is cancelled.

The corresponding objective function, whose gradient is the expression above, is \(J^{\theta'}(\theta)=E_{(s_t,a_t)\sim\pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)]\).

The remaining issue is that \(p_\theta(a_t|s_t)\) and \(p_{\theta'}(a_t|s_t)\) must not differ too much, otherwise the result degrades.
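A minimal PyTorch-style sketch of this surrogate (tensor names, shapes, and the dummy data are assumptions for illustration): the per-step ratio \(\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\) is evaluated on data collected by \(\theta'\) and used to weight the advantage.

```python
import torch

def surrogate_objective(logp_theta, logp_theta_old, advantages):
    """J^{theta'}(theta): importance-weighted advantage on data gathered by theta'.

    logp_theta     -- log p_theta(a_t | s_t), differentiable w.r.t. theta
    logp_theta_old -- log p_theta'(a_t | s_t), recorded by the behaviour policy
    advantages     -- A^{theta'}(s_t, a_t), treated as constants
    """
    ratio = torch.exp(logp_theta - logp_theta_old.detach())
    return (ratio * advantages.detach()).mean()

# Illustrative usage with dummy tensors standing in for a batch of (s_t, a_t) pairs
logp_new = torch.randn(32, requires_grad=True)
logp_old = torch.randn(32)
adv = torch.randn(32)
loss = -surrogate_objective(logp_new, logp_old, adv)  # maximize J <=> minimize -J
loss.backward()
```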

How do we enforce that?

PPO/TRPO

PPO: Proximal Policy Optimization (easier to use in practice)

To keep \(p_\theta(a_t|s_t)\) and \(p_{\theta'}(a_t|s_t)\) from drifting apart, a constraint is added: a KL-divergence penalty that keeps \(\theta\) and \(\theta'\) similar \[ J^{\theta'}_{PPO}(\theta) = J^{\theta'}(\theta) - \beta KL(\theta, \theta') \\ J^{\theta'}(\theta) = E_{(s_t,a_t)\sim \pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)] \]

Note: \(KL(\theta,\theta')\) here measures the difference between the action distributions output by the two sets of parameters, not the numerical difference between the parameters themselves.
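A hedged sketch of this penalized objective (function and tensor names are assumptions, and a discrete action space is assumed): the KL term is computed between the two policies' action distributions at the same states, matching the note above, not between the parameter vectors.

```python
import torch
import torch.nn.functional as F

def ppo_penalty_objective(logits_theta, logits_theta_old, actions, advantages, beta):
    """J_PPO(theta) = J^{theta'}(theta) - beta * KL(theta', theta), averaged over a batch.

    logits_theta     -- action logits of the policy being trained (requires grad)
    logits_theta_old -- action logits of the data-collecting policy theta'
    actions          -- indices of the actions actually taken
    """
    logp = F.log_softmax(logits_theta, dim=-1)
    logp_old = F.log_softmax(logits_theta_old, dim=-1).detach()

    # Importance ratio p_theta(a_t|s_t) / p_theta'(a_t|s_t) for the taken actions
    idx = actions.unsqueeze(-1)
    ratio = torch.exp(logp.gather(-1, idx) - logp_old.gather(-1, idx)).squeeze(-1)
    j_surrogate = (ratio * advantages.detach()).mean()

    # KL between the output action distributions (not between the parameters)
    kl = (logp_old.exp() * (logp_old - logp)).sum(dim=-1).mean()
    return j_surrogate - beta * kl
```

For a continuous action space, the ratio and the KL term would instead be computed from the policy's density (e.g. a Gaussian), but the structure of the objective is the same.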

TRPO: Trust Region Policy Optimization, which keeps the same surrogate but moves the KL term into a constraint \[ J^{\theta'}_{TRPO}(\theta) = E_{(s_t,a_t)\sim \pi_{\theta'}}[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}A^{\theta'}(s_t,a_t)], \quad \text{subject to } KL(\theta,\theta')<\delta \]

PPO

  • Initialize the policy parameters \(\theta^0\)

  • In each iteration:

    • Use the current \(\theta^k\) to interact with the environment, collect data \(\{s_t,a_t\}\), and compute \(A^{\theta^k}(s_t,a_t)\)

    • Optimize \(J_{PPO}(\theta)\) with respect to \(\theta\) \[ J^{\theta^k}_{PPO}(\theta) = J^{\theta^k}(\theta) - \beta KL(\theta, \theta^k) \] updating \(\theta\) several times on the same batch

    • Adaptively adjust \(\beta\) (a sketch of this rule follows the list):

      • If \(KL(\theta,\theta^k)>KL_{max}\), increase \(\beta\)
      • If \(KL(\theta,\theta^k)<KL_{min}\), decrease \(\beta\)
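A minimal sketch of the adaptive-\(\beta\) rule above (\(KL_{min}\), \(KL_{max}\), and the scaling factor are illustrative hyperparameters):

```python
def adapt_beta(beta, kl, kl_min, kl_max, factor=2.0):
    """Adjust the KL-penalty weight after each round of policy updates.

    If the new policy drifted too far (kl > kl_max), weight the KL term more;
    if it barely moved (kl < kl_min), weight it less.
    """
    if kl > kl_max:
        beta *= factor
    elif kl < kl_min:
        beta /= factor
    return beta
```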

PPO2

\[ J^{\theta^k}_{PPO2}(\theta) \approx \sum_{(s_t,a_t)} \min\left(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}A^{\theta^k}(s_t,a_t),\ clip\left(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)},1-\epsilon,1+\epsilon\right)A^{\theta^k}(s_t,a_t)\right) \]

Considering the clip term alone: its output is \(1-\epsilon\) when the ratio \(\frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}\) falls below \(1-\epsilon\), the ratio itself when it lies inside \([1-\epsilon,1+\epsilon]\), and \(1+\epsilon\) when it exceeds \(1+\epsilon\).

Combining it with the unclipped term through the \(\min\): when \(A^{\theta^k}>0\) the objective is capped at \((1+\epsilon)A^{\theta^k}\), so there is nothing to gain from pushing the ratio above \(1+\epsilon\); when \(A^{\theta^k}<0\) the value stays at \((1-\epsilon)A^{\theta^k}\) once the ratio drops below \(1-\epsilon\), so there is nothing to gain from pushing it further down. Either way, \(\theta\) is discouraged from moving far away from \(\theta^k\).
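A minimal PyTorch sketch of this clipped surrogate (tensor names are assumptions; log-probabilities are used for numerical stability):

```python
import torch

def ppo2_objective(logp_theta, logp_theta_k, advantages, eps=0.2):
    """Clipped surrogate J_PPO2(theta) over a batch of (s_t, a_t) pairs."""
    ratio = torch.exp(logp_theta - logp_theta_k.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise min makes the objective pessimistic: large policy changes
    # are never rewarded, whatever the sign of the advantage.
    return torch.min(unclipped, clipped).mean()
```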