Generalized Advantage Estimator Explained
I have noticed that the original paper does not give a low-level mathematical explanation of how the Generalized Advantage Estimator is derived. Moreover, the paper skips some steps that are interesting to know. In this post I will give a quick high-level explanation and then go through the low-level derivation. Most of this material is taken from my Master's thesis [1].
High-level explanation
The main idea of the Generalized Advantage Estimator (GAE) is to produce an estimator with significantly lower variance at the cost of adding some bias. This estimator can be used to update the policy parameters, and thanks to its lower variance it improves learning.
The main form of the GAE presented in the paper [2] is:
$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V$$

with $\delta_i^V = -V(s_i) + r_i + \gamma V(s_{i+1})$, where $V(s_t)$ is the value function, $r_i$ is the reward at the $i$-th time step, and $\gamma, \lambda$ are the discount factors (with the GAE we have two parameters to control the discounting). Personally, I think it is easier to understand using the following form (I will prove how to derive it later in this post) [1] [3].
$$\hat{A}_t^{GAE(\gamma,\lambda)} = R_t(\lambda) - V(s_t)$$

This last equation looks much simpler, and indeed it is. $R_t(\lambda) = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} R_t^{(n)}$ is the $\lambda$-return and $R_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l r_{t+l} + \gamma^n V(s_{t+n})$ is the $n$-step return. Using the $n$-step return allows us to trade off variance against bias. The $n$-step return has lower variance for smaller $n$ because it uses fewer reward terms $r_i$ (only $n$ of them), but it relies more heavily on the value estimate $V(s_{t+n})$ (discounted only by $\gamma^n$), so it has higher bias (Sutton and Barto [4] dive deeper into this). For instance, when $n=\infty$ we have $R_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$, so there is no bias but the variance is high; when $n=1$ we have $R_t^{(1)} = r_t + \gamma V(s_{t+1})$, which has the lowest variance but very high bias.
$R_t(\lambda)$ uses all the $n$-step returns, so we cannot tune $n$ directly, but by varying $\lambda$ we can manage the trade-off between bias and variance. $\lambda$ works as a discount factor: with a lower $\lambda$ we discount the higher-$n$ returns more, so both the high variance and the low bias they contribute are reduced. The opposite happens for a higher $\lambda$.
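To make the equivalence between the two forms concrete, here is a small numerical sanity check (my own sketch, not code from the paper). It assumes a single episode that terminates at step $T$, so $V(s_T) = 0$ and every $n$-step return with $n \geq T - t$ collapses to the plain Monte Carlo return; it then verifies that the discounted sum of TD residuals matches $R_t(\lambda) - V(s_t)$ for random rewards and value estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20                       # episode length
gamma, lam = 0.99, 0.95
r = rng.normal(size=T)       # rewards r_0 ... r_{T-1}
V = rng.normal(size=T + 1)   # value estimates V(s_0) ... V(s_T)
V[T] = 0.0                   # the terminal state has zero value

# Form 1: A_t = sum_l (gamma * lam)^l * delta_{t+l},
# computed with the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.
delta = r + gamma * V[1:] - V[:-1]
gae = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = delta[t] + gamma * lam * acc
    gae[t] = acc

# Form 2: A_t = R_t(lambda) - V(s_t).
def n_step_return(t, n):
    """R_t^(n): n discounted rewards plus a bootstrap from V (the MC return past the end)."""
    n = min(n, T - t)
    return sum(gamma**l * r[t + l] for l in range(n)) + gamma**n * V[t + n]

def lambda_return(t):
    # (1 - lam) * sum_{n >= 1} lam^(n-1) * R_t^(n); every term with n >= T - t equals
    # the Monte Carlo return, so that tail sums to lam^(T-t-1) * R_t^(T-t).
    head = (1 - lam) * sum(lam**(n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return head + lam**(T - t - 1) * n_step_return(t, T - t)

gae_from_lambda = np.array([lambda_return(t) - V[t] for t in range(T)])
assert np.allclose(gae, gae_from_lambda)
```

The backward recursion in form 1 is also how the GAE is usually computed in practice, since it only needs a single pass over the trajectory.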
Derivation of an unbiased estimator
Usually, the GAE is used with policy gradient methods. The general form of the gradient is
$$g = \mathbb{E}\left[\sum_{t=0}^{\infty} \psi_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$

where $\psi_t$ is an estimator of how good the action was. There are different choices, for example $\sum_t r_t$, $\sum_t r_t - b(s_t)$, or $Q^\pi(s_t, a_t)$, but using $\psi_t = A^\pi(s_t, a_t)$ gives the lowest possible variance.
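As a concrete illustration of this estimator (a minimal sketch of my own, not code from any of the papers), here is how one batch estimate of $g$ with $\psi_t$ set to an advantage estimate might look for a tabular softmax policy; `states`, `actions` and `advantages` are assumed to come from rollouts collected elsewhere:

```python
import numpy as np

def policy_gradient(theta, states, actions, advantages):
    """Estimate g = E[sum_t psi_t * grad_theta log pi_theta(a_t|s_t)],
    averaged over the time steps in the batch.

    theta: (n_states, n_actions) array of logits of a tabular softmax policy.
    """
    grad = np.zeros_like(theta)
    for s, a, adv in zip(states, actions, advantages):
        logits = theta[s]
        pi = np.exp(logits - logits.max())     # softmax probabilities pi_theta(.|s)
        pi /= pi.sum()
        grad_log_pi = -pi                      # grad of log softmax w.r.t. the logits
        grad_log_pi[a] += 1.0                  # ... is e_a - pi for the taken action a
        grad[s] += adv * grad_log_pi           # psi_t * grad log pi_theta(a_t|s_t)
    return grad / len(actions)
```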
Schulman et al. work with the objective of maximizing the undiscounted return, not the discounted return. Therefore they introduce the discount factor $\gamma$ as a method to reduce variance, not as a discount factor. Including this variance-reducing $\gamma$, we can write the gradient as

$$g^\gamma = \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$$

with $A^{\pi,\gamma}$ being the advantage computed with the discounted return. Next we show that such an advantage function does not introduce any bias. Let's define a $\gamma$-just estimator: $\hat{A}$ is $\gamma$-just if it does not add bias; in our case $A^{\pi,\gamma}$ only adds the variance reduction due to $\gamma$. The formal definition is that $\hat{A}_t$ is $\gamma$-just if
$$\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty}) \nabla_\theta \log \pi_\theta(a_t|s_t)\right] = g^\gamma$$

Schulman et al. prove that if $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$ such that $\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$, then $\hat{A}_t$ is $\gamma$-just; this is their Proposition 1 (see proof below).
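If I recall correctly, the paper also lists some concrete examples satisfying this condition: the following choices of $\hat{A}_t$ are all $\gamma$-just when the exact value functions $V^{\pi,\gamma}$ and $Q^{\pi,\gamma}$ are used.

$$\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}, \qquad \hat{A}_t = Q^{\pi,\gamma}(s_t, a_t), \qquad \hat{A}_t = A^{\pi,\gamma}(s_t, a_t), \qquad \hat{A}_t = r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)$$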
Derivation of the functional form of the GAE
Here I show how to derive the first expression of the GAE, as Schulman et al. do in the paper.
We define $\delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t)$, which can be considered an estimator of the advantage (with $r_t + \gamma V(s_{t+1})$ playing the role of $Q$), and based on it we define $\hat{A}_t^{(k)}$ as:
$$\begin{aligned}
\hat{A}_t^{(1)} &= \delta_t^V = -V(s_t) + r_t + \gamma V(s_{t+1}) \\
\hat{A}_t^{(2)} &= \delta_t^V + \gamma \delta_{t+1}^V = -V(s_t) + r_t + \gamma V(s_{t+1}) - \gamma V(s_{t+1}) + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\
&= -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\
\hat{A}_t^{(3)} &= \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3}) \\
&\;\;\vdots \\
\hat{A}_t^{(k)} &= \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}^V = -V(s_t) + \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k})
\end{aligned}$$

Using the definition of $\hat{A}_t^{(k)}$ we define the Generalized Advantage Estimator $\hat{A}_t^{GAE(\gamma,\lambda)}$ as
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \dots\right) + \gamma\lambda\,\delta_{t+1}^V\left(1 + \lambda + \lambda^2 + \dots\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\frac{1}{1-\lambda} + \gamma\lambda\,\delta_{t+1}^V\frac{1}{1-\lambda} + \dots\right) \\
&= \sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}^V
\end{aligned}$$

QED.
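As a quick sanity check on the telescoping used for $\hat{A}_t^{(k)}$ above, here is a small symbolic verification (my own, using sympy, for a fixed $k = 4$):

```python
import sympy as sp

k = 4
gamma = sp.symbols("gamma")
r = sp.symbols(f"r0:{k}")       # r_t ... r_{t+k-1}
V = sp.symbols(f"V0:{k + 1}")   # V(s_t) ... V(s_{t+k})

# TD residuals delta_{t+l} = r_{t+l} + gamma * V(s_{t+l+1}) - V(s_{t+l})
delta = [r[l] + gamma * V[l + 1] - V[l] for l in range(k)]

lhs = sum(gamma**l * delta[l] for l in range(k))                          # sum of discounted residuals
rhs = -V[0] + sum(gamma**l * r[l] for l in range(k)) + gamma**k * V[k]    # telescoped form
assert sp.simplify(lhs - rhs) == 0
```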
Simplification of the functional form of the GAE
Here I show how to express
$$\sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}^V$$

as
$$R_t(\lambda) - V(s_t)$$

This derivation originally appears in my Master's thesis, Appendix C [1].
From the section above, we can express the GAE as
$$\hat{A}_t^{GAE(\gamma,\lambda)} = (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\right)$$

From this definition we have
$$\begin{aligned}
\hat{A}_t^{GAE(\gamma,\lambda)} &= (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \dots\right) \\
&= (1-\lambda)\left(\left[-V(s_t) + \sum_{l=0}^{0}\gamma^l r_{t+l} + \gamma^1 V(s_{t+1})\right] + \lambda\left[-V(s_t) + \sum_{l=0}^{1}\gamma^l r_{t+l} + \gamma^2 V(s_{t+2})\right] + \lambda^2\left[-V(s_t) + \sum_{l=0}^{2}\gamma^l r_{t+l} + \gamma^3 V(s_{t+3})\right] + \dots\right) \\
&= -V(s_t) + (1-\lambda)\sum_{k=0}^{\infty}\lambda^k\left[\sum_{l=0}^{k}\gamma^l r_{t+l}\right] + (1-\lambda)\sum_{k=1}^{\infty}\lambda^{k-1}\left[\gamma^k V(s_{t+k})\right] \\
&= -V(s_t) + (1-\lambda)\sum_{k=0}^{\infty}\lambda^k\left[\sum_{l=0}^{k}\gamma^l r_{t+l} + \gamma^{k+1} V(s_{t+k+1})\right] \\
&= -V(s_t) + (1-\lambda)\sum_{k=0}^{\infty}\lambda^k R_t^{(k+1)} \\
&= -V(s_t) + (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1} R_t^{(n)} \\
&= -V(s_t) + R_t(\lambda) \\
&= R_t(\lambda) - V(s_t)
\end{aligned}$$

(In the third line the $-V(s_t)$ terms are collected using $(1-\lambda)(1 + \lambda + \lambda^2 + \dots) = 1$; in the fourth line the value terms are reindexed with $k \to k+1$ so they can be merged into the reward sum, which makes each bracket exactly the $(k+1)$-step return.)

QED.
Proof of proposition 1
Here I show the proof that if
$$\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$$

such that
$$\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$$

then $\hat{A}_t$ is $\gamma$-just [2]. This is based on [1] and the original paper [2], but I develop all the steps; if you find an error please let me know. Because $\hat{A}_t$ can be expressed as two terms, we examine each term independently.
By linearity of expectation we have
$$\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,\hat{A}_t(s_{0:\infty}, a_{0:\infty})\right] = \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q_t(s_{t:\infty}, a_{t:\infty})\right] - \mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,b_t(s_{0:t}, a_{0:t-1})\right]$$

For the first term,
$$\begin{aligned}
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q_t(s_{t:\infty}, a_{t:\infty})\right]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right]\right]
\end{aligned}$$

By assumption we have $\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$, therefore
$$= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q^{\pi,\gamma}(s_t, a_t)\right]$$

We know that $\mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,V^{\pi,\gamma}(s_t)\right] = 0$ because:
$$\begin{aligned}
\mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,V^{\pi,\gamma}(s_t)\right]
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\,V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{a_t}\left(\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\,V^{\pi,\gamma}(s_t)\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{a_t}\left(\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\left(\sum_{a_t}\pi_\theta(a_t|s_t)\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\left(\sum_{a_t}\nabla_\theta \pi_\theta(a_t|s_t)\right)V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\left(\nabla_\theta \sum_{a_t}\pi_\theta(a_t|s_t)\right)V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\left(\nabla_\theta 1\right)V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[0 \cdot V^{\pi,\gamma}(s_t)\right] = 0
\end{aligned}$$

Based on that, we can add zero to the equation we had:
$$\begin{aligned}
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q^{\pi,\gamma}(s_t, a_t)\right] - 0 \\
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,Q^{\pi,\gamma}(s_t, a_t)\right] - \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\left(Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,A^{\pi,\gamma}(s_t, a_t)\right]
\end{aligned}$$

The only thing left to do is to change the subscripts of the expected value. I reached out to John Schulman and Pieter Abbeel (two authors of the paper) to understand their exact reasoning for that step, but they did not respond to my questions. I am still curious about it; I can only point out that taking the expected value of the advantage function over $a_t$ would not make sense, because $\mathbb{E}_{a_t}\left[A(s_t, a_t)\right] = 0$.
For the second term
$$\begin{aligned}
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[\mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left[b_t(s_{0:t}, a_{0:t-1})\,\nabla_\theta \log \pi_\theta(a_t|s_t)\right]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\right]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t:\infty}}\left(\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left(\mathbb{E}_{a_t}\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left(\sum_{a_t}\pi_\theta(a_t|s_t)\frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left(\sum_{a_t}\nabla_\theta \pi_\theta(a_t|s_t)\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left(\nabla_\theta \sum_{a_t}\pi_\theta(a_t|s_t)\right)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1})\,\mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}\left(\nabla_\theta 1\right)\right]
\end{aligned}$$

Because the gradient of a constant is zero:
$$= \mathbb{E}_{s_{0:t}, a_{0:t-1}}\left[b_t(s_{0:t}, a_{0:t-1}) \cdot 0\right] = 0$$

thus
$$\mathbb{E}_{s_{0:\infty}, a_{0:\infty}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,\hat{A}_t(s_{0:\infty}, a_{0:\infty})\right] = \mathbb{E}_{s_{0:t}, a_{0:t}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\,A^{\pi,\gamma}(s_t, a_t)\right]$$

QED.
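Both terms of the proof rely on the same identity, $\sum_{a_t}\nabla_\theta \pi_\theta(a_t|s_t) = \nabla_\theta 1 = 0$. As a small numerical illustration (my own, with a hypothetical 5-action softmax policy for a single state):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=5)                  # logits, one per action
pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy pi_theta(.|s)

# grad_theta log pi_theta(a|s) = e_a - pi for a softmax over logits theta;
# row a of the matrix below holds that gradient.
grad_log_pi = np.eye(5) - pi

# E_{a ~ pi}[grad log pi(a|s)] = sum_a pi(a|s) * grad log pi(a|s) = 0
assert np.allclose(pi @ grad_log_pi, 0.0)
```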
- [1] Woodcock, M. (2019). Hierarchical Reinforcement Learning for Robustness, Performance and Explainability. Master's thesis, University of Edinburgh (under review).
- [2] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
- [3] Peng, X. B., Abbeel, P., Levine, S., & van de Panne, M. (2018). DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. ACM Transactions on Graphics (TOG), 37(4), 143.
- [4] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, second edition.