Generalized Advantage Estimator Explained

I have notice that the low level explanaition of the Generalized Advantage Estimator lacks the mathematical explanation of how to derive it in the original paper. More over the paper lacks some steps that are interesting to know. I will try to explain the low level derivation and give a quick explanaition of it. Most of this explanaition is taken from my Master thesis 1.


High-level explanation

The main idea of Generalized Advantage Estimator (GAE) is to produce an estimator with significant lower variance at the cost of adding some bias. This estimator can be used to update the parameters and thanks to its lower variance, it improves learning.

The main form of the GAE that is presented in the paper2, is:

ˆAGAE(γ,λ)t=l=0(γλ)lδVt+l^AGAE(γ,λ)t=l=0(γλ)lδVt+l

with δVi=V(si)+ri+γV(si+1)δVi=V(si)+ri+γV(si+1), V(st)V(st) is the value function, riri is the reward in the ii-th time step and γ,λγ,λ are the discount factors (using the GAE we have to parameters to control the discount). But personaly, I think it is easier to understand using the following form (I will proof how to derive this later on in this post) 13.

ˆAGAE(γ,λ)t=Rt(λ)V(st)^AGAE(γ,λ)t=Rt(λ)V(st)

This last equation looks much more simple and in deed it is. Rt(λ)=(1λ)Tn=0λnR(n)tRt(λ)=(1λ)Tn=0λnR(n)t is the λλ-return and R(n)t=n1l=0γlrt+l+γnV(st+1)R(n)t=n1l=0γlrt+l+γnV(st+1) is the nn step return. Using the nn step return allows us to trade between variance and bias. The nn step return has lower variance when nn is smaller because utilizes the reward riri fewer times (n1n1 times) but also discounts fewer times V(s)V(s) thus it has higher bias (Sutton 4 dive deeper into this). For instance, when n=n= we have R()t=rt+γrt+1+γ2rt+2+...R()t=rt+γrt+1+γ2rt+2+..., therefore we do not have bias but the variance is high. when n=0n=0 we do not have variance but we have very high bias.

Rt(λ)Rt(λ) uses all the nn step returns, so we cannot use the nn parameter but varying λλ we can mange the trade-off between bias and variance. λλ works as a discount factor, with lower λλ we discount more the higher nn returns, therefore the effect of having high variance reduces and the effect of low bias also reduces. The opposite happens for a higher λλ.

Derivation of an unbiased estimator

Usually, the GAE is used for the policy gradients. The usual form of the gradient is

g=E[t=0ψtθlogπθ(at|st)]

where ψt is the estimator. There are different estimators for example rt,rtb(st) or Qπ(st,at). But using ψt=Aπ(at,st) gives the lowest possible varience. Schulman et al. work under the objective function of maximizing the returns not the discounted returns. Therefore they introduce the discount factor γ as a method to reduce variance not as a discount factor. Including the variance reduction γ we can write the gradients as

gγ=Es0:,a0:[t=0Aπ,γ(st,at)θlogπθ(at|st)]

With Aγ,π as the advantage using the discounted return. Next we proof that an advantage function using γ does not introduce any bias. Let’s define a γ-just estimator. ˆA is γ-just if it does not add bias, in our case Aπ,γ just adds the effect of variance reductions due to γ. The formal definition is that ˆAt is γ-just if

Es0:a0:[t=0ˆAt(s0:,a0:)θlogπθ(at|st)]=gγ

Schulman et al. proof that if ˆAt(s0:,a0:)=Qt(st:,at:)b(s0:t,a0:t1) such that Est+1:,at+1:st,sa[Qt(st:,at:)]=Qπ,γ(st,at) then ˆAt is γ-just, this is the proposition 1 (see proof below).

Derivation of the functional form of the GAE

Here I show how to derive the first expression of the GAE as Schulman et al. do it in the paper.

We difine δVt=rt+γV(st+1)V(st), which can be considered an estimator of the advantage (Qγ=rt+γV) and based on it we define ˆA(k)t as:

ˆA(1)t=δVt=V(st)+rt+γV(st+1)ˆA(2)t=δVt+γδVt+1=V(st)+rt+γV(st+1)γV(st+1)+γrt+1+γ2V(st+2)=V(st)+rt+γrt+1+γ2V(st+2)ˆA(3)t=δVt+γδVt+1+γ2δVt+1=V(st)+rt+γrt+1+γ2rt+2+γ3V(st+3)ˆA(k)t=k1l=0γlδVt+l=V(st)+k1l=0γlrt+l+γkV(st+k)

Using the definition of ˆA(k)t we define the Generalized Advantage Estimator ˆAGAE(γ,λ)t as

ˆAGAE(γ,λ)t=(1λ)(ˆA(1)t+λˆA(2)t+λ2ˆA(3)t+...)=(1λ)(δVt+λ(δVt+γδVt+1)+λ2(δVt+γδVt+1+γ2δVt+2)+...)=(1λ)(δVt(1+λ+λ2+...)+λγδVt+1(1+λ+λ2+...)+...)=(1λ)(δVt(11λ)+λγδVt+1(11λ)+...)=l=0(γλ)lδVt+l

QED.

Simplification of functional form of the GAE

Here I show how to express

l=0(γλ)lδVt+l

as

Rt(λ)V(st)

This derivation is originally in my master thesis, appendix C1.

From the section above, we can express the GAE as

ˆAGAE(γ,λ)t=(1λ)(ˆA(1)t+λˆA(2)t+λ2ˆA(3)t+...)

From this definition we have

ˆAGAE(γ,λ)t=(1λ)(ˆA(1)t+λˆA(2)t+λ2ˆA(3)t+...)=(1λ)([V(st)+0l=0γlrt+l+γ1V(st+1)]+λ[V(st)+1l=0γlrt+l+γ2V(st+2)]+λ2[V(st)+2l=0γlrt+l+γ3V(st+3)]+...)=V(st)+(1λ)k=0λk[kl=0γlrt+l]+(1λ)k=1λk1[γkV(st+k)]=V(st)+(1λ)k=0λk[kl=0γlrt+l+γk+1V(st+k+1)]=V(st)+(1λ)k=0λkR(k)t=V(st)+Rt(λ)=Rt(λ)V(st)

QED.

Proof of proposition 1

Here I show the proof that if

ˆAt(s0:,a0:)=Qt(st:,at:)b(s0:t,a0:t1)

such that

Est+1:,at+1:|st,sa[Qt(st:,at:)]=Qπ,γ(st,at)

then ˆAt is γ-just2. This is based on 1 and the original paper 2 but I develop all the steps, if you find some error please let me know. Because ˆAt can be expressed in two terms we examine each independently.

By linearity of expectation we have

Es0:a0:[θlogπθ(at|st)ˆAt(s0:,a0:)]=Es0:a0:[θlogπθ(at|st)Qt(s0:,a0:)]Es0:a0:[θlogπθ(at|st)bt(s0:t,a0:t1)]

for the first term

=Es0:ta0:t[Est+1:at+1:[θlogπθ(at|st)Qt(s0:,a0:)]]=Es0:ta0:t[θlogπθ(at|st)Est+1:at+1:[Qt(s0:,a0:)]]

by definition we have that Est+1:,at+1:st,sa[Qt(st:,at:)]=Qπ,γ(st,at) therefore

=Es0:t,a0:t[θlogπθ(at|st)Qπ,γ(st,at)]

We know that Es0:t,a0:t[θlogπθ(at,st)Vπ,γ(st)]=0 because:

Es0:t,a0:t[θlogπθ(at|st)Vπ,γ(st)]=Es0:t,a0:t[θπθ(at|st)πθ(at|st)Vπ,γ(st)]=Es0:t,a0:t1[Eat(θπθ(at|st)πθ(at|st)Vπ,γ(st))]=Es0:t,a0:t1[Eat(θπθ(at|st)πθ(at|st))Vπ,γ(st)]=Es0:t,a0:t1[(atπθ(at|st)θπθ(at|st)πθ(at|st))Vπ,γ(st)]=Es0:t,a0:t1[(atθπθ(at|st))Vπ,γ(st)]=Es0:t,a0:t1[(θatπθ(at|st))Vπ,γ(st)]=Es0:t,a0:t1[(θ1)Vπ,γ(st)]=Es0:t,a0:t1[0Vπ,γ(st)]=0

Based on that we can sum zero to the equation we had

=Es0:t,a0:t[θlogπθ(at|st)Qπ,γ(st,at)]0=Es0:t,a0:t[θlogπθ(at|st)Qπ,γ(st,at)]Es0:t,a0:t[θlogπθ(at,st)Vπ,γ(st)]=Es0:t,a0:t[θlogπθ(at|st)(Qπ,γ(st,at)Vπ,γ(st))]=Es0:t,a0:t[θlogπθ(at|st)Aπ,γ(st,at)]

The only thing left to do is to change the subscripts of the expected value. I reach to John Schulman and Pieter Abbeel (two authors of the paper) to understand their exact reasoning to do that but they did not repond to any questions. I am still curious about this, I can point out that taking the expected value over at of the advantage function does not make sense, because EatA(st,at)=0.

For the second term

=Es0:ta0:t1[Est+1:at:[bt(s0:t,a0:t1)θlogπθ(at|st)]]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at:[θlogπθ(at|st)]]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at:(θπθ(at|st)πθ(at|st))]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at+1:(Eatθπθ(at|st)πθ(at|st))]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at+1:(atπθ(at|st)θπθ(at|st)πθ(at|st))]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at+1:(atθπθ(at|st))]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at+1:(θatπθ(at|st))]=Es0:ta0:t1[bt(s0:t,a0:t1)Est+1:at+1:(θ1)]

Because the gradient of a constant is zero then:

=Es0:ta0:t1[bt(s0:t,a0:t1)0]=0

thus

Es0:a0:[θlogπθ(at|st)At(s0:,a0:)]=Es0:ta0:t1[θlogπθ(at|st)Aπ(st,at)]

QED.


  1. Woodcock, M. (2019) Hierarchical Reinforcement Learning for Robustness, Performance and Explainability. University of Edinburgh (under review).  2 3 4

  2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.1   2 3

  3. Peng, X. B., Abbeel, P., Levine, S., & van de Panne, M. (2018). Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4), 143.Chicago. 

  4. Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction.2017. MIT press, second edition, 2018.