Reinforcement Learning Part 9: TD(λ) and Eligibility Traces

In the previous post, we saw how temporal difference (TD) learning updated value predictions in the middle of an episode rather than waiting to the very end, like we do with Monte Carlo (MC) methods. If you recall, the TD(0) algorithm updates value estimates after every step using a single reward in order to bootstrap (using observations to update estimates about future outcomes).

In general, TD learning has more bias (compared to MC methods), as the estimate of V(S_t) relies on the estimate of V(S_t+1), which means V(S_t) could get pulled toward an incorrect target (and it often does early in the training sequence). MC methods do not have any bias, as it uses the true returns G_t to compute the expected returns.

However, MC methods have high variance due to G_t being the sum of many random rewards across an episode. Different episodes can produce wildly different G_t values. The more steps in the episode (and number of possible trajectories), the more source of randomness, which increases the variance. TD learning has generally lower variance, as the bootstrapped target (R_t+1 + γV(S_t+1)) involves only one random value: R_t+1. V(S_t+1) is a running average of past experiences, so it’s much more stable than the full G_t returns.

TD(λ) is a reinforcement learning (RL) algorithm that attempts to blend TD(0) and MC methods in order to balance bias and variance. In the rest of this post, we’ll see how TD(λ) uses eligibility traces to perform this balance, which provides the foundation for the generalized advantage estimator (GAE) used in later algorithms, like PPO.

N-Step Returns

In TD(0), we update the V(S_t) estimate after every step. Instead, what if we waited n steps in order to update the value estimates? Let’s start by generalizing the idea of an n step discounted return:

\(G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\)

Let’s see what happens at different n values:

n=1: G_t¹ = R_t+1 + γV(S_t+1), which looks at only 1 reward before relying on the V(S_t+1) estimate. This is the return target used in the TD(0) algorithm.
n=2: G_t² = R_t+1 + γR_t+2 + γ²V(S_t+2), waits for 2 actual rewards before bootstrapping with the V(S_t+2) estimate
n=∞: G_t^∞ = R_t+1 + γR_t+2 + γ²R_t+3 + …, sums all of the actual discounted rewards looking forward from timestep t, which is just the full MC return

With this in mind, we can adjust our TD update rule for V(S_t) to the more generalized:

\(V(S_t) \leftarrow V(S_t) + \alpha(G_t^{(n)} – V(S_t))\)

By adjusting the value of n, we can control how many actual rewards we need to get in order to update the value estimate at time t. This allows us to trade bias for variance and vice versa.

Higher n means more real rewards and less reliance on potentially wrong value estimates (less bias) but more random rewards summed together (more variance). This pushes us closer to a full MC update. On the other hand, lower n means less variance but more reliance on the bootstrapped estimate (more bias). We can update the value estimates for a previous state after only a few actual rewards.

λ-Weighted Returns

Instead of having to choose a single n value, what if we could take all of the possible return estimates into account and vary how much short-term actual rewards factor in versus long-term actual rewards?

Using our above definitions of G_t¹, G_t², …, G_t^∞, we can create a weighted sum of all those terms to produce the λ-return (also known as the TD(λ) target):

\(G_t^{\lambda} = (1-\lambda) \sum\limits_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1}G_{t}\)

This essentially provides a mixture of single-step bootstrapping, two-step bootstrapping, all the way through full MC calculations. We can vary λ to determine how much emphasis to put on short-term rewards versus waiting for the full actual returns (e.g. MC). With λ=0, we end up with just G_t¹ (the single-step TD(0) estimate). With λ=1, we end up with the full MC return.

Keep in mind that this formula is written for finite episodes only. It breaks down for infinitely long episodes, but it’s useful right now for our purposes. Also, for any value 0<λ<1, we have to wait for the full episode to end so that we can actually use the MC rewards in the blending process to produce G_t^λ at a given timestep t.

In practice, you’ll usually find λ set to something between 0.9 and 0.99, which puts most of the emphasis on the longer term MC values rather than the short-term TD targets.

Eligibility Traces

Recall from part 7 that MC often struggles with assigning credit: if a reward only arrives at the end of a long episode (think of a larger version of our 1D gridworld example where you only got a reward at the very end), then it could take many episodes before the signal worked its way back to influence early state (or action) value estimates. How do we assign credit to actions that potentially happen earlier and contribute meaningfully to the ultimate return?

Let’s consider our other example: cartpole. If the pole starts falling at timestep t, then there’s a good chance that the agent made a poor decision just before then (say, at timestep t-1 or t-2) that deserves more blame than before when the pole was balancing well (at e.g. timestep t-20).

Eligibility traces are a way to accumulate credit for past states based on how recently and how frequently a given state was visited. The basic idea is the eligibility trace accumulates value each time the state is visited, and slowly decreases over time when the state is not being visited. Formally, we can express this as:

\(e_t(s) = \begin{cases} e_{t-1}(s) + 1 & \text{if } s = S_t \\ \gamma\lambda \, e_{t-1}(s) & \text{otherwise} \end{cases}\)

Each state s maintains its own eligibility trace. So, at timestep t, 1 is added to the current state’s trace. All other states (at that timestep) decay by a factor of γ⋅λ (the discount factor times the value chosen for λ). This is known as accumulating traces. In other words, states visited recently and frequently have a high trace value (e(s)), and states that have not been visited recently have a trace near zero.

With eligibility traces defined, we can now modify the TD update rule to use them, distributing the TD error to all states at each timestep rather than just the current one. We modify our TD learning value estimate update rule to use an eligibility trace for each state:

\(V(s) \leftarrow V(s) + \alpha \delta_t e_t(s) \quad \text{for all } s \in S\)

This is called the TD(λ) update. Note that δ_t is the TD error we saw in part 8. We apply this update for all states at each timestep t.

TD(λ) Algorithm

Eligibility Traces can be found in a number of RL algorithms. One of the most popular is TD(λ), which performs a TD(λ) update for all states at every timestep in an attempt to evaluate a policy, π:

Initialize:
    V(s) = 0 for all s ∈ S
    α: learning rate (e.g. 0.1)
    λ: trace decay parameter (e.g. 0.9)

Loop forever (for each episode):
    Initialize S0
    e(s) ← 0 for all s ∈ S
    Loop for each step t = 0, 1, 2, ...:
        Choose At using policy π
        Take action At, observe Rt+1 and St+1
        δt ← Rt+1 + γV(St+1) - V(St)
        e(St) ← e(St) + 1
        For all s ∈ S:
            V(s) ← V(s) + α⋅δt⋅e(s)
            e(s) ← γ⋅λ⋅e(s)
        St ← St+1
        If St+1 is terminal, end episode

As with the TD(0) algorithm, we start with the value estimates for each state at 0. We also must choose a λ value as a hyperparameter. For each episode, we reset all eligibility traces to 0. Then, at each timestep t, we take an action, observe the reward R_t+1, and arrive at the new state S_t+1. We then use this information to update the TD error at that timestep. We also add 1 to the eligibility trace e(s) for that state. We then loop through all states (including the current one) and update their state value estimates using the update rule we showed earlier. We also decay the trace e(s) for each state. We then continue to the next state, and repeat until the episode terminates.

Like TD(0) and SARSA, TD(λ) requires keeping a table of estimated state values for every state. As a result, this can be intractable or impossible for large or continuous state spaces. It acts as a useful demonstration of how we can balance TD(0) and MC methods to estimate state values. In modern deep RL, eligibility traces are not used very often, but the idea is used in the generalized advantage estimator (GAE), which we will explore in a future post.

We won’t spend any more time on TD(λ), as other algorithms accomplish the same goal with more elegance and without the need for maintaining a state table. That being said, let’s take a look at the idea of looking forward versus backward in an episode.

Forward vs. Backward View

You just collected all the data from a rollout. You are working back through the trajectory in order to estimate the value for each state (or state-action pair, if you are working with something like every-visit MC). You are standing at the state for t=T-5 for this particular rollout. In order to estimate the state (or state-action pair), you need to look at all the rewards from T-5 to T-1 (just before the terminal state).

In other words, you estimate the state value by looking forward in time for the rest of the trajectory. This works for MC methods where you have already collected the entire rollout and have access to all rewards that followed the current state.

The eligibility trace updates in TD(λ) are backward looking: after each step, the current TD error is distributed backward to all previously visited states, weighted by their eligibility traces. States visited recently receive more of the credit or blame for the current TD error.

This is a key advantage of the backward view: rather than waiting for the full episode to compute the λ-return (as the forward view requires), the eligibility trace allows us to propagate TD errors backward in real time after every step. While TD(λ) is not widely used in modern deep RL, it gives us the key concepts of eligibility traces and the forward/backward view distinction that we’ll see in more modern algorithms.

TD learning (including TD(0) and SARSA) opens up the possibility of performing online learning, which means the value estimates or policy parameters can be updated as each new piece of experience arrives, even if the episode is still in progress. This stands in contrast to offline learning, which means that you have to wait to collect all possible experiences first (i.e. a full episode) before making any updates to the policy or value estimates.

A possible compromise is to use batch methods where you collect some fixed amount of timesteps (which might include episode terminations and resets), creating a rollout (data collected from this fixed window of timesteps) to be used in value estimate or policy updates. Such algorithms can use forward and backward-looking methods when examining data from a batch collected in this manner. We’ll see batching used for algorithms in future posts.

Conclusion

The TD(λ) algorithm provides us with an important stepping stone on our journey to deep RL, as it demonstrated the ability to combine forward-looking methods (e.g. MC) with backward-looking methods (e.g. TD updates). Recall that with λ=0, TD(λ) is just TD(0), and with λ=1, TD(λ) becomes a fully Monte Carlo algorithm.

In the next part, we will spend some time looking at the seminal Q-learning algorithm, which is considered one of the most important breakthroughs in the field of RL.

If you have any questions or suggestions, please feel free to leave a comment!