Reinforcement Learning Part 8: Temporal-Difference (TD) Learning

Temporal Difference (TD) learning is one of the foundational concepts in reinforcement learning (RL). It combines the notion of updating estimates before the final outcome is known, similar to how dynamic programming (DP) works, with the notion of learning directly from experience, like we saw with the Monte Carlo (MC) methods in part 7.

MC methods required the agent to fully complete an episode in order to update the state- or action-value estimates. With TD learning, we are able to update these value predictions in the middle of an episode, which opens up the possibility of learning on infinitely continuing episodes.

DP required a full model of the system (e.g. transition probabilities) in order to compute the discounted state- and action-values. TD learning, on the other hand, is model-free. It combines the best of both MC and DP methods, and it is the cornerstone to many algorithms we’ll see later in the series, including SARSA, eligibility traces, Q-learning, and PPO.

Incremental Updates

If you recall from our MC algorithms, we waited for an episode to complete and then worked backward through the steps in order to update our q(s,a) estimates for the state s and action a taken at each timestep. We kept a running list of the discounted returns (G) for each of those state-action pairs. We would average the list of returns for each state-action pair to update our Q(s,a) estimate:

\(Q(S_t, A_t) \leftarrow \text{average}(\text{Returns}(S_t, A_t))\)

Another way to compute this exact same update is to get the return at the given timestep t and use it to incrementally update the Q(S_t, A_t) (at that timestep t):

\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N}(G_t – Q(S_t, A_t))\)

Where N is the number of times the state-action pair (s,a) has been visited.

While we did not specifically cover the state-value versions for MC methods, you can rest assured that there is a similar update for v(s_t):

\(V(S_t) \leftarrow V(S_t) + \frac{1}{N}(G_t – V(S_t))\)

Where N is the number of times the state s has been visited. MC methods usually rely on Q(s,a), as it is easier to update the policy once those estimates are known, rather than trying to look ahead to future states v(s’) to choose an action. By using q(s,a) instead of v(s) in MC methods, we can keep the updates model-free.

We can use an arbitrary learning rate α instead of the inverse of the number of visits (1/N) as our coefficient for the update:

\(V(S_t) \leftarrow V(S_t) + \alpha (G_t – V(S_t))\)

Where α is a hyperparameter (chosen by the designer) with a value between 0 and 1. This allows us to control how much each new return shifts our current estimate of V(s). An α close to 0 means that we update v(s) very slowly (the estimate barely changes at each update) whereas an α close to 1 means that the updates to v(s) are large.

While we’ve been talking about using G for MC-style updates where we must wait for the end of the episode to record real returns, we can actually perform updates mid-episode using a technique called bootstrapping.

Bootstrapping

Let’s start with an example to help give some intuition before we show the math. Imagine that each morning, exactly at 7am, you walk outside (coffee in hand) and examine the weather. Right now, it’s Monday morning, and you’re trying to make plans for the upcoming weekend: will the weather hold up for some fun activities outdoors? You have a vague notion of what the weather should be for the weekend during that time of year: about a 50% chance of rain (this is analogous to your initial v(s) estimate before any observations). So, that’s where you start with your prediction.

On this particular Monday morning, the skies are dark with clouds. Knowing that you still have 5 more days to go, you figure that this slightly affects the chance of rain for the weekend, so you update your prediction to 55% chance of rain on the weekend. As the rest of the week rolls on, you note a marked shift in the weather at 7am each day: Tuesday and Wednesday show blue, sunny skies. As a result, you start to shift your prediction to less chance of rain.

As you get closer to the weekend, your morning observations have a greater effect on the prediction. By Thursday, you’ve observed the weather enough to start having more confidence that the weekend will only have about a 35% chance of rain. Better start getting your hiking gear ready!

This is bootstrapping: using observations to update your estimates about future outcomes. When it comes to RL, we use the actual reward received to update our estimates of v(s) for any states visited during a live episode. This allows us to update the estimates at any time during the episode (including after each step), rather than waiting for the episode to end (as we do in MC methods).

To make this work for RL, let’s first recall from part 4 that a full return G_t (from a given timestep t) is the discounted rewards from that step onward (for the rest of the episode):

\(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots\)

To fully calculate this return, you would need to know all the future rewards, which means waiting until the episode ends (as we do with MC methods). We can rewrite G_t by pulling out just the first term:

\(G_t = R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}\)

Note that term in the parentheses is just G_t+1, which is the discounted future return from the next timestep onward. If we’re in the middle of an episode, we don’t know what that term is exactly (as we have not completed the episode), but we can substitute in the estimated value of the next state:

\(G_t \approx R_{t+1} + \gamma V(S_{t+1})\)

This is known as the bootstrapped target. We use our current estimate of the next state’s value as a substitute for the true future return. Just like checking the weather on Monday, Tuesday, etc. to update our weekend prediction, we’re using what we know right now (R_t+1 and the estimated V(S_t+1)) rather than waiting for the full outcome.

Keep in mind that R_t+1 is the reward we receive after we take an action in the current state. That means we use the bootstrapped target to update the state we just left. Here is the sequence of events to make this possible:

At timestep t, the agent in state S_t chooses and takes action A_t
At timestep t+1, the environment returns reward R_t+1 and moves agent into state S_t+1
The agent can now retroactively update V(S_t) to estimate the value of the state it just left

We can start with our incremental state value update formula that we derived in the previous section:

\(V(S_t) \leftarrow V(S_t) + \alpha (G_t – V(S_t))\)

We then substitute our bootstrap target in for G_t:

\(V(S_t) \leftarrow V(S_t) + \alpha(R_{t+1} + \gamma V(S_{t+1}) – V(S_t))
\)

We update our estimate of the state we just left (V(S_t)) by taking our current estimate of V(S_t) and adding the difference between the reward we just received plus the discounted estimated return of the next state and our current estimate of V(S_t) (multiplied by some learning rate α).

Recall from part 4 that the definition of the value of a given state (v_π(s)) is given as the expected value for the discounted return from that state, with actions taken according to policy π:

\(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]\)

Rather than calculate a weighted sum over all possible actions and transitions (as we do for the Bellman equation for v_π), TD learning approximates this expectation with one sample at a time. We pull out a single reward (the reward received going from S_t to S_t+1) and use the actual value to update our estimate of the expected value:

\(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s]\)

Over many, many runs, episodes and visits to each state, the bootstrapped estimation of v(s) should converge to the true state value as determined by the Bellman equation. This is similar to our MC approach, but we don’t have to wait for the episode to end to get there.

Take special note of the term inside the parentheses (the part multiplied by α). This is known as the TD error, and we’ll give it special attention in the next section.

TD Error

The temporal-difference error (TD error) is the difference between the bootstrapped target and our current estimate of the state we just left (V(S_t)). It is given by the delta symbol, often with the subscript t to denote the difference at that particular timestep:

\(\delta_t = R_{t+1} + \gamma V(S_{t+1}) – V(S_t)\)

The TD error (δ_t) gives us a measure of how off the updated prediction was from our original estimate of the previous state value (V(S_t)). Put another way, the TD error gives a measurement of how “surprised” the agent was by what just happened:

δ_t > 0: the reward was higher than expected, so we increase our estimate of V(S_t)
δ_t = 0: our estimate of V(S_t) was correct, so no update needed
δ_t < 0: the reward was lower than expected, so we decrease our estimate of V(S_t)

We can rewrite the update rule to become:

\(V(S_t) \leftarrow V(S_t) + \alpha \delta_t\)

Note that the TD error requires an estimation of the next state’s value V(S_t+1). In addition to the reward changing our estimate of V(S_t), the estimated V(S_t+1) can also change our estimate of V(S_t) in subsequent visits.

TD learning has an interesting parallel in neuroscience: dopamine neurons act as a biological implementation of TD learning. These are the nerve cells responsible for releasing dopamine, the neurotransmitter associated with learning rewards and motivation. They fire (release dopamine) regularly, at a low, steady rate, in the absence of any reward. When a reward is better than expected (δ_t > 0), they fire at a rate above their baseline, releasing more dopamine into the brain. When a reward is worse than expected (δ_t < 0), they fire at a lower rate, suppressing the amount of dopamine we might be used to.

I recommend checking out this blog post by the Google DeepMind team to learn more about this fascinating connection between RL and biology: Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI.

TD error forms a foundational concept for many RL algorithms going forward. It allows us to update our estimates in real-time as well as keep track of how our policy evaluation estimation is progressing.

Let’s take a moment to talk about two algorithms, TD(0) and SARSA, that use the TD error to evaluate and update their policies. Both are great for seeing how TD learning works in practice, but modern RL algorithms (like Q-learning and PPO, which we’ll cover in future posts) build on TD learning in ways that make up for their limitations.

TD(0) Algorithm

TD(0), also called “one-step TD,” is the simplest application of TD learning. It estimates V(S)for a given policy by updating it after every single step:

Initialize:
    V(s) = 0 for all s ∈ S
    α: learning rate (e.g. 0.1)
    π: fixed policy to evaluate

Loop forever (for each episode):
    Initialize S₀
    Loop for each step t = 0, 1, 2, ...:
        Choose A_t using policy π
        Take action A_t, observe R_t+1 and S_t+1
        V(S_t) ← V(S_t) + α(R_t+1 + γV(S_t+1) - V(S_t))
        S_t ← S_t+1
        If S_t+1 is terminal, end episode

Keep in mind that terminal states have no value, as the agent cannot take an action from a terminal state and therefore cannot receive a reward. Since V(S_T) = 0, the last update simplifies to V(S_T-1) ← V(S_T-1) + α(R_T – V(S_T-1)).

TD(0) is on-policy, which means it learns directly from experience under whatever policy π is being followed. It is also model-free, as the transition probabilities are not required.

Because TD(0) estimates v(s) and not q(s,a), you cannot simply do greedy policy improvement from v(s) without knowing the transition probabilities. So, while it is technically model-free, it is only useful for policy evaluation but not direct policy improvement. To use TD learning for policy improvement, we need to estimate q(s,a) instead, which is what SARSA does.

SARSA

The name for SARSA comes from looking at the first letters involved in a rollout or trajectory: State, Action, Reward, next State, next Action. It calculates the TD error for q(s,a) instead of v(s), which allows us to update the policy while evaluating it at the same time.

The update for a particular Q(S,A) is similar to the one we saw for V(S) in TD(0):

\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) – Q(S_t, A_t))\)

The TD error for SARSA is therefore:

\(\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) – Q(S_t, A_t)\)

The formal algorithm for SARSA then becomes:

Initialize:
    Q(s,a) = 0 for all s ∈ S, a ∈ A
    α: learning rate (e.g. 0.1)
    ε: small positive number (e.g. 0.1)

Loop forever (for each episode):
    Initialize S₀
    Choose A₀ using ε-greedy policy from Q
    Loop for each step t = 0, 1, 2, ...:
        Take action A_t, observe R_t+1 and S_t+1
        Choose A_t+1 using ε-greedy policy from Q
        Q(S_t, A_t) ← Q(S_t, A_t) + α(R_t+1 + γQ(S_t+1, A_t+1) - Q(S_t, A_t))
        S_t ← S_t+1, A_t ← A_t+1
        If S_t+1 is terminal, end episode

Note that when S_t+1 is terminal, we cannot take an action from there, so Q(S_t+1, A_t+1) = 0. As a result, the last update simplifies to Q(S_t, A_t) ← Q(S_t, A_t) + α(R_T – Q(S_T-1, A_T-1)).

In order to get the Q value for the next state-action pair, the algorithm needs to know what that next action A_t+1 will be. This next action is chosen using the same epsilon-greedy policy that’s currently being improved. As a result, like TD(0), SARSA is considered to be on-policy.

SARSA falls under generalized policy iteration (GPI) that we talked about in part 7. We update our evaluation estimate (of Q values) after each step, and the policy is implicitly improved by acting greedily with respect to those estimates.

There are a couple of limitations to keep in mind. First, the policy being used to generate the experience must be the same policy being improved (on-policy), which can be less sample efficient than some off-policy agents, as the agent can’t learn from experience generated by a different policy (e.g. one that encourages more exploration). Second, like DP and MC, SARSA stores Q values for every state-action pair, which requires keeping a table for these pairs. This quickly becomes intractable or even impossible for large or continuous state spaces.

We will examine several algorithms in future posts that address these limitations.

Conclusion

TD learning, as we saw, is an important, foundational concept in RL. It provides the backbone for most modern RL algorithms, as it allows us to update our state- or action-value estimates on the fly (i.e. in the middle of an episode, potentially at every step). We no longer need to wait for an episode to finish in order to update those episodes.

We looked at two algorithms that directly use the TD error to update value estimates: TD(0) updates state value estimates (v(s)) at every step, and SARSA updates action value estimates (q(s,a)) at every step. Both are on-policy (they collect experience using the same policy as the one being updated) and model-free (they do not require knowledge of the transition dynamics). However, both require keeping potentially intractable tables, do not work with continuous state spaces, and can be limited by the on-policy nature of the updates.

In the next post, we will look at TD(λ), which uses a weighted combination of n-step returns to interpolate between TD(0) and full MC returns. TD learning has lower variance than MC in most cases (since updates use only one step of randomness rather than a full episode) but higher bias (since the bootstrapped target is only an estimate).

We also look at how eligibility traces give credit (or blame) to specific states for a given TD error. Together, these help build part of the generalized advantage estimator (GAE) used in PPO, which we’ll cover toward the end of the series.

If you have any questions or suggestions, please feel free to leave a comment!