Reinforcement Learning Part 5: The Bellman Optimality Equations

Up until now, we’ve been focused on evaluating the environment or policy: how much reward can we expect to receive from a particular state or by taking a particular action. We still have not touched the “learning” part of reinforcement learning (RL). In other words, we want to find what is the best possible policy to maximize our total rewards.

In this post, we are going to show that optimal policies exist through the Bellman optimality equations. However, note that these equations are only numerically solvable for the most basic environments and agents. In future posts, we will examine various algorithms that attempt to approximate these optimality equations to arrive at a policy that either guarantees maximum returns or offers local maxima good enough for our needs.

What Does Optimal Mean?

Recall from the previous post that we established two different Bellman equations. One for calculating the expected return from a given state s:

\(v_{\pi}(s) = \sum\limits_{a} \pi(a|s) \sum\limits_{s’,r} p(s’, r | s, a) \left[ r + \gamma v_{\pi}(s’) \right]\)

And another for calculating the expected return after having taken a particular action a from a state s:

\(q_{\pi}(s,a) = \sum\limits_{s’,r} p(s’, r | s, a) \left[ r + \gamma \sum\limits_{a’} \pi(a’|s’) q_{\pi}(s’, a’) \right]\)

In both cases, we assumed some arbitrary policy π(a|s), which says to take action a from state s. This could be anything: choose a random action regardless of state, always choose left in a cartpole simulation, etc. Certainly, there has to be at least one policy that achieves the highest possible expected return from every possible state. This is known as the optimal policy. Formally, we express this as:

\(\pi^* = \arg\max\limits_{\pi} v_{\pi}(s) \quad \forall s \in \mathcal{S}\)

In other words, the optimal policy, π*, is the policy for which v(s) (the state-value function) is maximized under that policy for all possible states in the environment.

Note that an optimal policy always exists for any finite Markov Decision Process (MDP), which we covered in part 3. However, this policy might not be unique, which means there could be more than one policy that achieves the same optimal value.

Episodes can run indefinitely (such as our cartpole task), where there is no fixed time limit, and finding an optimal policy for such continuing tasks requires special consideration. The optimal policy is one that keeps the pole upright for as long as possible from any starting state (excluding the “fallen” terminal state). Any policy that does not accomplish this goal is considered to be “suboptimal.”

Optimal Value Functions

Knowing that an optimal policy exists for a given environment and reward function, then we say that some optimal state-value function must also exist. In other words, we can calculate the maximum discounted return from any state s by following the optimal policy π*. We denote this maximum (optimal) state-value function as v*(s), and it is defined as follows:

\(v^*(s) = \max\limits_{\pi} v_{\pi}(s)\)

This just tells us that you can expect the best possible (discounted) return from any given state s if you act perfectly (i.e. following the optimal policy π*) from that point onward. To give a slightly more concrete example, if you start from the “upright” position on our cartpole problem, v*(upright) gives you the maximum discounted return possible from that state (as it assumes the agent is following the optimal policy).

Similarly, we can also conclude that an optimal action-value function must also exist that gives the maximum discounted return after taking action a from state s:

\(q^*(s,a) = \max\limits_{\pi} q_{\pi}(s,a)\)

As with v*(s), q*(s,a) tells us that you can expect the best possible (discounted) return after the state-action pair (s,a) if you act perfectly according to the optimal policy π*. For cartpole, let’s say we take the “push left” action from the “upright” position. q*(upright, push left) then gives us the maximum possible discounted reward from that point on (assuming we follow the optimal policy).

Note that if you know q*(s,a), you can derive the optimal policy directly by simply choosing the action that maximizes q*.

\(\pi^*(s) = \arg\max\limits_{a} q^*(s,a)\)

Recall that the policy π(a|s) gives the probability of choosing action a given state s. Once we’ve chosen an optimal policy, that probability goes to 1. In other words, the optimal policy is deterministic, so π*(s) just gives us the action to choose from state s.

This definition of the optimal policy is quite powerful: if we know q*, we can figure out which action will maximize that value. This forms the basis of Q-learning, which we’ll cover in a future post. To find v*(s), we take the Bellman expectation equation from post 4 and replace the policy-weighted sum over actions with choosing the action that maximizes the discounted future return. The optimal policy will always choose the best action rather than averaging over possibilities.

The Bellman Optimality Equation for v*

Using the Bellman equation for v_π, we can figure out the optimal value from that state given the optimal policy (rather than an arbitrary one). We established that the optimal policy π* is deterministic, so we no longer need to sum over all possible actions. Rather, we look at all the possible actions and choose the one that maximizes the future discounted reward from that state. Mathematically, this looks like the following:

\(v^*(s) = \max\limits_{a} \sum\limits_{s’,r} p(s’, r | s, a) \left[ r + \gamma v^*(s’) \right]\)

Just like our Bellman equation for v_π, this formula is also recursive. It requires knowing the best action (max_a) from the next state (s’), which we express as v*(s’).

With our cartpole example, let’s say we are in the “lean left” state. To find the optimal state value, we just need to know which action (from that state) gives us the best value. Can you guess how we might do that? We calculated some example q values in part 4, so we’ll use those.

By looking at the action-value example in part 4, we came up with the following q values from the “lean left” state:

q(lean left, push left) = 3.41
q(lean left, push right) = ? (unknown, as we assume the policy never chose this option from this state and we never sampled any experience from that state-action pair)
q(lean left, hold) = 1.98

All we need to do is choose the action that gives the next best reward and expected future discounted return. In other words, q(lean left, push left) is the highest value that meets that criteria.

Note that these q values were calculated under our example policy from part 4, not the optimal policy. However, the principle still holds: under the optimal policy, you would calculate q*(s,a) for all actions and choose the one with the highest value. The mechanism is the same; we just need the right q values. This can help lead us to developing or solving for the optimal policy π*, and the Bellman optimality equation for v_π will still hold: the optimal policy will choose the action that maximizes q(s,a) from that state.

The Bellman Optimality Equation for q*

As we did for v*, can derive the Bellman optimality equation for q*(s,a):

\(q^*(s,a) = \sum\limits_{s’,r} p(s’, r | s, a) \left[ r + \gamma \max\limits_{a’} q^*(s’, a’) \right]\)

Recall that q(s,a) assumes that we’ve already taken action a from state s. So, we just need to sum over all possible rewards and transitions to the next state s’, weighted by their respective probabilities. However, once we reach state s’, we then should choose the following action (a’) that maximizes future discounted rewards. Once again, this choice is governed by the optimal policy π*, which will deterministically choose a’ (from state s’) that maximizes q(s’,a’).

As with the Bellman equation for v_π, this optimality equation is also recursive. While that might seem pointless, it provides the mathematical basis for showing how the optimal policy will choose the action (from any state) that will maximize the total future discounted return. We rely heavily on the notions of state and action values along with the definition of an optimal policy to construct the RL algorithms in future posts.

Here’s another way to think about this: as v*(s) is essentially choosing the action from state s that maximizes the future discounted return (i.e. q(s,a)), we can substitute in that definition to write q*(s,a) in terms of v*(s):

\(q^*(s,a) = \sum\limits_{s’,r} p(s’, r | s, a) \left[ r + \gamma v^*(s’) \right]\)

We can simplify this even further:

\(v^*(s) = \max\limits_{a} q^*(s,a)\)

This is the same relationship between q and v that we saw in post 4, just with the optimal versions.

Solving the Bellman Optimality Equations

It’s important to note that the Bellman optimality equations look like something we could just plug into some kind of linear system solver. For trivial systems where the dynamics are known, this might be possible. However, this breaks down for a few reasons in most real-world RL systems.

First, the state space is massive or unknown. Even a simple robot might have hundreds of continuous state dimensions, making computation nearly impossible. Second, we don’t know the transition probabilities p(s’|s,a). These environment dynamics must be learned or estimated from experience. Finally, even if the transition probabilities are known, the computational cost of iterating over all states and actions grows to intractable levels quickly.

In future posts, we’ll look at a few of the most popular or foundational RL algorithms. Rather than solving the Bellman optimality equations directly, they often attempt to approximate the solutions using experience sampled from the environment. Some of the concepts we’ll cover in upcoming posts include:

Monte Carlo methods estimate returns by averaging over complete episodes
TD learning updates value estimates after each step using a “bootstrapped” estimate, which means using your current estimate of future value to update your current estimate (rather than waiting for the actual return)
Q-learning uses TD concepts to approximate q*
Policy gradient methods optimize the policy directly without estimating value functions

As you can see, we rarely attempt to solve the Bellman optimality equations directly. With most RL approaches, we try to estimate or work around these equations to ultimately achieve the goal of finding the optimal policy (or at least a policy good enough to meet our needs).

Conclusion

We defined the Bellman optimality equations for v(s) and q(s,a), which we denote as v*(s) and q*(s,a). With these equations, we are essentially trying to find the value from a particular state (v*(s)) or after having taken action (q*(s,a)) given the optimal policy π*. Stated another way, the optimal policy is the one that chooses the action that maximizes q*(s,a) from state s.

While we showed the existence of an optimal policy for any given MDP, it’s important to understand that we rarely solve for this policy or the associated Bellman optimality equations directly. Rather, we will attempt to approximate or work around these often intractable formulas as we cover a handful of RL algorithms in future posts.

We’ll start by discussing Monte Carlo methods in the next post. If you have any questions or are confused by anything covered, please leave a comment!