In the previous post, we saw how dynamic programming (DP) could be used to solve the Bellman equations, but they required knowledge of the environment’s transition probabilities. Unfortunately, we do not have that luxury in most real-world reinforcement learning (RL) problems.
In DP, we have a model of the environment, which means that the transition dynamics are either known or can be learned. We can then use this model to evaluate and iterate on the policy. Another class of RL algorithms use model-free algorithms, which means that the agent does not have access to those transition dynamics. Instead, it learns by interacting with the environment and observing what happens. Over many iterations, it is able to estimate the value functions (v(s)) from the experience (rather than mathematical knowledge about the environment).
In this post, we are going to examine how Monte Carlo (MC) algorithms can be used to solve RL problems. These algorithms are considered model-free, as they learn directly from sampled experience without requiring knowledge of the environment’s transition probabilities.
Learning From Experience
In part 4, we defined vπ(s), which is the expected, discounted return from state s under policy π. Rather than calculating this manually (which assumes we have a model of the environment), we can estimate this value by running many episodes and averaging the observed returns from state s.
Monte Carlo simulation (developed by mathematician Stanisław Ulam) involves randomly sampling from a random event or process and averaging those results in order to obtain an estimate of the results or values.
For example, say you wanted to estimate the value of rolling a six-sided die (with faces numbered 1 to 6). A fair die should have an expected value of (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. Instead of computing this value numerically, you could estimate the value using a Monte Carlo simulation (sometimes called an experiment) by physically rolling a die many times and averaging the result. Let’s say we roll a die 5 times:
- Roll 1: 4
- Roll 2: 2
- Roll 3: 6
- Roll 4: 1
- Roll 5: 3
We then average the result: (4 + 2 + 6 + 1 + 3) / 5 = 3.2. This is close to 3.5, but not exactly equal. The law of large numbers states that as the number of independent samples grows, the sample average converges to the true expected value regardless of the underlying probability distribution. Assuming our die is fair, we will eventually converge on the true expected value (3.5) after rolling the die many times (e.g. 10,000+ times).
We can apply this method to RL by estimating the expected return from any state by running many episodes and using the actual returns. As we just showed with the die example, if we run enough episodes, we should converge to Eπ[Gt | St = s] = vπ(s) for each state s. This helps us evaluate a policy without needing a model of the environment (i.e. knowledge of the actual transition probabilities).
The downside is that we need to run actual experiments with the agent and environment. We can’t just compute the optimal policy as if we had known the transition dynamics and rewards. These experiments must run to completion (e.g. terminal states) so that we can calculate a full return (Gt) from each state. Complex environments with many states and transitions will require more experiments.
In RL literature, you’ll often see the term rollout used to mean running experiments in an environment under a particular policy in order to generate a trajectory for the purpose of collecting data (e.g. rewards). Note that “rollout” can sometimes be used to mean a batch of trajectories collected together rather than just one. The rollout data is then used to estimate returns and update a policy.
Note that while we define MC in terms of estimating vπ(s), in practice we usually estimate qπ(s,a) instead, which is the action-value function we covered in part 4. Recall that in part 6, we talked about performing a greedy policy improvement from v(s), which means taking action a from state s that provides the best likely returns. However, this required knowing the transition probabilities (p(s’|s,a)) to figure out which action leads to the next best state.
Since MC is model-free, we do not have those probabilities. As a result, we cannot use v(s) for policy improvement directly. Instead, we just estimate qπ(s,a) by recording the discounted return Gt every time the agent takes action a from state s during a rollout and averaging those returns across many rollouts. This gives us a direct estimate of qπ(s,a) without needing to know the transition probabilities or calculating vπ(s). From there, we can simply improve the policy by choosing the action with the highest q(s,a) from each state s.
Recall our discussion at the end of the previous post where we discussed generalized policy iteration (GPI), which involves moving between evaluating a policy and updating that policy in order to slowly optimize the policy (and potentially model of the environment) over several iterations. We’re going to employ that technique in our MC methods.
Now that we know we want to estimate q(s,a) from rollouts, there’s an important issue we need to address before we can do that reliably: how do we ensure we visit enough states or state-action pairs to sample data?
Exploration Versus Exploitation
Monte Carlo estimates q(s,a) by averaging returns from visited state-action pairs. If the policy never tries a particular action from a particular state, we might never collect experience for that state-action pair, and its q estimate would stay at the initial value forever.
Early in training, a policy is often random and might accidentally avoid several state-action pairs for the entirety of the training process, which leaves us with no data to improve from. This is known as the exploration problem. The agent needs to try enough different actions from different states to build reliable q estimates. We would not know if we’ve settled on the optimal policy if we don’t have enough data.
As a result, RL engineers and researchers attempt to balance exploration versus exploitation. Ideally, early on in a training process, the agent should explore various actions and states, even if they are not necessarily optimal. As training continues and a better picture of the environment (and its rewards) forms, the policy can converge on something more optimal (i.e. exploit its knowledge of the environment).
A common solution is the epsilon-greedy policy (often written as “ε-greedy policy”). Rather than always choosing the action with the highest current q estimate, the agent introduces a small amount of randomness. With some small probability ε, the agent will choose a random action regardless of what the current q estimates say to do. The other percentage (1-ε) of the time, the agent will choose the greedy action (highest q estimate from the current state). This guarantees that, given enough experiments, every state-action pair will be visited.
In practice, ε will be configured to decay over time to encourage high exploration in the beginning of training and high exploitation toward the end of training.
We will use this idea of an ε-greedy policy to demonstrate two main Monte Carlo optimization algorithms: first-visit MC and every-visit MC.
First-Visit Monte Carlo
In first-visit MC, we only use the first time the state-action pair (s,a) is chosen to update the estimate of q(s,a). Subsequent visits to the same (s,a) pair within the same episode are ignored. We generate an episode to get a rollout (also known as an “experiment”) using the current policy. We use this information to estimate the return at that given timestep, working backward. From there, we estimate q(s,a) using those estimated returns.
Here is a formal definition of the algorithm.
Initialize:
Q(s,a) ∈ R, arbitrarily, for all s ∈ S, a ∈ A
Returns(s,a): an empty list, for all s ∈ S, a ∈ A
ε: small positive number (e.g. 0.1)
π: epsilon-greedy policy with respect to Q
Loop forever (for each episode):
Generate an episode following π: S₀, A₀, R₁, S₁, A₁, R₂, ..., ST-1, AT-1, RT
At each step, choose action:
With probability ε: random action from A
With probability 1-ε: argmaxa Q(St, a)
G ← 0
Loop for each step of episode, t = T-1, T-2, ..., 0:
G ← γG + Rt+1
If (St, At) does not appear in (S₀,A₀), (S₁,A₁), ..., (St-1, At-1):
Append G to Returns(St, At)
Q(St, At) ← average(Returns(St, At))
π(St) ← argmaxa Q(St, a)
The policy update (π(St) ← argmaxa Q(St, a)) shows that the agent records the best known action for state St, which is then used as the greedy choice in future episodes.
The ε-greedy behavior happens during execution of the policy during the episode (rollout) rather than the policy update step. This ensures that the policy is updated during the learning phase, but the agent still has an opportunity to explore other q(s,a) options when generating a rollout.
Note that after we perform a rollout, we work backward through the timesteps in order to first estimate the return G at that timestep, use the returns to update the estimate for Q(St, At) (state-action pair at timestep t), and then update the policy for that state. The GPI evaluation/policy update happens at each step.
Every-Visit Monte Carlo
Rather than skipping over subsequent (s,a) visits in the same trajectory, we can keep them in our estimation and update calculations. That gives us the following algorithm (notice the lack of first-visit check in the update section):
Initialize:
Q(s,a) ∈ R, arbitrarily, for all s ∈ S, a ∈ A
Returns(s,a): an empty list, for all s ∈ S, a ∈ A
ε: small positive number (e.g. 0.1)
π: epsilon-greedy policy with respect to Q
Loop forever (for each episode):
Generate an episode following π: S₀, A₀, R₁, S₁, A₁, R₂, ..., ST-1, AT-1, RT
At each step, choose action:
With probability ε: random action from A
With probability 1-ε: argmaxa Q(St, a)
G ← 0
Loop for each step of episode, t = T-1, T-2, ..., 0:
G ← γG + Rt+1
Append G to Returns(St, At)
Q(St, At) ← average(Returns(St, At))
π(St) ← argmaxa Q(St, a)
This method of using every visit to an (s,a) pair during a rollout is better for data efficiency: it takes fewer experimental episodes to gather enough data to find the optimal (or at least good enough) policy. However, it means that multiple returns from the same episode are correlated since they come from the same trajectory, which introduces bias into the estimates. As a result, you will often find first-visit methods used in academic texts as they are easier to analyze. In the real world, data efficiency often outweighs the bias, which means you’ll encounter every-visit methods more often in industry.
Concrete Example
Let’s walk through a concrete example using our 1D gridworld environment from the previous post and every-visit MC. We’ll start with our simple 5-state environment, assume we start in state 2, and start with a uniform random policy that chooses the action left or right with 50% probability. We’ll also choose a discount factor γ=0.9 and an epsilon of ε=0.1.

We don’t know anything about the transition probabilities, so we need to do a few experiments (rollouts) in order to estimate the value functions. We initialize q(s,a) = 0 for all state-action pairs.
Rollout 1
Let’s say we run an experiment in the environment with our uniform random policy, and we get the following trajectory with st as the state, at as the action taken from that state, and rt+1 as the reward received from that state-action pair. Remember that you have a 20% chance of ending up in the opposite state from the one chosen by the agent).
Initial: s0=2, a0=right
t=1: r1=0, s1=3, a1=right
t=2: r2=1, s2=4 (terminal)
It is often more efficient to update the q(s,a) estimates and update the policy after each experiment rather than waiting to collect a bunch of different rollouts under the same policy. As a result, we perform the evaluation and policy update now. Working backward in the trajectory, we update q(s,a) for each state-action pair visited.
t=2: G2 = 0 (terminal, no future rewards)
t=1: G1 = r2 + γ ⋅ G2 = 1 + 0.9 ⋅ 0 = 1.0
t=0: G0 = r1 + γ ⋅ G1 = 0 + 0.9 ⋅ 1.0 = 0.9
We just have one set of observations from this rollout, so we use those as estimates for the (s,a) pairs visited:
t=1: Q(3,right) = average(1.0) = 1.0
t=0: Q(2,right) = average(0.9) = 0.9
We then update the policy for the state-action pairs we have information about. The policy remains the same (50% left, 50% right) for the other states.
State 2: Q(2,left)=0 < Q(2,right)=0.9, so π(2)=right
State 3: Q(3,left)=0 < Q(3,right)=1.0, so π(3)=right
Rollout 2
We perform another iteration of the algorithm. We start by running an episode to collect rollout data:
Initial: s0=2, a0=right
t=1: r1=0, s1=1, a1=right (environment 20% to end up in opposite direction)
t=2: r2=0, s2=2, a2=right (new policy says we should choose right from state 2)
t=3: r3=0, s3=3, a3=right (new policy says we should choose right from state 3)
t=4: r4=1, s4=4 (terminal)
We then walk backward through the return G calculations:
t=4: G4 = 0 (terminal, no future rewards)
t=3: G3 = 1 + 0.9 ⋅ 0 = 1.0
t=2: G2 = 0 + 0.9 ⋅ 1.0 = 0.9
t=1: G1 = 0 + 0.9 ⋅ 0.9 = 0.81
t=0: G0 = 0 + 0.9 ⋅ 0.81 = 0.729
We then update Q values using the every-visit method. Notice that we are averaging the Q values from rollout 1 and rollout 2 to get the updated Q estimates. We are using information from every visit to an (s,a) pair from all previous rollouts.
t=3: Q(3,right) = average(1.0, 1.0) = 1.0
t=2: Q(2,right) = average(0.9, 0.9) = 0.9
t=1: Q(1,right) = average(0.81) = 0.81
t=0: Q(2,right) = average(0.9, 0.9, 0.729) = 0.843
We then update the policy for all visited (s,a) pairs.
State 1: Q(1,left)=0 < Q(1,right)=0.81, so π(1)=right
State 2: Q(2,left)=0 < Q(2,right)=0.843, so π(2)=right
State 3: Q(3,left)=0 < Q(3,right)=1.0, so π(3)=right
Rollout 3
Let’s do one more iteration. We run another episode to collect rollout data and get the following trajectory:
Initial: s0=2, a0=left (policy says choose right from state 2, 10% epsilon made us choose left)
t=1: r1=0, s1=1, a1=right (policy says choose right from state 1)
t=2: r2=0, s2=0 (environment 20% to end up in opposite direction, terminal)
This was an unlucky run, with the ε probability and stochastic environment ultimately ending us in state 0 with no rewards. However, this still has some useful data that we can use to update our knowledge of the environment. Let’s start with the backward G calculations:
t=2: G2 = 0 (terminal, no future rewards)
t=1: G1 = 0 + 0.9 ⋅ 0 = 0
t=0: G0 = 0 + 0.9 ⋅ 0 = 0
We then update our Q values for every (s,a) pair visited:
t=1: Q(1,right) = average(0.81, 0) = 0.405
t=0: Q(2,left) = average(0) = 0
Then, we update the policy for the states we visited:
State 1: Q(1,left)=0 < Q(1,right)=0.405, so π(1)=right
State 2: Q(2,left)=0 < Q(2,right)=0.843, so π(2)=right
State 3: Q(3,left)=0 < Q(3,right)=1.0, so π(3)=right
After three iterations, we can build a Q(s,a) table as follows, which logs the returns from each visit and the Q estimate for that (s,a) pair, which is just a simple average of the returns.
| State-Action | Returns | Q estimate |
| Q(1,left) | [] | 0 |
| Q(1,right) | [0.81, 0] | 0.405 |
| Q(2,left) | [] | 0 |
| Q(2,right) | [0.9, 0.9, 0.729] | 0.843 |
| Q(3,left) | [] | 0 |
| Q(3,right) | [1.0, 1.0] | 1.0 |
Note that we did not visit (1,left) and (3,left), so those are left empty. Ideally, as we continue to explore the environment (assuming the randomness from ε eventually takes us to all states), we should see the Q estimates converge to their actual values.
Monte Carlo Limitations
MC methods offer the ability to learn the optimal policy by exploring the environment without needing to understand the transition dynamics (i.e. a model of the environment). This is a powerful stepping stone from dynamic programming. However, MC has some limitations you should be aware of:
- You cannot update value or action estimates until the episode completes because you need the full return G. This makes it impossible for continuing tasks (like our original cartpole example) and requires long computation times for long episodes.
- Because G is the sum of many random rewards across a full episode, it can vary from episode to episode, leading to a high variance. As a result, you may need to perform many rollouts to get useful estimates, which can be slow and computationally expensive.
- If rewards are only given at the end of long episodes, this can take a while (and be computationally expensive) for the return estimates to work their way back to earlier actions.
What if we did not have to wait for the episode to end and could update our value estimates after each step? This is the motivation behind temporal difference (TD) learning, which we will cover in the next post.
Conclusion
We saw how we can sample data from actual episodes in order to construct action value estimates for the environment without needing to know the transition dynamics. This is an incredibly powerful technique as we move into model-free learning algorithms.
Note that RL textbooks, such as Sutton and Barto’s Reinforcement Learning: An Introduction, will often provide additional MC methods, such as state-value estimates, exploring starts, incremental updates, and so on. I chose to focus on the action-value estimate methods, as they provided the simplest examples of using sampling to estimate returns as well as perform policy improvement.
The MC methods we demonstrated are considered to be a form of on-policy learning, as the policy being updated is the same policy being used in the rollouts. We will examine off-policy learning algorithms in later posts.
In the next post, we’ll build on these ideas to update our estimates after each step (rather than waiting for episode completion) in TD learning.
As usual, if you have any questions or suggestions, please leave a comment!
