Over the past few years, we’ve seen an explosion of interest in AI. Most of the conversation has been centered around large language models and generative tools. While those are fascinating in their own right, I’ve personally been more interested in how AI techniques connect to the physical world, especially as it relates to embedded systems and robotics.
One idea I’ve been thinking about exploring is a project that sits at the intersection of all three: robotics, machine learning, and microcontrollers. The rough idea is this: train a reinforcement learning agent to balance a two-wheel robot in simulation and then deploy the trained policy to run on an ESP32-based robot.
In other words, take the classic self-balancing robot problem and approach it from a modern machine learning perspective instead of the usual PID-control approach. While the classic PID controller is likely a better (more efficient, faster, simpler) approach, a balance bot seems like an approachable, tactile way for people to start tinkering with reinforcement learning (RL).
Before I go too far down that path, though, I’d love to hear what people think about the idea.
Why I Like This Idea
The two-wheel balancing robot is one of those classic robotics problems that has been around for decades. It’s essentially the same problem solved by devices like the Segway: a robot balancing an inverted pendulum on top of two wheels. Traditionally, this problem is solved using control theory. Most implementations use some combination of a PID controller, sensor fusion from an IMU, and a fair amount of manual tuning. That approach works well and is often the best choice for many real-world systems. However, reinforcement learning offers a very different way of approaching the same problem.
Instead of designing the controller directly, you define a reward function and allow an agent to learn how to balance through trial and error in a simulated environment. Over many thousands (or millions) of iterations, the agent gradually learns a policy that maps sensor inputs to motor commands that keep the robot upright.
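To make the reward-function idea concrete, here is a minimal sketch of what one might look like for this task. The state variables, weights, and tilt limit are all hypothetical placeholders, not a final design:

```python
import math

def balance_reward(tilt_rad: float, tilt_rate: float, wheel_speed: float) -> float:
    """Hypothetical reward: 1.0 when perfectly upright and still,
    decreasing as the robot tilts, rotates, or spins its wheels."""
    # Tilt is penalized most heavily; motion terms discourage jitter.
    return 1.0 - (2.0 * abs(tilt_rad) + 0.1 * abs(tilt_rate) + 0.01 * abs(wheel_speed))

def episode_done(tilt_rad: float, limit_rad: float = math.radians(30)) -> bool:
    """An episode typically ends (no further reward) past some tilt limit."""
    return abs(tilt_rad) > limit_rad
```

The agent maximizes the sum of this reward over an episode, so "stay upright as long as possible, as smoothly as possible" falls out of the trial-and-error process rather than being coded explicitly.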
What makes this especially interesting right now is the growing momentum behind what some people are calling physical AI: machine learning systems that interact with and control real-world systems instead of just generating text or images. Robotics is one of the most natural places for this to happen. Training in simulation and deploying policies on real hardware has become much more accessible over the last few years thanks to improvements in tools, frameworks, and computing power. At the same time, embedded systems are becoming more capable. Microcontrollers like the ESP32 have enough compute and memory to run relatively small neural networks, especially when using optimized inference frameworks.
So the question that has been bouncing around in my head lately is this: Could we train a reinforcement learning policy in simulation and then run that policy directly on a microcontroller-based robot? This should be possible in theory, since most deep RL agents use relatively small neural networks for inference. However, it introduces a host of other challenges (building an accurate simulation, making the simulation-to-real (sim2real) transfer work, and so on) that make for interesting educational opportunities.
RL Robot as an Educational Series
From a teaching perspective, this kind of project has a lot of appealing elements, as it connects several topics that are often taught separately: embedded systems, robotics, control, and machine learning. Many tutorials focus on just one of these areas in isolation. A project like this could show how they all fit together in a complete system. It also brings RL into the real world and offers a path for people to start building their own robot characters in the same vein as Disney’s BDX and Olaf robots.
It would also mirror a workflow that is becoming increasingly common in robotics and AI systems. Training in simulation and then deploying to real hardware is a core concept in modern robotics development. More importantly, it’s the kind of project that has a clear and intuitive goal. A robot balancing on two wheels is something you can immediately see and understand, which makes it great for explaining underlying concepts and keeping students engaged.
What the Series Might Look Like
If I were to turn this into a series, the overall arc would follow the full lifecycle of the project: from building the robot itself to eventually deploying a trained reinforcement learning policy onto embedded hardware. I probably wouldn’t start with reinforcement learning at all. The first step would be assembling a simple two-wheel robot and demonstrating how this problem is traditionally solved using control theory. A self-balancing robot is essentially an inverted pendulum, and most implementations rely on a PID controller combined with sensor fusion from an IMU. That approach provides a reliable baseline and a useful point of comparison before introducing machine learning.
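For reference, that traditional baseline is remarkably compact: a PID loop that maps tilt error from the IMU to a motor command. A generic sketch is below; the gains, signs, and output scaling are placeholders that would need tuning on a real robot:

```python
class PID:
    """Textbook PID controller for the balance loop."""

    def __init__(self, kp: float, ki: float, kd: float, setpoint: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target tilt angle (upright = 0)
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured: float, dt: float) -> float:
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # In practice this output is scaled/clamped to the motor driver's PWM range.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: a forward tilt (positive angle) produces a corrective command
# in the opposite direction.
pid = PID(kp=20.0, ki=0.5, kd=1.0)
command = pid.update(measured=0.1, dt=0.01)
```

Part of what makes the RL version interesting is that the learned policy ends up playing exactly the role of this `update` function: state in, motor command out.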
The next step would be creating a physics-based simulation of the robot. Training reinforcement learning agents directly on physical hardware is usually impractical (it takes too long and would likely result in a lot of broken robots), so the idea would be to model the robot’s dynamics in simulation and use that environment for training.
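As a sketch of what that simulation loop involves, here is a deliberately simplified inverted-pendulum model integrated with one Euler step per control tick. A real environment would use a proper physics engine such as PyBullet, and every parameter here is a placeholder:

```python
import math

GRAVITY = 9.81   # m/s^2
LENGTH = 0.1     # m, axle-to-center-of-mass distance (placeholder)
DT = 0.01        # s, simulation timestep

def step(theta: float, theta_dot: float, wheel_accel: float):
    """One Euler step of a simplified inverted pendulum on a moving base.
    theta: tilt from vertical (rad); wheel_accel: base acceleration (m/s^2)."""
    # Gravity torque tips the robot over; accelerating the base counteracts it.
    theta_ddot = (GRAVITY * math.sin(theta) - wheel_accel * math.cos(theta)) / LENGTH
    theta_dot += theta_ddot * DT
    theta += theta_dot * DT
    return theta, theta_dot

# With no control input, a tiny initial tilt grows quickly: the robot falls.
theta, theta_dot = 0.01, 0.0
for _ in range(50):
    theta, theta_dot = step(theta, theta_dot, wheel_accel=0.0)
```

Even this toy model captures the instability that makes the task nontrivial, which is exactly what the RL agent has to fight against at every timestep.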
Once the simulated robot behaves reasonably like the real one, the focus would shift to connecting a reinforcement learning agent and allowing it to learn the balancing task through interaction with the simulator. Instead of explicitly designing a controller, the agent would observe the robot’s state (e.g. tilt angle and angular velocity) and gradually learn which motor commands keep it upright. There are a number of interesting directions this part could go, such as introducing disturbances or uneven terrain in the simulation to help the agent learn a more robust policy. Training would likely involve some experimentation with hyperparameter optimization techniques like grid search or Bayesian optimization, and I’m also curious whether imitation learning could help bootstrap the process (for example, allowing the agent to initially learn from the behavior of the PID controller before improving on its own).
Finally, once a policy works well in simulation, the challenge becomes deploying it to the real robot. That means converting the trained neural network into a form that can run efficiently on a microcontroller like the ESP32 and integrating it into the control loop so the robot can compute motor commands in real time from IMU data. In practice, there is almost always a gap between simulation and reality, and bridging that simulation-to-real (sim2real) divide (i.e. figuring out why something that works perfectly in simulation struggles on physical hardware) is often one of the most interesting parts of projects like this. It’s also where many of the most valuable lessons tend to emerge.
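To make the deployment step concrete: the trained policy is ultimately just a small feed-forward network, so the on-device control loop reduces to a handful of multiply-accumulate loops. Below is a pure-Python sketch of that inference step; the layer sizes and weights are arbitrary placeholders, and a real deployment would likely use something like TensorFlow Lite for Microcontrollers or hand-written C on the ESP32:

```python
import math

def dense(x, weights, biases):
    """One fully connected layer (y = W @ x + b), written as plain loops
    to mirror what the C implementation on a microcontroller would do."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def policy(obs, params):
    """Tiny 2-layer MLP: observation (tilt, tilt rate) -> motor command."""
    hidden = [math.tanh(h) for h in dense(obs, *params["layer1"])]
    (out,) = dense(hidden, *params["layer2"])
    return math.tanh(out)  # motor command squashed into [-1, 1]

# Placeholder parameters: 2 inputs -> 4 hidden units -> 1 output.
params = {
    "layer1": ([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.2, 0.1]],
               [0.0, 0.0, 0.0, 0.0]),
    "layer2": ([[0.6, -0.3, 0.2, 0.5]], [0.0]),
}
command = policy([0.05, -0.1], params)  # called every control cycle from IMU data
```

A network this size runs comfortably within a control loop of a few hundred hertz on an ESP32, which is part of why the whole idea seems feasible.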
A Lot of Open Questions
One of the reasons I’m writing about this now is that there are still many decisions to be made about how a project like this would be implemented. For example, there are several possible choices for the reinforcement learning stack. PyTorch-based libraries like Stable-Baselines3 are popular, but there are also newer frameworks designed specifically for robotics and simulation. Similarly, there are several possible options for the physics simulation. Tools like PyBullet (often wrapped in a Gymnasium environment interface) or even more advanced robotics simulators could all work, each with their own trade-offs in terms of realism, complexity, and learning curve.
On the embedded side, there are also interesting questions about how best to run the trained policy. The model might need to be simplified or quantized in order to fit within the memory constraints of a microcontroller. Alternatively, some of the computation could be offloaded to a companion processor while the microcontroller handles real-time control. These are exactly the kinds of decisions that can make a project like this educational. Understanding why one tool or approach is chosen over another is often more valuable than simply following a fixed set of instructions.
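As a rough illustration of the quantization idea mentioned above: each float32 weight can be mapped to an 8-bit integer with a shared scale factor, cutting memory roughly 4x at the cost of a small rounding error. This is a simplified symmetric-quantization sketch; real toolchains such as TFLite also handle zero-points, per-channel scales, and quantized activations:

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to int8 range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]

weights = [0.82, -0.31, 0.05, -0.67, 0.12]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Whether that half-step error is tolerable for the balance policy is itself an empirical question, and exactly the kind of trade-off worth exploring in the series.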
I’d Love Your Input
Before I commit to building something like this, I’d really like to get a sense of whether this is something people would find interesting or useful. If you’re someone working in embedded systems, robotics, or machine learning, I’d be especially curious to hear your perspective.
Does the idea of a reinforcement learning balance robot sound interesting? Would you be more interested in the robotics side of the project, the reinforcement learning concepts, or the embedded deployment aspects? Are there particular tools, frameworks, or topics that you’d want to see explored?
If you have experience with reinforcement learning in robotics, what platforms and frameworks do you recommend for simulation (PyBullet, Isaac Sim, Gazebo, etc.) as well as training the agent (Stable-Baselines3, RLlib, OpenRL, etc.)?
One of the things I enjoy most about writing and creating educational content is that the direction often evolves through conversations with readers and other engineers. Projects tend to get better when they incorporate ideas and perspectives from the community. If you have thoughts about how you would approach something like this (or if there are particular aspects you’d want explained in detail) I’d love to hear them. Please leave a comment here or through social media!

This is a great idea. For me, the robotics side and embedded deployment are of more interest.
My answer to your questions:
Absolutely — the idea of a reinforcement learning balance robot sounds fascinating, and honestly, all three aspects appeal to me equally: the robotics, the reinforcement learning concepts, and the embedded deployment. I’d love to see whichever angle makes for the most compelling and well-rounded exploration, or even all three woven together if possible.
I’d be interested in this series. I’m particularly interested in ROS and Gazebo for a variety of bodies from rollers to walkers but would tune in for other platforms to see different approaches.
This seems like an end-to-end dream project.
That’s a great idea. Right now, I am experimenting with a reinforcement learning (RL) approach to see if it can perform as well as or better than traditional control algorithms using the M5Stack Bala2-Fire (https://docs.m5stack.com/en/app/bala2fire).
The key is how effectively I can bridge the Sim2Real gap. I am training the RL model in a simulation environment using PyBullet and Stable-Baselines3 to evaluate its balancing capability. Finally, I will develop the actual hardware application (Arduino) and challenge the robot to maintain its balance using the trained RL model.
After some trial and error, I managed to achieve a stable balance for over 10 minutes.
Thanks for the info and kit recommendation! I was looking at that kit as well, so it’s good to know that it can be used. Do you have any tips for addressing the Sim2Real gap?