Edge AI has been gaining traction, and deploying deep learning models on microcontrollers (MCUs) has evolved from a niche experiment to a mainstream engineering practice. In 2025, embedded machine learning (ML) has grown from keyword spotting experiments to powering industrial anomaly detection, predictive maintenance, and vision-based automation, all on devices with just a few hundred kilobytes of RAM.
The typical workflow still follows the same pattern:
- Train a model using a desktop or cloud framework such as TensorFlow, PyTorch, or Scikit-learn.
- Optimize and compress the model through quantization, pruning, and conversion to lightweight formats (e.g., .tflite, .onnx).
- Deploy the model using an embedded runtime that interprets or compiles it into efficient C/C++ code.
- Run inference on the target MCU, often leveraging DSP or NPU acceleration.
Early embedded ML efforts included projects like Arm’s uTensor, mbed OS AI, and Microsoft ELL (Embedded Learning Library), which aimed to translate trained models into portable C++ for small devices. These frameworks were crucial in proving that neural networks could run on microcontrollers, but they’ve since been superseded by more capable runtimes such as LiteRT for Microcontrollers (formerly TensorFlow Lite Micro) and vendor-integrated SDKs. Most of these older projects are now archived or discontinued, serving primarily as historical milestones in the evolution of TinyML.
This article surveys the state of embedded ML in 2025, comparing the major runtimes and toolchains available today, from open-source frameworks like LiteRT for Microcontrollers to vendor-specific SDKs and full-service platforms like Edge Impulse.
Training Frameworks
Before we get to runtimes, let’s look at the popular training ecosystems:
- TensorFlow remains one of the most popular tools for embedded ML thanks to its seamless export path to LiteRT and LiteRT for Microcontrollers (formerly TensorFlow Lite and TensorFlow Lite Micro). TensorFlow 2.x’s tf.keras API and post-training quantization tools make it a friendly choice for MCU-class deployments.
- PyTorch dominates research and edge-to-cloud workflows. With Meta’s introduction of ExecuTorch, PyTorch models now have a direct deployment path to embedded devices.
- Scikit-learn (often abbreviated sklearn) continues to be the go-to framework for classical machine learning (e.g. decision trees, random forests, SVMs, logistic regression, and simple neural networks). While not a deep-learning framework itself, it plays a critical role in preprocessing data, feature engineering, and model baselining. Many embedded ML workflows begin in Scikit-learn before moving to TensorFlow or PyTorch for deep learning. In 2025, Scikit-learn models can also be converted to ONNX format using the sklearn-onnx converter, making them compatible with microcontroller runtimes that accept .onnx models.
ONNX (Open Neural Network Exchange) acts as a neutral bridge between frameworks. It enables exporting models from PyTorch, TensorFlow, or other tools into a standardized .onnx format for deployment across different runtimes.
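As a quick illustration of the sklearn-onnx path mentioned above, here is a minimal sketch; the model, dataset, and file name are just placeholders, and it assumes the skl2onnx package is installed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

# Train a small classical model (placeholder example).
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10).fit(X, y)

# Declare the input signature (batches of 4 float features) and convert.
onnx_model = convert_sklearn(
    clf, initial_types=[("input", FloatTensorType([None, 4]))]
)
with open("rf_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```

The resulting .onnx file can then be fed to any toolchain in this article that accepts ONNX input.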
Once trained, models are optimized, usually through quantization (reducing precision to INT8 or mixed precision), pruning (removing redundant weights), and sometimes distillation (transferring knowledge from a large model to a smaller one). The result is a compact model suitable for MCUs.
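For the quantization step, a minimal post-training INT8 sketch with the TensorFlow/LiteRT converter might look like the following; `model` stands in for an already-trained tf.keras model and `calib_data` for a few hundred representative input samples:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few hundred calibration samples so the converter can pick
    # quantization ranges for the activations.
    for sample in calib_data[:200]:
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization so weights and activations end up as INT8.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Pruning and distillation happen earlier, at training time; this step only changes the numeric representation of the finished model.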
These compressed models are often deployed along with the firmware binary, which includes a runtime environment for the model. In the rest of the post, we’ll look at a wide variety of popular runtime environments for embedded systems. Note that I’ll mostly focus on microcontroller targets for these environments: desktop and mobile frameworks are beyond the scope of this article.
Agnostic Runtimes
These are frameworks designed to run on a wide variety of hardware platforms. They aren’t tied to a specific vendor’s silicon, making them attractive for prototyping or cross-platform work.
LiteRT for Microcontrollers (Formerly TensorFlow Lite Micro)
Designed to bring neural network inference to resource-constrained devices, this runtime enables models trained in frameworks like TensorFlow (and, via conversion, PyTorch) to run on microcontrollers with extremely limited RAM and no OS. As part of the broader LiteRT initiative from Google, it aligns with the newer branding while maintaining compatibility with .tflite models and the existing embedded ML ecosystem (a deployment sketch follows the list below).
- Development: Active, led by Google with strong community contributions.
- Hardware support: Portable C++ runtime for Cortex-M, RISC-V, ARC, and Xtensa cores. Integrates with CMSIS-NN and vendor-specific kernels for acceleration.
- Ease of use: Excellent documentation, wide adoption, integrated with many SDKs.
- Licensing: Open source (Apache 2.0).
- Limitations: Lacks dynamic memory allocation (all tensor arenas must be pre-allocated), limited operator coverage compared to TensorFlow Lite.
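Because the runtime performs no dynamic allocation, deployment usually means baking the .tflite flatbuffer into the firmware image as a constant byte array (the TFLM examples use xxd -i for this). Here is a rough Python equivalent of that step; file and symbol names are placeholders:

```python
# Convert a .tflite flatbuffer into a C/C++ source file the firmware can
# compile in. (Equivalent to `xxd -i model_int8.tflite`.)
with open("model_int8.tflite", "rb") as f:
    data = f.read()

lines = ["#include <stdint.h>", "",
         "alignas(16) const uint8_t g_model_data[] = {"]
for i in range(0, len(data), 12):
    chunk = ", ".join(f"0x{b:02x}" for b in data[i:i + 12])
    lines.append(f"  {chunk},")
lines += ["};", "", f"const unsigned int g_model_data_len = {len(data)};"]

with open("model_data.cc", "w") as f:
    f.write("\n".join(lines) + "\n")
```

At startup the firmware hands g_model_data to the interpreter along with a statically sized tensor arena, which is exactly where the pre-allocation limitation above comes from.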
ExecuTorch (Meta)
ExecuTorch is Meta’s embedded runtime for PyTorch models, enabling developers to deploy trained neural networks directly from PyTorch to mobile, embedded, and MCU-class devices. It focuses on unifying PyTorch’s export, optimization, and execution pipeline across edge platforms.
ExecuTorch runs on embedded and MCU-class CPUs, but its most fully featured and optimized path targets hardware with dedicated accelerators such as the Arm Ethos-U55/U65/U85 NPUs. On very constrained MCU systems without an NPU, you may need to fall back on portable CPU kernels, and the ecosystem isn't yet as mature as LiteRT for Microcontrollers.
- Development: Rapidly growing since 2023; backed by Meta and supported by partners like Arm and Qualcomm.
- Hardware support: Targets embedded Linux and bare-metal MCUs; supports Arm Ethos-U NPUs and DSPs via delegate backends.
- Ease of use: Integrates tightly with PyTorch workflows; you can export a model with torch.export() and deploy with minimal conversion steps (see the sketch after this list).
- Licensing: Open source (BSD-style).
- Limitations: Newer ecosystem; smaller operator library than TensorFlow but expanding quickly.
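To give a feel for the workflow, here is a rough sketch of exporting a toy model to an ExecuTorch .pte program. The exact module paths have shifted between ExecuTorch releases, so treat the API below as an approximation of recent versions; TinyNet is just a stand-in model:

```python
import torch
from torch.export import export
from executorch.exir import to_edge

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()
example_inputs = (torch.randn(1, 64),)

# Capture the graph, lower it to the Edge dialect, then serialize an
# ExecuTorch program that the on-device runtime can load.
exported = export(model, example_inputs)
et_program = to_edge(exported).to_executorch()

with open("tinynet.pte", "wb") as f:
    f.write(et_program.buffer)
```

Accelerator backends such as Ethos-U are plugged in at the lowering stage, before the program is serialized.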
microTVM (Apache TVM Project)
microTVM is the embedded branch of Apache TVM, an end-to-end compiler stack. Unlike frameworks such as LiteRT for Microcontrollers, microTVM isn't a pre-built runtime: it's a compiler framework that generates optimized C code for a specific MCU target, which you then integrate with your existing toolchain and build system. This gives developers fine-grained control and maximum performance but adds complexity compared to drop-in interpreters.
- Development: Mature open-source project under Apache Software Foundation.
- Hardware support: Arm Cortex-M, RISC-V, and custom accelerators; generates optimized C code from model graphs.
- Ease of use: Powerful but requires familiarity with TVM’s compilation stack; excellent for research and code-generation workflows.
- Licensing: Apache 2.0.
- Limitations: Not a “drop-in” runtime. In other words, you cannot just link to it as a library. It’s a tool that generates code for your target architecture.
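To make that concrete, here is a heavily hedged sketch in the spirit of the microTVM tutorials; the exact options and module layout vary between TVM releases, and it assumes a TVM build with microTVM enabled, the tflite flatbuffer package, and a quantized model file:

```python
import tvm
import tflite  # flatbuffer bindings for .tflite files
from tvm import relay
from tvm.relay.backend import Executor, Runtime

# Load a quantized .tflite model into Relay.
with open("model_int8.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)
mod, params = relay.frontend.from_tflite(tflite_model)

# Ask for plain C output with the ahead-of-time executor and the C runtime
# (CRT), which is what gets linked into bare-metal firmware.
target = tvm.target.target.micro("host")
with tvm.transform.PassContext(
    opt_level=3, config={"tir.disable_vectorize": True}
):
    module = relay.build(
        mod,
        target=target,
        executor=Executor("aot"),
        runtime=Runtime("crt"),
        params=params,
    )

# Export the generated C sources in Model Library Format for your own
# build system to pick up.
tvm.micro.export_model_library_format(module, "model.tar")
```

You then compile the generated sources into your firmware like any other C module.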
AIfES
AIfES (Artificial Intelligence for Embedded Systems) is a lightweight, open-source ML framework from the Fraunhofer Institute for Microelectronic Circuits and Systems (IMS) built specifically for ultra-constrained devices (including 8-bit microcontrollers). It supports both inference and on-device training, making it a unique choice in the embedded-ML space for truly edge-only intelligence.
- Development: Actively maintained (Arduino‐compatible port exists) and evolving with support for FNNs, CNNs, and training routines.
- Supported hardware: Works on very small microcontrollers, including 8-bit Arduino Uno and Cortex-M devices; designed for tinyML use-cases.
- Ease of use: Moderate difficulty. You’ll still need to embed and configure neural-network parameters, but the library abstracts much of the low-level complexity.
- Licensing: Dual-license model: open for free under AGPLv3 for open-source use; commercial licensing required for proprietary use.
- Limitations: Operator and model support is more limited compared to major frameworks; its focus on on-device training and very constrained memory footprints means it's less common in industrial embedded ML stacks.
Ekkono Edge AI
Ekkono Edge AI is a commercial-grade embedded ML SDK from the Swedish company Ekkono Solutions, focusing on streaming analytics, self-learning/continuous learning on-device, and deployment across constrained hardware (from microcontrollers to PLCs). Its distinguishing feature is that the “learning” part can happen on the device itself, enabling adaptive, context-aware intelligence.
- Development: Actively developed, with recent mentions in benchmarking studies and partner integrations.
- Supported hardware: Offers “Ekkono Edge” (a C++ library for devices with megabytes of memory, e.g., gateways or PLCs) and “Ekkono Crystal” (a C library optimized for microcontrollers with only kilobytes of memory).
- Ease of use: Designed to be easy once you adopt the Ekkono workflow; however, models are developed within the platform rather than via import of arbitrary frameworks.
- Licensing: Proprietary; a commercial license is required, and it is not open source.
- Limitations: Because models must be developed in the Ekkono ecosystem, you have less flexibility to bring in arbitrary TensorFlow/PyTorch models without adaptation, and support for the very smallest MCUs is narrower than with “drop-in” runtimes.
NanoEdge AI Studio
NanoEdge AI Studio (originally by Cartesiam, now part of STMicroelectronics) is a push-button, no-code to low-code tool aimed at embedded developers who lack deep ML expertise. It generates anomaly-detection, classification, or regression libraries for Arm Cortex-M microcontrollers in just a few clicks, lowering the barrier to TinyML deployment.
- Development: Actively developed; optimized for STM32 ecosystem and continually updated.
- Supported hardware: Targeted at Arm Cortex-M microcontrollers (STM32 family), with libraries generated to run in very small RAM/flash footprints (roughly 0.5 kB to 10 kB, depending on the use case).
- Ease of use: Very easy. Desktop tool guides through data capture, processing, and library export; minimal ML background needed.
- Licensing: Free evaluation license available; production use requires a paid license (contact ST/Cartesiam).
- Limitations: Focused on STM32 ecosystem (though libraries may run more broadly); less flexibility for deep custom networks compared to full-code runtimes; best suited to signal processing/anomaly use cases rather than large vision networks.
ONNX Runtime (No Bare-Metal MCU Port)
ONNX Runtime is a cross-platform inference engine designed for desktops, servers, and embedded-Linux systems, supporting acceleration through CPU, GPU, and NPU execution providers. It does not currently run on bare-metal microcontrollers. While there is no “ONNX Runtime Micro”, many vendor SDKs can import .onnx models and compile them into MCU-friendly code.
Vendor-Specific Runtimes
These SDKs are tailored for particular microcontroller families and often provide code generation, quantization, and accelerator support out of the box.
STMicroelectronics: STM32Cube.AI
ST’s STM32Cube.AI (formerly X-CUBE-AI) is a code-generation tool that converts trained neural networks into optimized C code for STM32 microcontrollers. It’s built to make ML deployment as seamless as firmware generation inside STM32CubeIDE.
- Development: Very active, integrated into STM32CubeMX and STM32CubeIDE.
- Hardware: STM32 MCUs across the portfolio (Cortex-M0+ up to Cortex-M7 cores, e.g., the F, L, and H series).
- Ease of use: Excellent graphical integration; converts .onnx, .tflite, and Keras models into C code.
- License: Free for STM32 users.
- Limitations: Proprietary tooling; cannot easily port outside STM32 ecosystem.
NXP: eIQ ML Software Development Environment
NXP’s eIQ software environment provides an integrated ML workflow, from model conversion to deployment, for their i.MX RT crossover MCUs and application processors. It combines TensorFlow Lite, Glow, and ONNX support with hardware acceleration through NXP libraries.
- Development: Active, part of NXP MCUXpresso suite.
- Hardware: i.MX RT crossover MCUs, i.MX 8/9 SoCs.
- Ease of use: Integrates TensorFlow Lite and Glow; provides model converter GUI.
- License: Free with NXP SDK.
- Limitations: Best performance on NXP hardware; limited support for other vendors.
Renesas: e-AI and RUHMI
Renesas’s e-AI framework allows developers to integrate neural networks directly into Renesas MCUs and MPUs, automating model optimization, quantization, and code generation for devices across the RA, RX, and RZ families.
RUHMI (Robust Unified Heterogeneous Model Integration) extends e-AI’s capabilities with a unified runtime and UI framework designed to combine traditional embedded processing with AI-enhanced interfaces (e.g., voice, vision, and gesture recognition) on next-generation parts like the RA8P1 with built-in NPUs. Together, e-AI and RUHMI offer a full stack for deploying and interacting with intelligent embedded systems, from low-level inference to user-facing applications.
- Development: Active, especially with RA and RZ families.
- Hardware: RA, RX, and RZ MCUs/MPUs, including new Arm Ethos-U NPU support on RA8P1.
- Ease of use: GUI-based code generator, works with TensorFlow Lite models.
- License: Free with Renesas tools; some modules proprietary.
- Limitations: Limited community content; requires Renesas e2 studio.
Espressif: ESP-DL and ESP-NN
ESP-DL is a full inference library for Espressif SoCs (e.g., ESP32 series) that supports loading, quantizing, converting, and running neural networks. It includes APIs for neural networks, image processing, and matrix operations, and is oriented toward deep-learning models on the ESP-IDF platform.
ESP-NN is a library of optimized neural-network functions for Espressif chips (such as ESP32-S3, ESP32-C3) that acts as a performance/back-end library: it provides low-level kernels (e.g., convolution, activations) that higher-level runtimes (like TFLite Micro) can use.
- Development: Open source and maintained by Espressif.
- Hardware: ESP32-S3, ESP32-P4, and other ESP32-series chips (both Xtensa and RISC-V cores).
- Ease of use: Integrates with ESP-IDF; models converted using TensorFlow Lite tools.
- License: Apache 2.0.
- Limitations: Primarily optimized for vision and audio; lacks general model import beyond TFLite.
Infineon: ModusToolbox for ML
ModusToolbox for ML integrates AI model deployment directly into Infineon’s ModusToolbox ecosystem, enabling TensorFlow Lite models to be converted and optimized for PSoC MCUs. It streamlines ML integration into mixed analog–digital embedded designs.
- Development: Growing rapidly; integrated with ModusToolbox ecosystem.
- Hardware: PSoC 6/62 and AI-capable Infineon devices.
- Ease of use: Code-generation workflow; supports TensorFlow Lite models.
- License: Free within Infineon tools.
- Limitations: Closed-source converters; ecosystem smaller than ST or NXP.
Texas Instruments: Edge AI Studio
TI’s Edge AI Studio is a low-code, full-workflow tool suite that enables data-capture, model training/re-training, compilation/optimization and deployment to TI MCUs and SoCs. While it supports MCUs, the actual inference at runtime uses TI’s underlying libraries and accelerator SDKs rather than a completely new runtime engine. Ideal if you’re targeting TI devices and prefer a guided workflow over building from scratch.
- Development: Actively maintained by TI with both cloud-based and desktop tool versions (e.g., Model Composer, Model Analyzer) for deploying edge AI workflows.
- Hardware support: Supports a wide range of TI processors, MCUs, and radar sensors (e.g. AM62A3/AM68A vision SoCs and C2000-series microcontrollers).
- Ease of use: Very easy; provides GUI tools for data collection, model training/re-training (including bring-your-own-data workflows), model compilation/deployment, and performance benchmarking, with minimal ML background required.
- Licensing: Free to use (cloud and desktop tools) for TI devices; underlying software tools such as the “Edge AI Software and Development Tools” repository are under BSD-3-Clause license.
- Limitations: Primarily optimized for TI’s hardware ecosystem (less portable to non-TI silicon); the GUI tools may support only a subset of model types (e.g., classification/detection/segmentation) and may require a TI development board for full benchmarking.
Microchip: ML Development Suite
Microchip’s ML Development Suite focuses on the workflow of model creation and conversion. The runtime on the MCU is essentially the generated code optimized for Microchip hardware rather than a separate runtime library.
- Development: Active; focuses on AVR and SAM microcontrollers.
- Hardware: 8-bit to 32-bit MCUs.
- Ease of use: Provides ML model importer and code generator integrated into MPLAB X.
- License: Free within MPLAB; closed-source tools.
- Limitations: Smaller model/operator coverage compared to ST/NXP.
Platforms as a Service
Edge Impulse (acquired by Qualcomm, 2025)
Edge Impulse provides a full end-to-end platform, from data collection to model training, optimization, and deployment. Models can be exported to TensorFlow Lite Micro, ONNX, or even fully compiled C++ code. Note that Edge Impulse relies on a customized version of LiteRT for Microcontrollers for its runtime.
- Ease of use: Excellent web interface and SDK integration.
- Hardware: Broadest support in the industry: works with STM32, Nordic, Renesas, NXP, Espressif, and many more.
- License: Free tier with paid enterprise features.
- Limitations: Closed-source platform; some advanced optimization features only in enterprise plans.
Neuton.AI (acquired by Nordic Semiconductor, 2025)
Neuton.AI is a fully automated TinyML model-generation platform that lets developers with little or no ML expertise quickly build ultra-small neural network models (often under ~5 kB) and deploy them directly on embedded devices. It emphasizes an optimized memory footprint and rapid deployment across 8-, 16-, and 32-bit MCUs and smart sensors.
- Hardware support: Targets extremely constrained devices (8-, 16-, and 32-bit MCUs and sensor ISPs) and is validated in partnership ecosystems (e.g., Arm, STMicroelectronics) for low-footprint embedded deployments.
- Ease of use: Very easy: no coding required; model generation and export are highly automated via a no-code/low-code interface. Designed for embedded developers rather than ML specialists.
- License: Proprietary, commercial platform (not open source), although it offers accessible tiers aimed at broad availability.
- Limitations: Because models are generated via a highly automated, “black-box” process, developers may have less direct control over network architecture, operator selection, and runtime optimization compared to fully open frameworks. Also, since it’s focused on extremely small footprints, it may be less optimized for large vision models or devices with large memory.
MicroAI
MicroAI is a commercial edge-AI platform that targets embedded and industrial systems with “agentic” intelligence: on-device models, real-time insights, and self-learning assets. While it covers the full stack from data ingestion to inference and analytics, it’s less of a pure MCU-runtime library and more of a packaged solution for deploying intelligence across devices and industrial infrastructure.
- Hardware support: Targets embedded devices including MCUs/MPUs, industrial edge appliances, and IoT equipment; specifically cited for use with Arm Cortex-M series and industrial devices.
- Ease of use: Designed for rapid deployment in industrial systems with minimal ML engineering required; offers “self-learning” engines, no-code/low-code workflows, and edge-native deployment.
- Licensing: Proprietary/commercial offering. The focus is on enterprise/industrial licensing rather than open-source runtime libraries.
- Limitations: Because it is a full-stack, packaged system rather than a lightweight runtime, it may offer less direct control over model internals, operators, or firmware-level optimization compared to lean MCU-centric runtimes; also may incur higher cost/licensing and assume more infrastructure.
Community and Experimental Runtimes
Beyond the major frameworks, there’s a small but vibrant ecosystem of community-built runtimes that experiment with lightweight inference on bare-metal hardware. Projects like onnx2c, tinygrad-micro, and microinfer translate trained models directly into portable C code, often with minimal dependencies. These tools are typically open source and highly portable, making them great for hobbyists or academic work, though they tend to have limited operator coverage, sparse documentation, and aren’t yet ready for large-scale production use.
Choosing the Right Runtime
Each runtime offers a different balance between flexibility, performance, and support. I recommend looking into the following based on your needs:
- LiteRT for Microcontrollers (formerly TensorFlow Lite Micro) is still the gold standard for running neural networks on a microcontroller. Give ExecuTorch a shot if you want to stay in the PyTorch family.
- Vendor-specific SDKs (e.g. STM32Cube.AI, NXP eIQ, Renesas e-AI) should be considered if you’re targeting a single hardware family and want tight IDE integration and optimized kernels.
- Edge Impulse (or similar platforms) are great if you want a turnkey, end-to-end workflow. I find these low-code platforms ideal for rapid prototyping and education, but they can struggle when you need deep customizability, maximum performance, or tight integration with a particular vendor toolset.
Going Further
In 2025, the embedded ML ecosystem is converging around interoperability: ONNX as the lingua franca, INT8 quantization as the default precision, and Arm Ethos-U NPUs providing significant acceleration even in mid-range MCUs. As hardware vendors adopt these common interfaces, deploying deep learning models to the edge is becoming more accessible, efficient, and unified across toolchains.
The key is to match your workflow to your constraints. Whether you want a fully-managed SaaS solution or a bare-metal runtime you can tweak in C, there’s never been a better time to bring intelligence to the tiniest devices.
I recommend checking out this paper if you’d like to dive more into the performance metrics of some of these frameworks. If I missed a framework in the article, please let me know in the comments!
If you’re interested in learning more about embedded ML, I recommend checking out my Coursera course.
