Feasibility Analysis for Embedded ML
How to Know If Your Model Will Fit Before You Write Any Code
The graveyard of embedded ML projects is filled with models that worked perfectly in Python. The failure mode is almost always the same: months into integration, someone discovers the model doesn't fit in SRAM, or inference takes 400ms when the budget was 50ms, or the weights exceed Flash capacity by 30%. Every one of these failures was predictable with simple arithmetic. You can know, before writing any firmware, whether a project is feasible.
This guide walks through the quantitative analysis. We'll cover memory estimation, latency prediction, and the calculations that separate viable projects from expensive lessons. The running example is vibration analysis for drone motor health, but the framework applies to any embedded ML problem.
Before You Start: The DSP Question
The first feasibility question isn't about memory or latency. It's whether you need ML at all.
For drone vibration analysis, many failure modes have known frequency signatures. Propeller imbalance produces a strong peak at rotor frequency. Bearing wear shows broadband noise increase in specific bands. Motor winding issues appear as electrical frequency harmonics. An FFT followed by threshold detection on specific bins might achieve 90% of the accuracy at 10% of the computational cost.
ML shines when patterns are too complex to characterize analytically, when you need to distinguish between similar-looking anomalies, or when the "normal" baseline varies in ways that resist fixed thresholds. If you can describe the trigger condition in a sentence without using the word "like," DSP is probably sufficient. "Trigger if amplitude at 400Hz exceeds 2g" is DSP. "Trigger if it sounds like a failing bearing" is ML.
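To make the distinction concrete, here's a minimal sketch of that pure-DSP trigger in Python/NumPy. The 400Hz target and 2g threshold are illustrative values from the sentence above, not tuned ones:

```python
import numpy as np

def propeller_imbalance_check(window, sample_rate=4000, rotor_hz=400.0, threshold_g=2.0):
    """Pure-DSP check: does the FFT bin nearest the rotor frequency exceed a fixed threshold?

    `window` is a 1-D array of accelerometer samples (in g) for one axis.
    """
    n = len(window)
    amplitude = np.abs(np.fft.rfft(window)) * 2.0 / n    # single-sided amplitude spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    bin_idx = int(np.argmin(np.abs(freqs - rotor_hz)))   # bin closest to the rotor frequency
    return amplitude[bin_idx] > threshold_g
```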
The optimal architecture often combines both: DSP for feature extraction, ML for classification. This reduces the neural network's job from "understand raw vibration" to "classify this feature vector," which requires a much smaller model. But that's an optimization. First, let's see if the naive approach fits at all.
The Three Budgets
Every embedded ML feasibility analysis reduces to three questions:
- Flash: Do the weights fit in non-volatile storage?
- SRAM: Can the device hold activations during inference?
- Latency: Can inference complete within the timing budget?
If any answer is no, you need a different model or different hardware. The goal is to answer these questions with napkin math before touching a compiler.
Worked Example: Drone Motor Vibration Analysis
You're building a system that monitors drone motor health by analyzing accelerometer data. The goal: detect bearing wear, propeller imbalance, or impending failures before they cause a crash.
Application Requirements:
- Sample rate: 4kHz accelerometer (3-axis)
- Analysis window: 256ms of data (1024 samples per axis)
- Response time: Alert within 500ms of anomaly onset
- Target hardware: STM32L476 (Cortex-M4F @ 80MHz, 128KB SRAM, 1MB Flash)
- Power budget: Continuous operation on battery
The Model: A 1D CNN for vibration classification.
- Input: 1024 × 3 (time samples × axes)
- Conv1D: 32 filters, kernel size 8, stride 4 → output 256 × 32
- Conv1D: 64 filters, kernel size 4, stride 2 → output 128 × 64
- Global Average Pooling → 64
- Dense: 64 → 4 classes (healthy, bearing wear, imbalance, unknown)
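For reference, here's a minimal Keras sketch of this architecture. The ReLU activations and `padding="same"` are assumptions made so the output lengths match the shapes listed above:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1024, 3)),                                        # 256ms window, 3 axes
    tf.keras.layers.Conv1D(32, 8, strides=4, padding="same", activation="relu"),   # -> 256 x 32
    tf.keras.layers.Conv1D(64, 4, strides=2, padding="same", activation="relu"),   # -> 128 x 64
    tf.keras.layers.GlobalAveragePooling1D(),                                      # -> 64
    tf.keras.layers.Dense(4, activation="softmax"),  # healthy / bearing wear / imbalance / unknown
])
model.summary()  # reports 9,316 trainable parameters
```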
Now let's see if it fits.
Budget 1: Flash (Weight Storage)
Weights are stored in Flash and read during inference. Count parameters, multiply by bytes per parameter.
Counting Parameters
Conv1D Layer 1: 8 × 3 × 32 + 32 = 800 parameters
Conv1D Layer 2: 4 × 32 × 64 + 64 = 8,256 parameters
Dense: 64 × 4 + 4 = 260 parameters
Total: 9,316 parameters
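The same arithmetic as a short script, which also gives the storage sizes used in the next table:

```python
def conv1d_params(kernel, in_ch, out_ch):
    # weights (kernel x in_ch per filter) plus one bias per filter
    return kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    return in_units * out_units + out_units

params = (
    conv1d_params(8, 3, 32)      # 800
    + conv1d_params(4, 32, 64)   # 8,256
    + dense_params(64, 4)        # 260
)
print(params)  # 9316

for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1024:.1f} KB")
```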
Bytes per Parameter
| Format | Bytes/Param | Model Size |
|---|---|---|
| Float32 | 4 | 37 KB |
| Float16 | 2 | 19 KB |
| INT8 | 1 | 9 KB |
With INT8 quantization: ~10KB for weights. But weights aren't the only Flash consumer. If you're using TensorFlow Lite Micro, the interpreter itself takes 50-100KB. AOT-compiled models avoid this overhead.
Model weights (INT8): ~10KB
TFLite Micro runtime: ~60KB
Application code: ~40KB
Total: ~110KB of 1MB Flash (11%)
The runtime is 6× larger than the model. For very constrained devices, this overhead often matters more than model size.
Budget 2: SRAM (Activation Memory)
This is where projects die. SRAM usage isn't the sum of all layer outputs—it's the peak memory required at any moment during inference.
The High Water Mark
The runtime allocates a single contiguous buffer (the tensor arena) at startup. During inference, once a layer's output has been consumed, that memory is reused. The peak usage occurs at the boundary between two large layers, where both input and output must coexist.
Calculating Activation Sizes
Input buffer: 1024 × 3 × 1 byte = 3,072 bytes
After Conv1D Layer 1: 256 × 32 × 1 byte = 8,192 bytes
After Conv1D Layer 2: 128 × 64 × 1 byte = 8,192 bytes
After Global Avg Pool: 64 × 1 byte = 64 bytes
Finding the Peak
Layer 1: input (3,072) + output (8,192) = 11,264 bytes
Layer 2: input (8,192) + output (8,192) = 16,384 bytes ← Peak
Dense: input (64) + output (4) = 68 bytes
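A sketch of that high-water-mark calculation, ignoring scratch buffers for the moment:

```python
# (layer, input_bytes, output_bytes); INT8, so 1 byte per activation
layers = [
    ("conv1", 1024 * 3, 256 * 32),   # 3,072 in, 8,192 out
    ("conv2", 256 * 32, 128 * 64),   # 8,192 in, 8,192 out
    ("gap",   128 * 64, 64),
    ("dense", 64, 4),
]

peak_layer, peak_bytes = max(
    ((name, inp + out) for name, inp, out in layers), key=lambda item: item[1]
)
print(peak_layer, peak_bytes)  # conv2 16384
```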
But we're not done. Convolutions often require scratch buffers for the im2col transformation. For our Conv1D layers, this adds roughly 8-10KB at the peak.
im2col scratch: ~10KB
Total arena requirement: ~26KB
The Safety Margin
Never plan to use 100% of SRAM. You need headroom for stack, DMA buffers, and application state. Target 80% utilization maximum.
For the STM32L476 with 128KB SRAM: ~100KB available for ML. Our requirement is 26KB. That's 26% utilization, leaving 74KB for application logic.
Budget 3: Latency
Latency estimation requires knowing both computational cost and how efficiently your hardware executes it.
Counting MACs
A MAC (Multiply-Accumulate) is the fundamental unit. For Conv1D: Output_Length × Output_Channels × Kernel_Size × Input_Channels.
Conv1D Layer 1: 256 × 32 × 8 × 3 = 196,608 MACs
Conv1D Layer 2: 128 × 64 × 4 × 32 = 1,048,576 MACs
Dense: 64 × 4 = 256 MACs
Total: ~1.25M MACs
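The same counting as a script, using the output lengths from the architecture above:

```python
def conv1d_macs(out_len, out_ch, kernel, in_ch):
    # one multiply-accumulate per (output position, output channel, tap, input channel)
    return out_len * out_ch * kernel * in_ch

total_macs = (
    conv1d_macs(256, 32, 8, 3)       # 196,608
    + conv1d_macs(128, 64, 4, 32)    # 1,048,576
    + 64 * 4                         # dense layer: 256
)
print(f"{total_macs:,}")  # 1,245,440 (~1.25M)
```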
Cycles per MAC
How many clock cycles per MAC? This depends on architecture and optimization quality.
| Architecture | Peak (MACs/cycle) | Real-World (cycles/MAC) |
|---|---|---|
| Cortex-M4F (DSP) | 0.5 | 2-3 |
| Cortex-M7 | 1.0 | 1-1.5 |
| Cortex-M55 (Helium) | 4.0 | 0.3-0.5 |
| Ethos-U55 (NPU) | 128+ | 0.01-0.02 |
Real-world numbers account for memory stalls, loop overhead, and pipeline bubbles. Never plan with peak numbers.
The Latency Equation
T_inference = (MACs × cycles per MAC) / clock frequency
T_inference = (1,250,000 × 2.5) / 80,000,000 ≈ 39ms
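Or as a sketch, using the mid-range real-world figure for the Cortex-M4F:

```python
macs = 1_250_000
cycles_per_mac = 2.5       # mid-range real-world figure for Cortex-M4F from the table
clock_hz = 80_000_000      # STM32L476 at 80 MHz

t_inference = macs * cycles_per_mac / clock_hz
print(f"{t_inference * 1000:.0f} ms")  # ~39 ms
```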
Memory Bandwidth
The latency equation assumes you can feed data to the processor fast enough. For models in internal Flash (100+ MB/s), this is rarely the bottleneck. But if weights spill to external QSPI (10-50 MB/s), bandwidth limits inference rate regardless of CPU speed.
Our 10KB model in internal Flash has no bandwidth concerns.
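A quick sanity check: weights are read roughly once per inference, so the bandwidth the model needs is about model size divided by inference time:

```python
model_bytes = 10 * 1024    # ~10 KB of INT8 weights
t_inference_s = 0.039      # from the latency estimate above

required_mbps = model_bytes / t_inference_s / 1e6
print(f"{required_mbps:.2f} MB/s")  # ~0.26 MB/s, orders of magnitude below internal Flash bandwidth
```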
System Latency: The Full Picture
Inference time is only part of the story. Total system latency determines whether you meet application requirements:
T_total = T_sensor + T_preprocessing + T_inference + T_postprocessing
For our system:
T_sensor: 256ms (accumulating the 1024-sample window at 4kHz)
T_preprocessing: ~5ms (FFT, normalization)
T_inference: 39ms
T_postprocessing: <1ms
Total: ~301ms
Our requirement was response within 500ms. We have margin. But notice: sensor accumulation dominates. If the requirement had been 200ms, no amount of model optimization would help. You'd need a shorter input window, which means retraining on less temporal context.
The Verdict
| Budget | Requirement | Available | Utilization |
|---|---|---|---|
| Flash | ~110KB | 1MB | 11% |
| SRAM | ~26KB | 128KB | 26% |
| Inference | 39ms | — | — |
| System Latency | 301ms | 500ms | 60% |
This model fits with headroom on all three budgets. That headroom matters: it absorbs the inevitable surprises during integration, leaves room for application logic, and allows future model improvements without hardware changes.
When the Numbers Don't Work
What if feasibility analysis shows the model doesn't fit? Options, roughly in order of preference:
Reduce model complexity. Fewer filters, smaller kernels, fewer layers, lower input resolution. Each change has accuracy implications, but these are usually the cheapest fixes.
Optimize the model. INT8 quantization (if not already applied), pruning, knowledge distillation. These preserve architecture while reducing resource requirements.
Change the architecture. Depthwise separable convolutions cut MACs substantially: roughly 8-9× for typical 3×3 2D convolutions, less for small 1D kernels (see the sketch after this list). A DSP+ML hybrid (spectrogram features fed to a tiny classifier) can be far smaller than an end-to-end CNN.
Upgrade the hardware. More SRAM, faster clock, or hardware acceleration. This is usually the most expensive option, both in BOM cost and development time.
The order matters. Moving down the list increases project complexity and cost.
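To put a number on the depthwise-separable option for our dominant layer (a sketch; the actual savings depend on kernel size and channel count):

```python
def conv1d_macs(out_len, out_ch, kernel, in_ch):
    return out_len * out_ch * kernel * in_ch

def depthwise_separable_1d_macs(out_len, out_ch, kernel, in_ch):
    depthwise = out_len * in_ch * kernel    # one small filter per input channel
    pointwise = out_len * in_ch * out_ch    # 1x1 conv to mix channels
    return depthwise + pointwise

standard = conv1d_macs(128, 64, 4, 32)                    # 1,048,576
separable = depthwise_separable_1d_macs(128, 64, 4, 32)   # 278,528
print(f"{standard / separable:.1f}x fewer MACs")          # ~3.8x for this layer
```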
Validating Before Hardware
Napkin math gets you 80% of the way. The gap between estimate and measurement is typically 10-30%, which is why targeting 80% utilization provides necessary margin.
Before committing to physical hardware, virtual platforms let you validate estimates. Arm Virtual Hardware provides cycle-accurate Cortex-M simulation in the cloud—run your compiled binary and get precise timing without a dev board. Renode offers broader MCU coverage for functional validation and approximate timing. Even QEMU, though not cycle-accurate, catches quantization bugs and memory issues before they reach silicon.
The progression is typically: QEMU for functional correctness, Renode for system integration, AVH for accurate performance numbers. If your feasibility analysis shows comfortable margins, you can often skip straight to hardware. If it's tight, virtual validation prevents expensive surprises.
The Framework
Every embedded ML feasibility analysis follows the same structure:
- Ask whether ML is necessary, or if DSP suffices
- Calculate Flash: parameters × bytes/param + runtime overhead
- Calculate SRAM: peak of (input + output + scratch) across all layers
- Calculate inference latency: MACs × cycles/MAC ÷ clock frequency
- Calculate system latency: sensor + preprocessing + inference + postprocessing
- Compare against hardware specs and timing requirements
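Pulled together, the whole framework fits in a short script. This is a sketch built from the worked-example assumptions (INT8 weights, ~100KB of runtime plus application code, ~10KB of scratch, 2.5 cycles/MAC); swap in your own numbers:

```python
def feasibility(params, peak_activation_bytes, macs,
                flash_bytes, sram_bytes, clock_hz,
                bytes_per_param=1, overhead_bytes=100_000,
                scratch_bytes=10_000, cycles_per_mac=2.5,
                max_utilization=0.8):
    """Napkin-math feasibility check; utilization above max_utilization is a red flag."""
    flash_needed = params * bytes_per_param + overhead_bytes   # weights + runtime + app code
    sram_needed = peak_activation_bytes + scratch_bytes        # tensor arena estimate
    t_inference_s = macs * cycles_per_mac / clock_hz
    return {
        "flash_utilization": flash_needed / flash_bytes,
        "sram_utilization": sram_needed / sram_bytes,
        "inference_s": t_inference_s,
        "fits": (flash_needed / flash_bytes < max_utilization
                 and sram_needed / sram_bytes < max_utilization),
    }

# Worked-example numbers: INT8 model on the STM32L476
print(feasibility(params=9_316, peak_activation_bytes=16_384, macs=1_250_000,
                  flash_bytes=1_000_000, sram_bytes=128 * 1024, clock_hz=80_000_000))
```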
If utilization exceeds 80% on any resource, either optimize or upgrade before writing firmware. Changing the model is easy in the planning phase. Discovering it doesn't fit during integration is expensive.
The calculations here are first-order approximations. Production systems need empirical validation: actual profiling on target hardware, real memory measurements from the runtime, testing with representative data. But these approximations are accurate enough to kill doomed projects early and give viable ones a clear path forward.