Feasibility Analysis for Embedded ML

How to Know If Your Model Will Fit Before You Write Any Code

The graveyard of embedded ML projects is filled with models that worked perfectly in Python. The failure mode is almost always the same: months into integration, someone discovers the model doesn't fit in SRAM, or inference takes 400ms when the budget was 50ms, or the weights exceed Flash capacity by 30%. Every one of these failures was predictable: you can know, before writing any firmware, whether a project is feasible.

This guide walks through the quantitative analysis. We'll cover memory estimation, latency prediction, and the calculations that separate viable projects from expensive lessons. The running example is vibration analysis for drone motor health, but the framework applies to any embedded ML problem.


Before You Start: The DSP Question

The first feasibility question isn't about memory or latency. It's whether you need ML at all.

For drone vibration analysis, many failure modes have known frequency signatures. Propeller imbalance produces a strong peak at rotor frequency. Bearing wear shows broadband noise increase in specific bands. Motor winding issues appear as electrical frequency harmonics. An FFT followed by threshold detection on specific bins might achieve 90% of the accuracy at 10% of the computational cost.

ML shines when patterns are too complex to characterize analytically, when you need to distinguish between similar-looking anomalies, or when the "normal" baseline varies in ways that resist fixed thresholds. If you can describe the trigger condition in a sentence without using the word "like," DSP is probably sufficient. "Trigger if amplitude at 400Hz exceeds 2g" is DSP. "Trigger if it sounds like a failing bearing" is ML.
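If that distinction feels abstract, here is roughly what the DSP-only version of that trigger looks like as a Python sketch (NumPy only; the 4kHz sample rate matches the accelerometer used later in this guide, and the windowing and scaling details are assumptions made for readability):

    import numpy as np

    SAMPLE_RATE_HZ = 4000    # accelerometer rate used in the worked example below
    TARGET_FREQ_HZ = 400     # "amplitude at 400Hz"
    THRESHOLD_G = 2.0        # "exceeds 2g"

    def imbalance_alert(samples_g: np.ndarray) -> bool:
        """DSP-only trigger: window, FFT, read one bin, compare to a fixed threshold."""
        n = len(samples_g)
        window = np.hanning(n)
        spectrum = np.abs(np.fft.rfft(samples_g * window))
        spectrum *= 2.0 / window.sum()          # so a sine of amplitude A reads as ~A
        bin_index = round(TARGET_FREQ_HZ * n / SAMPLE_RATE_HZ)
        return spectrum[bin_index] > THRESHOLD_G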

The optimal architecture often combines both: DSP for feature extraction, ML for classification. This reduces the neural network's job from "understand raw vibration" to "classify this feature vector," which requires a much smaller model. But that's an optimization. First, let's see if the naive approach fits at all.


The Three Budgets

Every embedded ML feasibility analysis reduces to three questions:

  • Flash: Do the weights fit in non-volatile storage?
  • SRAM: Can the device hold activations during inference?
  • Latency: Can inference complete within the timing budget?

If any answer is no, you need a different model or different hardware. The goal is to answer these questions with napkin math before touching a compiler.


Worked Example: Drone Motor Vibration Analysis

You're building a system that monitors drone motor health by analyzing accelerometer data. The goal: detect bearing wear, propeller imbalance, or impending failures before they cause a crash.

Application Requirements:

  • Sample rate: 4kHz accelerometer (3-axis)
  • Analysis window: 256ms of data (1024 samples per axis)
  • Response time: Alert within 500ms of anomaly onset
  • Target hardware: STM32L476 (Cortex-M4F @ 80MHz, 128KB SRAM, 1MB Flash)
  • Power budget: Continuous operation on battery

The Model: A 1D CNN for vibration classification.

  • Input: 1024 × 3 (time samples × axes)
  • Conv1D: 32 filters, kernel size 8, stride 4 → output 256 × 32
  • Conv1D: 64 filters, kernel size 4, stride 2 → output 128 × 64
  • Global Average Pooling → 64
  • Dense: 64 → 4 classes (healthy, bearing wear, imbalance, unknown)
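For reference, the architecture above could be written in Keras along these lines (a sketch, not the definitive model: "same" padding and ReLU activations are assumptions chosen to reproduce the output shapes listed):

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1024, 3)),                        # 256ms window, 3 axes
        tf.keras.layers.Conv1D(32, kernel_size=8, strides=4,
                               padding="same", activation="relu"),   # -> 256 x 32
        tf.keras.layers.Conv1D(64, kernel_size=4, strides=2,
                               padding="same", activation="relu"),   # -> 128 x 64
        tf.keras.layers.GlobalAveragePooling1D(),               # -> 64
        tf.keras.layers.Dense(4, activation="softmax"),         # 4 classes
    ])
    model.summary()   # total parameters should match the hand count below: 9,316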

Now let's see if it fits.


Budget 1: Flash (Weight Storage)

Weights are stored in Flash and read during inference. Count parameters, multiply by bytes per parameter.

Counting Parameters

Conv1D Layer 1: 8 × 3 × 32 + 32 = 800 parameters
Conv1D Layer 2: 4 × 32 × 64 + 64 = 8,256 parameters
Dense: 64 × 4 + 4 = 260 parameters

Total: 9,316 parameters
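The same counts are easy to script, which helps when you iterate on filter counts and kernel sizes (a plain-Python sketch, not tied to any framework):

    def conv1d_params(kernel, in_ch, out_ch):
        return kernel * in_ch * out_ch + out_ch          # weights + biases

    def dense_params(in_features, out_features):
        return in_features * out_features + out_features

    total = (
        conv1d_params(8, 3, 32)      # 800
        + conv1d_params(4, 32, 64)   # 8,256
        + dense_params(64, 4)        # 260
    )
    print(total)                     # 9,316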

Bytes per Parameter

Format     Bytes/Param   Model Size
Float32    4             37 KB
Float16    2             19 KB
INT8       1             9 KB

With INT8 quantization: ~10KB for weights. But weights aren't the only Flash consumer. If you're using TensorFlow Lite Micro, the interpreter itself takes 50-100KB. AOT-compiled models avoid this overhead.

Weights: ~10KB
TFLite Micro runtime: ~60KB
Application code: ~40KB
Total: ~110KB of 1MB Flash (11%)

The runtime is 6× larger than the model. For very constrained devices, this overhead often matters more than model size.


Budget 2: SRAM (Activation Memory)

This is where projects die. SRAM usage isn't the sum of all layer outputs—it's the peak memory required at any moment during inference.

The High Water Mark

The runtime allocates a single contiguous buffer (the tensor arena) at startup. During inference, once a layer's output has been consumed, that memory is reused. The peak usage occurs at the boundary between two large layers, where both input and output must coexist.

Calculating Activation Sizes

Input buffer: 1024 × 3 × 1 byte = 3,072 bytes
After Conv1D Layer 1: 256 × 32 × 1 byte = 8,192 bytes
After Conv1D Layer 2: 128 × 64 × 1 byte = 8,192 bytes
After Global Avg Pool: 64 × 1 byte = 64 bytes

Finding the Peak

Layer 1: input (3,072) + output (8,192) = 11,264 bytes
Layer 2: input (8,192) + output (8,192) = 16,384 bytes ← Peak
Dense: input (64) + output (4) = 68 bytes

But we're not done. Convolutions often require scratch buffers for the im2col transformation. For our Conv1D layers, this adds roughly 8-10KB at the peak.

Peak activations: ~16KB
im2col scratch: ~10KB
Total arena requirement: ~26KB
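The peak can also be estimated mechanically from the tensor shapes. The sketch below assumes INT8 activations (1 byte per element) and a simple "input and output alive at once" model of the arena; real runtimes use a memory planner plus scratch buffers, so treat the result as a floor:

    # element counts for each activation tensor, INT8 so 1 byte per element
    sizes = {
        "input": 1024 * 3,   # 3,072
        "conv1": 256 * 32,   # 8,192
        "conv2": 128 * 64,   # 8,192
        "gap":   64,
        "dense": 4,
    }

    # Each layer needs its input and its output resident at the same time.
    boundaries = [("input", "conv1"), ("conv1", "conv2"),
                  ("conv2", "gap"), ("gap", "dense")]
    peak = max(sizes[a] + sizes[b] for a, b in boundaries)

    IM2COL_SCRATCH = 10 * 1024            # rough figure from the text above
    print(peak, peak + IM2COL_SCRATCH)    # 16,384 bytes and ~26KB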

The Safety Margin

Never plan to use 100% of SRAM. You need headroom for stack, DMA buffers, and application state. Target 80% utilization maximum.

For the STM32L476 with 128KB SRAM: ~100KB available for ML. Our requirement is 26KB. That's 26% utilization, leaving 74KB for application logic.

The Input Resolution Trap: Our model's SRAM usage is dominated by intermediate activations, not the input. But if we'd started with 4096 samples instead of 1024, or skipped the aggressive stride-4 in the first layer, the memory profile would look very different. A common failure mode: the first layer produces activations so large that nothing else fits. The fix is aggressive early downsampling.

Budget 3: Latency

Latency estimation requires knowing both computational cost and how efficiently your hardware executes it.

Counting MACs

A MAC (Multiply-Accumulate) is the fundamental unit. For Conv1D: Output_Length × Output_Channels × Kernel_Size × Input_Channels.

Conv1D Layer 1: 256 × 32 × 8 × 3 = 196,608 MACs
Conv1D Layer 2: 128 × 64 × 4 × 32 = 1,048,576 MACs
Dense: 64 × 4 = 256 MACs

Total: ~1.25M MACs
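As with parameters, the MAC count is a few lines of arithmetic (a sketch using the Conv1D formula above):

    def conv1d_macs(out_len, out_ch, kernel, in_ch):
        return out_len * out_ch * kernel * in_ch

    total_macs = (
        conv1d_macs(256, 32, 8, 3)      # 196,608
        + conv1d_macs(128, 64, 4, 32)   # 1,048,576
        + 64 * 4                        # dense: 256
    )
    print(total_macs)                   # 1,245,440, i.e. ~1.25M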

Cycles per MAC

How many clock cycles per MAC? This depends on architecture and optimization quality.

Architecture          Peak throughput    Real-world cost
Cortex-M4F (DSP)      0.5 MACs/cycle     2-3 cycles/MAC
Cortex-M7             1.0 MACs/cycle     1-1.5 cycles/MAC
Cortex-M55 (Helium)   4.0 MACs/cycle     0.3-0.5 cycles/MAC
Ethos-U55 (NPU)       128+ MACs/cycle    0.01-0.02 cycles/MAC

Real-world numbers account for memory stalls, loop overhead, and pipeline bubbles. Never plan with peak numbers.

The Latency Equation

T_inference = (MACs × Cycles/MAC) / Clock_Frequency

T_inference = (1,250,000 × 2.5) / 80,000,000
T_inference ≈ 39ms
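Since the real-world cycles/MAC figure is a range, it is worth treating the latency as a range too (a sketch; the 2-3 cycles/MAC band and the 80MHz clock come from the table and the hardware spec above):

    CLOCK_HZ = 80_000_000
    TOTAL_MACS = 1_245_440

    for cycles_per_mac in (2.0, 2.5, 3.0):
        latency_ms = TOTAL_MACS * cycles_per_mac / CLOCK_HZ * 1000
        print(f"{cycles_per_mac} cycles/MAC -> {latency_ms:.0f} ms")   # ~31, ~39, ~47 ms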

Memory Bandwidth

The latency equation assumes you can feed data to the processor fast enough. For models in internal Flash (100+ MB/s), this is rarely the bottleneck. But if weights spill to external QSPI (10-50 MB/s), bandwidth limits inference rate regardless of CPU speed.

Our 10KB model in internal Flash has no bandwidth concerns.


System Latency: The Full Picture

Inference time is only part of the story. Total system latency determines whether you meet application requirements:

T_total = T_sensor + T_preprocessing + T_inference + T_postprocessing

For our system:

T_sensor: 256ms (1024 samples at 4kHz; a physical constraint, not optimizable)
T_preprocessing: ~5ms (FFT, normalization)
T_inference: 39ms
T_postprocessing: <1ms

Total: ~301ms

Our requirement was response within 500ms. We have margin. But notice: sensor accumulation dominates. If the requirement had been 200ms, no amount of model optimization would help. You'd need a shorter input window, which means retraining on less temporal context.


The Verdict

Budget           Required   Available       Utilization
Flash            ~110KB     1MB             11%
SRAM             ~26KB      ~100KB usable   26%
Inference        39ms       n/a             n/a
System latency   ~301ms     500ms           60%

(The SRAM figure is measured against the ~100KB left after the headroom rule from Budget 2; inference time is counted inside system latency rather than against a budget of its own.)

This model fits with headroom on all three budgets. That headroom matters: it absorbs the inevitable surprises during integration, leaves room for application logic, and allows future model improvements without hardware changes.


When the Numbers Don't Work

What if feasibility analysis shows the model doesn't fit? Options, roughly in order of preference:

Reduce model complexity. Fewer filters, smaller kernels, fewer layers, lower input resolution. Each change has accuracy implications, but these are usually the cheapest fixes.

Optimize the model. INT8 quantization (if not already applied), pruning, knowledge distillation. These preserve architecture while reducing resource requirements.

Change the architecture. Depthwise separable convolutions cut MACs by 8-9× for typical 3×3 kernels (the arithmetic is sketched after this list). A DSP+ML hybrid (spectrogram features fed to a tiny classifier) can be far smaller than an end-to-end CNN.

Upgrade the hardware. More SRAM, faster clock, or hardware acceleration. This is usually the most expensive option, both in BOM cost and development time.

The order matters. Moving down the list increases project complexity and cost.
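The depthwise-separable claim is easy to sanity-check. Per output position, a standard convolution costs taps × C_in × C_out MACs, while the depthwise-plus-pointwise version costs taps × C_in + C_in × C_out; the familiar 8-9× figure comes from 3×3 2D kernels with wide channel counts, and the saving is smaller for short 1D kernels like the ones in this example (a sketch of that arithmetic):

    def separable_speedup(taps, in_ch, out_ch):
        """Ratio of standard-conv MACs to depthwise-separable MACs per output position."""
        standard = taps * in_ch * out_ch
        separable = taps * in_ch + in_ch * out_ch    # depthwise + pointwise
        return standard / separable

    print(separable_speedup(3 * 3, 64, 64))   # ~7.9x for a typical 3x3 2D convolution
    print(separable_speedup(4, 32, 64))       # ~3.8x for this guide's second Conv1D layer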


Validating Before Hardware

Napkin math gets you 80% of the way. The gap between estimate and measurement is typically 10-30%, which is why targeting 80% utilization provides necessary margin.

Before committing to physical hardware, virtual platforms let you validate estimates. Arm Virtual Hardware provides cycle-accurate Cortex-M simulation in the cloud—run your compiled binary and get precise timing without a dev board. Renode offers broader MCU coverage for functional validation and approximate timing. Even QEMU, though not cycle-accurate, catches quantization bugs and memory issues before they reach silicon.

The progression is typically: QEMU for functional correctness, Renode for system integration, AVH for accurate performance numbers. If your feasibility analysis shows comfortable margins, you can often skip straight to hardware. If it's tight, virtual validation prevents expensive surprises.


The Framework

Every embedded ML feasibility analysis follows the same structure:

  1. Ask whether ML is necessary, or if DSP suffices
  2. Calculate Flash: parameters × bytes/param + runtime overhead
  3. Calculate SRAM: peak of (input + output + scratch) across all layers
  4. Calculate inference latency: MACs × cycles/MAC ÷ clock frequency
  5. Calculate system latency: sensor + preprocessing + inference + postprocessing
  6. Compare against hardware specs and timing requirements

If utilization exceeds 80% on any resource, either optimize or upgrade before writing firmware. Changing the model is easy in the planning phase. Discovering it doesn't fit during integration is expensive.
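The whole check fits in a short script, which is worth keeping next to the model definition so the napkin math gets rerun whenever the architecture changes (a sketch using this section's numbers and the 80% ceiling; utilization here is measured against the raw hardware totals):

    BUDGETS = {
        # resource: (required, available)
        "flash_bytes":       (110_000, 1_000_000),
        "sram_bytes":        (26_000, 128_000),    # arena estimate vs. total SRAM
        "system_latency_ms": (301, 500),
    }
    MAX_UTILIZATION = 0.8

    for resource, (required, available) in BUDGETS.items():
        utilization = required / available
        verdict = "OK" if utilization <= MAX_UTILIZATION else "OVER BUDGET"
        print(f"{resource:18} {utilization:5.0%}  {verdict}")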

The calculations here are first-order approximations. Production systems need empirical validation: actual profiling on target hardware, real memory measurements from the runtime, testing with representative data. But these approximations are accurate enough to kill doomed projects early and give viable ones a clear path forward.
