Running Inference at the Edge: The Memory Budget

May 3, 2025

#edge-ai

1. Quick-look, “in a nutshell”

Your quantized model is listed at 4 GB on the model card. You load it onto an NVIDIA Jetson Orin with 8 GB of unified memory. It OOMs at 3.2 GB loaded. You reduce batch size to 1. Still OOM. You switch to INT4 weights. Finally it runs — but the throughput is half of what you expected.

The memory budget for edge inference is not the model size. The model size is the floor. Above it sit the KV cache, activation buffers, framework overhead, OS footprint, and peak memory spikes during decoding. None of these appear on the model card, and the KV cache and activation buffers can each exceed the model’s own weight footprint at long contexts or large batch sizes.

This article builds a complete picture of where memory goes during inference, derives the formulas for each component, and applies them to a concrete example: Llama-3 8B on a Jetson Orin NX 16 GB.

2. The five memory consumers

Total memory = weights + KV cache + activations + framework overhead + OS/runtime

Most discussions stop at weights. The rest of the budget is equally important at the edge, where you are often trying to fit an inference workload into a fixed LPDDR5 or unified memory pool shared with the CPU and display subsystem.

3. Weights

The one everyone calculates:

weight_memory = num_parameters × bytes_per_parameter

Precision	Bytes per parameter	7B model
FP32	4	28 GB
FP16 / BF16	2	14 GB
INT8	1	7 GB
INT4 / MXFP4	0.5	3.5 GB

For a 7B model in INT4, the weight floor is 3.5 GB. Everything else is additive.
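The table can be reproduced with a few lines of Python. This is a minimal sketch: the names `BYTES_PER_PARAM` and `weight_memory_gb` are illustrative, and the calculation ignores the per-block scale factors that real INT4 formats store alongside the weights, so actual files tend to be somewhat larger.

```python
# Bytes per parameter, taken from the precision table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight footprint in decimal GB (1 GB = 1e9 bytes), as used in the table."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Reproduce the 7B column of the table.
for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```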

4. KV cache

The KV cache stores the keys and values computed for each token in the attention layers, so they do not need to be recomputed on subsequent decode steps. It is the largest single source of surprise in LLM memory budgets.

kv_cache_bytes =
    2          (key + value)
  × num_layers
  × num_kv_heads
  × head_dim
  × max_sequence_length
  × batch_size
  × bytes_per_element

For a Llama-3 8B model (32 layers, 8 KV heads, head dim 128) running at FP16, with a 4096-token context and batch size 1:

2 × 32 × 8 × 128 × 4096 × 1 × 2 bytes
= 2 × 32 × 8 × 128 × 4096 × 2
= 536,870,912 bytes
≈ 512 MB

At batch size 4, that is 2 GB, half of the 4 GB that the same model’s weights occupy in INT4. At a 32K context with batch size 4, it is 16 GB: four times the INT4 weights, and about as much as the full FP16 weights.

The key insight: KV cache scales linearly with both batch size and context length. Increasing either has immediate memory consequences that are invisible from the model card.
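The formula translates directly into code. The sketch below uses the Llama-3 8B shape quoted above; the function and variable names are illustrative. Results are exact byte counts, shown in binary units (the article's "512 MB" is 512 MiB).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_seq_len, batch_size, bytes_per_element):
    """KV cache size in bytes; the leading 2 accounts for keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bytes_per_element)


# Llama-3 8B attention shape as given in the text.
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len, batch in [(4096, 1), (4096, 4), (32768, 4)]:
    size = kv_cache_bytes(**LLAMA3_8B, max_seq_len=seq_len,
                          batch_size=batch, bytes_per_element=2)
    print(f"ctx={seq_len:6d} batch={batch}: {size / 2**20:8.0f} MiB")
```

Doubling either `max_seq_len` or `batch_size` doubles the result, which is the linear scaling described above.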

On edge hardware with limited memory, KV cache is typically the binding constraint on maximum batch size and context length. Options for reducing it:

Grouped-query attention (GQA): reduces num_kv_heads (Llama-3 uses GQA, hence 8 KV heads instead of 32). Check whether your model uses MHA or GQA.
KV cache quantization: storing KV cache in INT8 instead of FP16 halves this cost with a small accuracy penalty.
Sliding window attention: limits the attended context window, capping KV cache size regardless of sequence length.
Streaming / paged KV cache: (vLLM-style) allocate KV cache in fixed pages, recycle across requests. Not always available on edge runtimes.
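To get a feel for how much each option buys, the comparison below applies the KV cache formula to the Llama-3 8B shape at a 4096-token context and batch size 1. It is a rough sketch: the 1024-token window is a hypothetical value chosen for illustration (Llama-3 does not use sliding window attention), and the INT8 figure ignores any scale factors a real quantized cache would store.

```python
def kv_cache_mib(num_kv_heads, bytes_per_element, cached_tokens,
                 num_layers=32, head_dim=128, batch_size=1):
    """KV cache size in MiB for the given number of cached tokens."""
    total = (2 * num_layers * num_kv_heads * head_dim
             * cached_tokens * batch_size * bytes_per_element)
    return total / 2**20


SEQ_LEN = 4096
WINDOW = 1024  # hypothetical sliding-window size, for illustration only

options = {
    "MHA, FP16 (32 KV heads)": kv_cache_mib(32, 2, SEQ_LEN),
    "GQA, FP16 (8 KV heads)": kv_cache_mib(8, 2, SEQ_LEN),
    "GQA, INT8 KV cache": kv_cache_mib(8, 1, SEQ_LEN),
    # A sliding window caps the cached tokens at the window size.
    "GQA, FP16, sliding window": kv_cache_mib(8, 2, min(SEQ_LEN, WINDOW)),
}

for name, mib in options.items():
    print(f"{name:28s} {mib:7.0f} MiB")
```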

5. Activation memory

During the forward pass, each layer holds its input and intermediate activations only until the next layer has consumed them. Unlike training, inference has no backward pass, so nothing needs to be kept for gradients. The footprint is therefore set by the largest activation tensor in the model.

For a transformer layer, the dominant activation is the attention score matrix:

attention_scores_bytes =
    batch_size × num_heads × seq_len × seq_len × bytes_per_element

For batch size 1, 32 heads, 4096 tokens, FP16:

1 × 32 × 4096 × 4096 × 2 = 1,073,741,824 bytes ≈ 1 GB

This is per-layer, but frameworks typically do not hold all layers in memory simultaneously — they process one layer at a time and reuse the buffer. So the peak activation memory is roughly one layer’s worth, not num_layers × one layer.

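However, this still means 1 GB of activation memory for the attention scores alone when a 4096-token prompt is processed in one pass at batch size 1. (During token-by-token decoding each step has a single query row, so the score tensor is only 1 × seq_len per head; the quadratic cost is a prefill cost.) At 32K tokens the figure is 64 GB, which is why long-context inference at the edge requires either flash attention (which computes attention in tiles and never stores the full score matrix) or very aggressive context limits.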
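The quadratic growth is easy to check numerically. A minimal sketch of the attention score formula, with an illustrative function name and the same 32-head, FP16 configuration as above:

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element):
    """Size of a fully materialised seq_len x seq_len score tensor, in bytes."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for seq_len in (4096, 8192, 32768):
    size = attention_scores_bytes(1, 32, seq_len, 2)
    print(f"seq_len={seq_len:6d}: {size / 2**30:5.0f} GiB")
```

An 8× longer context costs 64× the memory, because `seq_len` appears twice in the product.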

Flash attention rewrites the attention computation to avoid materialising the full attention matrix, reducing peak activation memory from O(seq_len²) to O(seq_len). It is now standard in most edge-capable runtimes (llama.cpp, MLC, ExecuTorch). If your runtime supports it, always enable it.
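The idea that makes this possible is the online softmax: scores are processed one block at a time, and a running maximum and running sum let earlier partial results be rescaled as new blocks arrive. The pure-Python sketch below illustrates that idea for a single query vector. It is not the FlashAttention kernel (which is a fused GPU kernel tiling over both queries and keys), and all function names are illustrative.

```python
import math


def _score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference: materialise every score, then softmax, then weighted sum."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def streaming_attention(q, keys, values, block_size=2):
    """Same result, but at most `block_size` scores exist at any time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 0.25]
keys = [[1, 0, 2], [0, 1, -1], [2, 2, 0], [-1, 0.5, 0.5], [0.3, 0.3, 0.3]]
values = [[1, 0], [0, 1], [1, 1], [2, -1], [0.5, 0.5]]
print(naive_attention(q, keys, values))
print(streaming_attention(q, keys, values))
```

The two functions agree up to floating-point rounding, while the streaming version's temporary storage depends on `block_size` rather than on the sequence length.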

6. Framework and runtime overhead

Every inference runtime carries a baseline memory cost before any model is loaded:

Runtime	Typical baseline	Notes
llama.cpp	~50 MB	Minimal; plain C/C++ with few dependencies
ONNX Runtime	100–300 MB	Depends on enabled execution providers
TensorRT-LLM	200–500 MB	Plugin libraries, CUDA graphs
PyTorch (eager)	500 MB – 1 GB	CUDA context, ATen ops, Python runtime
JetPack (Jetson)	400–600 MB	CUDA driver, multimedia stack

On a Jetson Orin NX 16 GB, the JetPack stack and OS together consume roughly 1.5–2 GB before your process starts. On the stock image this is effectively a fixed cost: it is the price of the full Linux userspace that ships with JetPack. A minimal image can trim it but not remove it.

7. Peak memory spikes

The numbers above are steady-state. Peak memory during a forward pass can be 20–40% higher than steady-state due to:

Temporary buffers for intermediate computations (softmax normalisation, layer norm)
Memory fragmentation in CUDA’s allocator — freed tensors leave gaps that are not immediately reusable
Graph compilation: TensorRT and torch.compile allocate large scratch buffers during the first few inferences (“warmup”)

Always add 25–30% headroom above your calculated steady-state estimate when sizing hardware.

8. Worked example: Llama-3 8B on Jetson Orin NX 16 GB

Model: Llama-3 8B (8 billion parameters)
Quantization: INT4 weights, FP16 KV cache and activations
Context: 4096 tokens
Batch size: 1

Weights (INT4):          4.0 GB  (8B params × 0.5 bytes)
KV cache (FP16, 4096t):  0.5 GB
Peak activations (FA):   ~0.1 GB  (flash attention, no full matrix)
ONNX Runtime / TRT-LLM:  0.4 GB
OS + JetPack:            1.8 GB
────────────────────────────────
Steady-state total:      6.8 GB
+ 25% headroom:          8.5 GB

An 8 GB device (Orin NX 8 GB) would be marginal: the 6.8 GB steady-state total fits, but the 8.5 GB figure with headroom does not. The 16 GB variant is comfortable at this configuration. To run at batch size 4:

KV cache (batch=4):      2.0 GB  (↑ 1.5 GB from batch=1)
Adjusted total:          8.3 GB steady-state → ~10.4 GB with headroom

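Batch size 4 fits on 16 GB with more than 5 GB to spare. Doubling the context to 8192 tokens at batch size 4 raises the KV cache to 4 GB; holding the other lines constant, the total becomes about 10.3 GB steady-state and roughly 12.9 GB with headroom, which still fits. At 16,384 tokens the KV cache alone is 8 GB and the total with headroom reaches about 17.9 GB, over the device limit.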
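The arithmetic in this worked example can be kept in a small script so that it is easy to re-run with different inputs. The sketch below plugs in the per-component figures from the tables above; the `Budget` class and its method names are illustrative, and the 25% headroom is the figure used in the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Budget:
    """Per-component memory estimate in GB."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    runtime_gb: float
    os_gb: float

    def steady_state_gb(self) -> float:
        return (self.weights_gb + self.kv_cache_gb + self.activations_gb
                + self.runtime_gb + self.os_gb)

    def with_headroom_gb(self, headroom: float = 0.25) -> float:
        return self.steady_state_gb() * (1.0 + headroom)

    def fits(self, device_gb: float, headroom: float = 0.25) -> bool:
        return self.with_headroom_gb(headroom) <= device_gb


# Figures from the worked example: INT4 weights, FP16 KV cache, 4096 tokens.
batch1 = Budget(weights_gb=4.0, kv_cache_gb=0.5, activations_gb=0.1,
                runtime_gb=0.4, os_gb=1.8)
# Batch size 4 changes only the KV cache line (0.5 GB -> 2.0 GB).
batch4 = replace(batch1, kv_cache_gb=2.0)

for name, budget in (("batch=1", batch1), ("batch=4", batch4)):
    print(f"{name}: {budget.steady_state_gb():.1f} GB steady-state, "
          f"{budget.with_headroom_gb():.1f} GB with headroom, "
          f"fits 8 GB: {budget.fits(8)}, fits 16 GB: {budget.fits(16)}")
```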

9. A practical sizing checklist

Before ordering hardware or committing to a deployment target, run through these calculations:

Weights: num_params × bytes_per_param at your target precision
KV cache: 2 × layers × kv_heads × head_dim × max_ctx × batch_size × bytes — calculate for your maximum batch and context
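Activations: Use the attention score formula for the worst case. If flash attention is available, the seq_len² term drops out and peak activations scale with seq_len instead (about 0.1 GB in the worked example, versus 1 GB without it)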
Framework baseline: check the runtime you plan to use; benchmark on the actual device
OS footprint: measure on the actual OS image, not a generic Linux estimate
Peak headroom: add 25–30% to the total
Validate on device: run inference with your target batch size and context length, then profile with tegrastats (Jetson) or nvidia-smi (discrete GPU) to compare against your estimate
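For the last step, a script can also read the system-wide figure straight from `/proc/meminfo` while inference is running and compare it with the estimate. This is a sketch under assumptions: on a unified-memory device such as a Jetson, GPU allocations should come out of the same pool and therefore show up in the system totals, but confirm this against `tegrastats` on your own image. The sample text and its numbers are made up for illustration, and the function names are hypothetical.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def used_gb(meminfo: dict) -> float:
    """Memory in use, in decimal GB (meminfo 'kB' are units of 1024 bytes)."""
    return (meminfo["MemTotal"] - meminfo["MemAvailable"]) * 1024 / 1e9


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read the live values on a Linux device."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Illustrative sample only; on the device, call read_meminfo() instead.
SAMPLE = """MemTotal:       15655000 kB
MemFree:         5100000 kB
MemAvailable:    8200000 kB
"""

estimate_gb = 6.8  # steady-state estimate from the worked example
measured_gb = used_gb(parse_meminfo(SAMPLE))
print(f"estimated {estimate_gb:.1f} GB, measured {measured_gb:.2f} GB, "
      f"ratio {measured_gb / estimate_gb:.2f}")
```

If the measured figure is consistently above the estimate, the gap is the number to carry into the headroom line of the budget.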

10. Summary

Component	Scales with	Controllable?
Weights	Model size, precision	Yes: quantization
KV cache	Batch size × context length × layers	Yes: GQA, KV quant, context limit
Activations	Batch size × seq_len² (or seq_len with FA)	Yes: flash attention
Framework overhead	Runtime choice	Partially: choose a lightweight runtime
OS footprint	Target OS image	Partially: minimal images help
Peak spike	Fragmentation, warmup	Add 25–30% headroom

The model card gives you one number. The memory budget has six: the five consumers plus headroom for peak spikes. Sizing edge inference hardware without calculating all six leads to one of the most common failures in edge AI deployments: a model that fits in theory and OOMs in practice.