Running Inference at the Edge: The Memory Budget

1. Quick-look, “in a nutshell”

Your quantized model is listed at 4 GB on the model card. You load it onto an NVIDIA Jetson Orin with 8 GB of unified memory. It OOMs at 3.2 GB loaded. You reduce batch size to 1. Still OOM. You switch to INT4 weights. Finally it runs — but the throughput is half of what you expected.

The memory budget for edge inference is not the model size. The model size is the floor. Above it sit the KV cache, activation buffers, framework overhead, OS footprint, and peak memory spikes during decoding. None of these appear on the model card, and the KV cache and activation buffers can each exceed the model’s own weight footprint at long contexts or large batch sizes.

This article builds a complete picture of where memory goes during inference, derives the formulas for each component, and applies them to a concrete example: Llama-3 8B on a Jetson Orin NX 16 GB.


2. The five memory consumers

Total memory = weights + KV cache + activations + framework overhead + OS/runtime

Most discussions stop at weights. The rest of the budget is equally important at the edge, where you are often trying to fit an inference workload into a fixed LPDDR5 or unified memory pool shared with the CPU and display subsystem.


3. Weights

The one everyone calculates:

weight_memory = num_parameters × bytes_per_parameter

Precision       Bytes per parameter   7B model
FP32            4                     28 GB
FP16 / BF16     2                     14 GB
INT8            1                     7 GB
INT4 / MXFP4    0.5                   3.5 GB

For a 7B model in INT4, the weight floor is 3.5 GB. Everything else is additive.
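
The weight formula is simple enough to script. The following is a minimal sketch in Python that reproduces the 7B column of the table; the function name and the precision keys are illustrative choices, and the result is reported in decimal GB (10^9 bytes) to match the table.

```python
# Bytes per parameter for each precision, as listed in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_bytes(num_parameters, precision):
    """weight_memory = num_parameters x bytes_per_parameter."""
    return num_parameters * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in ("fp32", "fp16", "int8", "int4"):
        gb = weight_memory_bytes(7e9, precision) / 1e9  # decimal GB
        print(f"{precision}: {gb:.1f} GB")
```

Real quantized checkpoints also store scales and other metadata, so treat the INT8 and INT4 figures as lower bounds rather than exact file sizes.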


4. KV cache

The KV cache stores the keys and values computed for each token in the attention layers, so they do not need to be recomputed on subsequent decode steps. It is the largest single source of surprise in LLM memory budgets.

kv_cache_bytes =
    2          (key + value)
  × num_layers
  × num_kv_heads
  × head_dim
  × max_sequence_length
  × batch_size
  × bytes_per_element

For a Llama-3 8B model (32 layers, 8 KV heads, head dim 128) running at FP16, with a 4096-token context and batch size 1:

2 × 32 × 8 × 128 × 4096 × 1 × 2 bytes
= 2 × 32 × 8 × 128 × 4096 × 2
= 536,870,912 bytes
≈ 512 MB

At batch size 4, that is 2 GB, half of the 4 GB that the same model’s weights occupy in INT4. At a 32K context with batch size 4, it is 16 GB, four times those INT4 weights.

The key insight: KV cache scales linearly with both batch size and context length. Increasing either has immediate memory consequences that are invisible from the model card.
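
To make the scaling concrete, here is a small Python sketch of the KV cache formula applied to the Llama-3 8B figures used above. The function and constant names are illustrative; sizes are in binary units (1 MiB = 2^20 bytes), which is what the "≈ 512 MB" figure corresponds to.

```python
MIB = 2**20
GIB = 2**30

# Llama-3 8B attention shape as quoted in the text.
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   max_sequence_length, batch_size=1, bytes_per_element=2):
    """2 (key + value) x layers x kv_heads x head_dim x seq x batch x bytes."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_sequence_length * batch_size * bytes_per_element)


if __name__ == "__main__":
    for ctx, batch in [(4096, 1), (4096, 4), (32768, 4)]:
        size = kv_cache_bytes(**LLAMA3_8B, max_sequence_length=ctx, batch_size=batch)
        print(f"context={ctx:>6} batch={batch}: {size / MIB:,.0f} MiB")
```

Doubling either the context or the batch size doubles the result, which is the linear scaling described above.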

On edge hardware with limited memory, KV cache is typically the binding constraint on maximum batch size and context length. Options for reducing it:

  • Grouped-query attention (GQA): reduces num_kv_heads (Llama-3 uses GQA, hence 8 KV heads instead of 32). Check whether your model uses MHA or GQA.
  • KV cache quantization: storing KV cache in INT8 instead of FP16 halves this cost with a small accuracy penalty.
  • Sliding window attention: limits the attended context window, capping KV cache size regardless of sequence length.
  • Streaming / paged KV cache (vLLM-style): allocate the KV cache in fixed-size pages and recycle them across requests. This reduces waste from over-reserved cache rather than the per-token cost. Not always available on edge runtimes.
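
The effect of the first three options can be estimated with the same formula. The sketch below is illustrative only: the 32-KV-head MHA variant and the 1024-token window are hypothetical values chosen for comparison, and INT8 storage is modelled as exactly 1 byte per element, ignoring any quantization scales a real runtime would store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_element, window=None):
    # With sliding-window attention only the last `window` tokens are cached.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return (2 * num_layers * num_kv_heads * head_dim
            * cached_tokens * batch_size * bytes_per_element)


# Shared shape: 32 layers, head dim 128, 4096-token context, batch size 1.
BASE = dict(num_layers=32, head_dim=128, seq_len=4096, batch_size=1)

VARIANTS = {
    "MHA, FP16 (hypothetical 32 KV heads)": dict(num_kv_heads=32, bytes_per_element=2),
    "GQA, FP16 (8 KV heads)": dict(num_kv_heads=8, bytes_per_element=2),
    "GQA, INT8 KV cache": dict(num_kv_heads=8, bytes_per_element=1),
    "GQA, FP16, 1024-token window (hypothetical)": dict(
        num_kv_heads=8, bytes_per_element=2, window=1024),
}


def compare():
    """Return the KV cache size in MiB for each variant."""
    return {name: kv_cache_bytes(**BASE, **opts) // 2**20
            for name, opts in VARIANTS.items()}


if __name__ == "__main__":
    for name, mib in compare().items():
        print(f"{name}: {mib} MiB")
```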

5. Activation memory

During the forward pass, each layer produces intermediate activations that must stay in memory until the next operation consumes them. Unlike training, inference has no backward pass, so nothing needs to be retained across layers. The footprint is therefore set by the largest activation tensor alive at any one time.

For a transformer layer, the dominant activation is the attention score matrix:

attention_scores_bytes =
    batch_size × num_heads × seq_len × seq_len × bytes_per_element

For batch size 1, 32 heads, 4096 tokens, FP16:

1 × 32 × 4096 × 4096 × 2 = 1,073,741,824 bytes ≈ 1 GB

This is per-layer, but frameworks typically do not hold all layers in memory simultaneously — they process one layer at a time and reuse the buffer. So the peak activation memory is roughly one layer’s worth, not num_layers × one layer.

However, this still means 1 GB of activation memory for the attention scores alone, at a 4096-token context with batch size 1. At 32K tokens, that is 64 GB, which is why long-context inference at the edge requires either flash attention (which computes attention in tiles and never stores the full score matrix) or very aggressive context limits.

Flash attention rewrites the attention computation to avoid materialising the full attention matrix, reducing peak activation memory from O(seq_len²) to O(seq_len). It is now standard in most edge-capable runtimes (llama.cpp, MLC, ExecuTorch). If your runtime supports it, always enable it.
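
The quadratic growth of the unfused attention score matrix is easy to see numerically. This is a minimal Python sketch of the formula above for the case where the full matrix is materialised; it does not model flash attention, whose working set depends on the runtime's tile sizes. The function name is illustrative.

```python
GIB = 2**30


def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    """batch x heads x seq_len x seq_len x bytes, for one layer."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for seq_len in (1024, 4096, 8192, 32768):
        size = attention_scores_bytes(batch_size=1, num_heads=32, seq_len=seq_len)
        print(f"seq_len={seq_len:>6}: {size / GIB:g} GiB")
```

Each doubling of the context length quadruples the result, so the 8x jump from 4096 to 32K tokens costs 64x the memory.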


6. Framework and runtime overhead

Every inference runtime carries a baseline memory cost before any model is loaded:

Runtime            Typical baseline   Notes
llama.cpp          ~50 MB             Minimal; runs on bare metal
ONNX Runtime       100–300 MB         Depends on enabled execution providers
TensorRT-LLM       200–500 MB         Plugin libraries, CUDA graphs
PyTorch (eager)    500 MB – 1 GB      CUDA context, ATen ops, Python runtime
JetPack (Jetson)   400–600 MB         CUDA driver, multimedia stack

On a Jetson Orin NX 16 GB, the JetPack stack and OS together consume roughly 1.5–2 GB before your process starts. This is non-negotiable — it is the cost of the full Linux userspace that ships with JetPack.


7. Peak memory spikes

The numbers above are steady-state. Peak memory during a forward pass can be 20–40% higher than steady-state due to:

  • Temporary buffers for intermediate computations (softmax normalisation, layer norm)
  • Memory fragmentation in CUDA’s allocator — freed tensors leave gaps that are not immediately reusable
  • Graph compilation: TensorRT and torch.compile allocate large scratch buffers during the first few inferences (“warmup”)

Always add a 25–30% headroom buffer above your calculated steady-state estimate when sizing hardware.
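
As a rough sketch, the headroom rule can be turned into a fit check. The helper names below are hypothetical, the 25% default is simply one value inside the recommended range, and the 6.8 GB input is an example steady-state estimate.

```python
def required_memory_gb(steady_state_gb, headroom_fraction=0.25):
    """Steady-state estimate plus a fractional headroom buffer."""
    return steady_state_gb * (1.0 + headroom_fraction)


def fits(steady_state_gb, device_gb, headroom_fraction=0.25):
    """True if the estimate plus headroom fits in the device's memory."""
    return required_memory_gb(steady_state_gb, headroom_fraction) <= device_gb


if __name__ == "__main__":
    estimate = 6.8  # example steady-state estimate in GB
    for device_gb in (8, 16):
        verdict = "fits" if fits(estimate, device_gb) else "does not fit"
        print(f"{estimate} GB steady-state on a {device_gb} GB device: {verdict}")
```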


8. Worked example: Llama-3 8B on Jetson Orin NX 16 GB

Model: Llama-3 8B (8 billion parameters)
Quantization: INT4 weights, FP16 KV cache and activations
Context: 4096 tokens
Batch size: 1

Weights (INT4):          4.0 GB  (8B params × 0.5 bytes)
KV cache (FP16, 4096t):  0.5 GB
Peak activations (FA):   ~0.1 GB  (flash attention, no full matrix)
ONNX Runtime / TRT-LLM:  0.4 GB
OS + JetPack:            1.8 GB
────────────────────────────────
Steady-state total:      6.8 GB
+ 25% headroom:          8.5 GB

An 8 GB device (Orin NX 8 GB) would be marginal. The 16 GB variant is comfortable at this configuration. To run at batch size 4:

KV cache (batch=4):      2.0 GB  (↑ 1.5 GB from batch=1)
Adjusted total:          8.3 GB steady-state → ~10.4 GB with headroom

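Batch size 4 fits on 16 GB with room to spare. Holding the other components fixed, increasing context to 8192 tokens with batch size 4 doubles the KV cache to 4 GB, for about 10.3 GB steady-state and roughly 12.9 GB with headroom, which still fits. At 16K tokens with batch size 4, the KV cache reaches 8 GB and the total with headroom is about 17.9 GB, over the device limit.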
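
The tables above can be reproduced with a short script, which makes it easy to try other batch sizes and context lengths. This is a sketch under the worked example's own assumptions: the fixed components are copied from the table, activations are held at 0.1 GB regardless of batch size (a simplification), and the KV cache is computed in units of 2^30 bytes and added to the other GB figures as the article does. All names are illustrative.

```python
# Fixed components from the worked example, in GB.
FIXED_GB = {
    "weights_int4": 4.0,
    "activations_flash_attention": 0.1,
    "runtime": 0.4,
    "os_jetpack": 1.8,
}


def kv_cache_gb(context, batch_size, num_layers=32, num_kv_heads=8,
                head_dim=128, bytes_per_element=2):
    """KV cache size for the Llama-3 8B shape, in units of 2**30 bytes."""
    total = (2 * num_layers * num_kv_heads * head_dim
             * context * batch_size * bytes_per_element)
    return total / 2**30


def budget_gb(context, batch_size, headroom=0.25):
    """Return (steady_state, with_headroom) in GB."""
    steady = sum(FIXED_GB.values()) + kv_cache_gb(context, batch_size)
    return steady, steady * (1.0 + headroom)


if __name__ == "__main__":
    for context, batch in [(4096, 1), (4096, 4)]:
        steady, peak = budget_gb(context, batch)
        print(f"context={context} batch={batch}: "
              f"{steady:.1f} GB steady-state, {peak:.1f} GB with headroom")
```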


9. A practical sizing checklist

Before ordering hardware or committing to a deployment target, run through these calculations:

  1. Weights: num_params × bytes_per_param at your target precision
  2. KV cache: 2 × layers × kv_heads × head_dim × max_ctx × batch_size × bytes — calculate for your maximum batch and context
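  3. Activations: Use the attention score formula when the full score matrix is materialised; with flash attention the seq_len² term disappears and peak activations grow roughly linearly with seq_len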
  4. Framework baseline: check the runtime you plan to use; benchmark on the actual device
  5. OS footprint: measure on the actual OS image, not a generic Linux estimate
  6. Peak headroom: add 25–30% to the total
  7. Validate on device: run inference with your target batch size and context length, then profile with tegrastats (Jetson) or nvidia-smi (discrete GPU) to compare against your estimate
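
For step 5, one simple way to measure the OS footprint is to read /proc/meminfo on the idle device before starting the inference process. The sketch below assumes a standard Linux /proc/meminfo with MemTotal and MemAvailable fields; the function names are hypothetical, and it complements tegrastats rather than replacing it.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    return fields


def used_gib(fields):
    """Memory already in use: (MemTotal - MemAvailable), converted to GiB."""
    used_kb = fields["MemTotal"] - fields["MemAvailable"]
    return used_kb / (1024 * 1024)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        fields = parse_meminfo(meminfo.read_text())
        print(f"In use before inference starts: {used_gib(fields):.2f} GiB")
    else:
        print("/proc/meminfo not found; run this on the Linux target device")
```

Run it once on the freshly booted target image and once with the runtime loaded but no model, and the difference gives a rough framework baseline for step 4.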

10. Summary

Component            Scales with                                  Controllable?
Weights              Model size, precision                        Yes: quantization
KV cache             Batch size × context length × layers         Yes: GQA, KV quant, context limit
Activations          Batch size × seq_len² (or seq_len with FA)   Yes: flash attention
Framework overhead   Runtime choice                               Partially: choose a lightweight runtime
OS footprint         Target OS image                              Partially: minimal images help
Peak spike           Fragmentation, warmup                        Add 25–30% headroom

The model card gives you one number. The memory budget has six. Sizing edge inference hardware without calculating all six leads to the most common failure in edge AI deployments: a model that fits in theory and OOMs in practice.