
Running Inference at the Edge: The Memory Budget
1. Quick-look, “in a nutshell”
Your quantized model is listed at 4 GB on the model card. You load it onto an NVIDIA Jetson Orin with 8 GB of unified memory. It OOMs at 3.2 GB loaded. You reduce batch size to 1. Still OOM. You switch to INT4 weights. Finally it runs — but the throughput is half of what you expected.
The memory budget for edge inference is not the model size. The model size is the floor. Above it sit the KV cache, activation buffers, framework overhead, OS footprint, and peak memory spikes during decoding. None of these appear on the model card, and the KV cache and activation buffers can each exceed the model’s own weight footprint.
This article builds a complete picture of where memory goes during inference, derives the formulas for each component, and applies them to a concrete example: Llama-3 8B on a Jetson Orin NX 16 GB.
2. The five memory consumers
Total memory = weights + KV cache + activations + framework overhead + OS/runtime
Most discussions stop at weights. The rest of the budget is equally important at the edge, where you are often trying to fit an inference workload into a fixed LPDDR5 or unified memory pool shared with the CPU and display subsystem.
3. Weights
The one everyone calculates:
weight_memory = num_parameters × bytes_per_parameter
| Precision | Bytes per parameter | 7B model |
|---|---|---|
| FP32 | 4 | 28 GB |
| FP16 / BF16 | 2 | 14 GB |
| INT8 | 1 | 7 GB |
| INT4 / MXFP4 | 0.5 | 3.5 GB |
For a 7B model in INT4, the weight floor is 3.5 GB. Everything else is additive.
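The table rows follow directly from the formula. As a minimal sketch in Python (the dictionary and function names are illustrative, not taken from any library), the weight floor can be computed as below. Treat the result as a lower bound: real quantized formats usually store per-group scales alongside the weights, so an INT4 file tends to be somewhat larger than 0.5 bytes per parameter suggests.

```python
# Bytes per parameter for the precisions listed in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_bytes(num_parameters: float, precision: str) -> float:
    """Weight floor in bytes: num_parameters x bytes_per_parameter."""
    return num_parameters * BYTES_PER_PARAM[precision]


# Reproduce the 7B column of the table (decimal GB, as the table uses).
for precision in ("fp32", "fp16", "int8", "int4"):
    gb = weight_memory_bytes(7e9, precision) / 1e9
    print(f"{precision:>5}: {gb:.1f} GB")
```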
4. KV cache
The KV cache stores the keys and values computed for each token in the attention layers, so they do not need to be recomputed on subsequent decode steps. It is the largest single source of surprise in LLM memory budgets.
kv_cache_bytes =
2 (key + value)
× num_layers
× num_kv_heads
× head_dim
× max_sequence_length
× batch_size
× bytes_per_element
For a Llama-3 8B model (32 layers, 8 KV heads, head dim 128) running at FP16, with a 4096-token context and batch size 1:
2 × 32 × 8 × 128 × 4096 × 1 × 2 bytes
= 2 × 32 × 8 × 128 × 4096 × 2
= 536,870,912 bytes
≈ 512 MB
At batch size 4, that is 2 GB, half the 4 GB that an 8B model's INT4 weights occupy. At a 32K context with batch size 4, it is 16 GB, four times the size of those INT4 weights.
The key insight: KV cache scales linearly with both batch size and context length. Increasing either has immediate memory consequences that are invisible from the model card.
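The formula above is easy to wrap in a helper so that different batch sizes and context lengths can be compared quickly. This is a sketch only: the function name and the `LLAMA3_8B` dictionary are illustrative, and the layer, head, and dimension values are the ones quoted in this section.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   max_seq_len: int, batch_size: int,
                   bytes_per_element: int = 2) -> int:
    """KV cache size in bytes; the leading 2 accounts for keys plus values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * max_seq_len * batch_size * bytes_per_element)


# Shape values quoted in the text for Llama-3 8B.
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# The three configurations discussed above, at FP16 (2 bytes per element).
for ctx, batch in [(4096, 1), (4096, 4), (32768, 4)]:
    size = kv_cache_bytes(**LLAMA3_8B, max_seq_len=ctx, batch_size=batch)
    print(f"context={ctx:>5} batch={batch}: {size / 2**20:,.0f} MiB")
```

Doubling either `max_seq_len` or `batch_size` doubles the result, which is the linear scaling described above.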
On edge hardware with limited memory, KV cache is typically the binding constraint on maximum batch size and context length. Options for reducing it:
- Grouped-query attention (GQA): reduces `num_kv_heads` (Llama-3 uses GQA, hence 8 KV heads instead of 32). Check whether your model uses MHA or GQA.
- KV cache quantization: storing KV cache in INT8 instead of FP16 halves this cost with a small accuracy penalty.
- Sliding window attention: limits the attended context window, capping KV cache size regardless of sequence length.
- Streaming / paged KV cache: (vLLM-style) allocate KV cache in fixed pages, recycle across requests. Not always available on edge runtimes.
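To see how much the first two options buy, it can help to turn the formula around and ask how many tokens fit in a fixed KV budget. The sketch below does that under stated assumptions: the 1 GiB budget is a made-up figure, the 32-KV-head variant is a hypothetical MHA version of the same model shape, and the helper names are illustrative. "Tokens" here means batch size × context length.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache added by one cached token (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(budget_bytes: int, per_token_bytes: int) -> int:
    """Total tokens (batch_size x context) that fit in the budget."""
    return budget_bytes // per_token_bytes


BUDGET = 1 * 2**30  # hypothetical 1 GiB reserved for the KV cache

variants = {
    "MHA, FP16 KV (32 KV heads, hypothetical)": kv_bytes_per_token(32, 32, 128, 2),
    "GQA, FP16 KV (8 KV heads)": kv_bytes_per_token(32, 8, 128, 2),
    "GQA, INT8 KV (8 KV heads)": kv_bytes_per_token(32, 8, 128, 1),
}

for name, per_token in variants.items():
    tokens = max_cached_tokens(BUDGET, per_token)
    print(f"{name}: {per_token // 1024} KiB/token, {tokens} tokens in budget")
```

A sliding window works differently: it caps the token count itself, so the cache stops growing once the window is full, whatever the per-token cost.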
5. Activation memory
During the forward pass, each layer holds its input and intermediate activations only until the next layer has consumed them. Nothing needs to be kept for a backward pass, because you do not backpropagate during inference. The footprint therefore depends on the largest activation tensor in the model.
For a transformer layer, the dominant activation is the attention score matrix:
attention_scores_bytes =
batch_size × num_heads × seq_len × seq_len × bytes_per_element
For batch size 1, 32 heads, 4096 tokens, FP16:
1 × 32 × 4096 × 4096 × 2 = 1,073,741,824 bytes ≈ 1 GB
This is per-layer, but frameworks typically do not hold all layers in memory simultaneously — they process one layer at a time and reuse the buffer. So the peak activation memory is roughly one layer’s worth, not num_layers × one layer.
However, this still means 1 GB of activation memory for the attention scores alone, at a 4096-token context with batch size 1. At 32K tokens, that is 64 GB, which is why long-context inference at the edge requires either flash attention (which computes attention in tiles and never stores the full matrix) or very aggressive context limits.
Flash attention rewrites the attention computation to avoid materialising the full attention matrix, reducing peak activation memory from O(seq_len²) to O(seq_len). It is now standard in most edge-capable runtimes (llama.cpp, MLC, ExecuTorch). If your runtime supports it, always enable it.
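The quadratic growth is easy to check numerically. The sketch below implements the attention score formula from this section; the function name is illustrative. The optional `q_len` argument is an addition that is not part of the formula above: it assumes that during token-by-token decoding with a KV cache only one new query row is scored per step, so the full square matrix arises when the whole prompt is processed at once.

```python
from typing import Optional


def attention_scores_bytes(batch_size: int, num_heads: int, kv_len: int,
                           q_len: Optional[int] = None,
                           bytes_per_element: int = 2) -> int:
    """Size of a fully materialised attention score matrix for one layer.

    q_len defaults to kv_len, which gives the seq_len x seq_len case
    used in the text (no flash attention).
    """
    if q_len is None:
        q_len = kv_len
    return batch_size * num_heads * q_len * kv_len * bytes_per_element


# Full matrix at the two context lengths discussed above (32 heads, FP16).
for ctx in (4096, 32768):
    gib = attention_scores_bytes(1, 32, ctx) / 2**30
    print(f"context={ctx:>5}: {gib:.0f} GiB")

# Assumed decode-step case: one query token against a 4096-token cache.
decode = attention_scores_bytes(1, 32, 4096, q_len=1)
print(f"single decode step: {decode // 1024} KiB")
```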
6. Framework and runtime overhead
Every inference runtime carries a baseline memory cost before any model is loaded:
| Runtime | Typical baseline | Notes |
|---|---|---|
| llama.cpp | ~50 MB | Minimal; runs on bare metal |
| ONNX Runtime | 100–300 MB | Depends on enabled execution providers |
| TensorRT-LLM | 200–500 MB | Plugin libraries, CUDA graphs |
| PyTorch (eager) | 500 MB – 1 GB | CUDA context, ATen ops, Python runtime |
| JetPack (Jetson) | 400–600 MB | CUDA driver, multimedia stack |
On a Jetson Orin NX 16 GB, the JetPack stack and OS together consume roughly 1.5–2 GB before your process starts. This is non-negotiable — it is the cost of the full Linux userspace that ships with JetPack.
7. Peak memory spikes
The numbers above are steady-state. Peak memory during a forward pass can be 20–40% higher than steady-state due to:
- Temporary buffers for intermediate computations (softmax normalisation, layer norm)
- Memory fragmentation in CUDA’s allocator — freed tensors leave gaps that are not immediately reusable
- Graph compilation: TensorRT and torch.compile allocate large scratch buffers during the first few inferences (“warmup”)
Always add 25–30% headroom above your calculated steady-state estimate when sizing hardware.
8. Worked example: Llama-3 8B on Jetson Orin NX 16 GB
- Model: Llama-3 8B (8 billion parameters)
- Quantization: INT4 weights, FP16 KV cache and activations
- Context: 4096 tokens
- Batch size: 1
Weights (INT4): 4.0 GB (8B params × 0.5 bytes)
KV cache (FP16, 4096t): 0.5 GB
Peak activations (FA): ~0.1 GB (flash attention, no full matrix)
ONNX Runtime / TRT-LLM: 0.4 GB
OS + JetPack: 1.8 GB
────────────────────────────────
Steady-state total: 6.8 GB
+ 25% headroom: 8.5 GB
An 8 GB device (Orin NX 8 GB) would be marginal: the 6.8 GB steady state fits, but the 8.5 GB figure with headroom does not. The 16 GB variant is comfortable at this configuration. To run at batch size 4:
KV cache (batch=4): 2.0 GB (↑ 1.5 GB from batch=1)
Adjusted total: 8.3 GB steady-state → ~10.4 GB with headroom
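Batch size 4 fits on 16 GB with room to spare. Holding the other lines constant, doubling the context to 8192 tokens at batch size 4 doubles the KV cache to 4 GB, for about 10.3 GB steady-state and ~12.9 GB with headroom, which still fits. At a 16K context with batch size 4, the KV cache alone is 8 GB, and the total reaches about 14.3 GB steady-state and ~17.9 GB with headroom, which is over the device limit.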
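The two budget tables above can be reproduced with a small helper, which makes it easy to re-run the sum when one line changes. This is a sketch: the `MemoryBudget` class and its method names are illustrative, and the per-line figures are the estimates from the tables, not measurements.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemoryBudget:
    """Per-component estimates in GB, plus a fractional headroom for peaks."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    framework_gb: float
    os_gb: float
    headroom: float = 0.25

    def steady_state_gb(self) -> float:
        return (self.weights_gb + self.kv_cache_gb + self.activations_gb
                + self.framework_gb + self.os_gb)

    def with_headroom_gb(self) -> float:
        return self.steady_state_gb() * (1.0 + self.headroom)

    def fits(self, device_gb: float) -> bool:
        """True if the estimate including headroom fits in device memory."""
        return self.with_headroom_gb() <= device_gb


# Batch size 1, 4096-token context (first table).
base = MemoryBudget(weights_gb=4.0, kv_cache_gb=0.5, activations_gb=0.1,
                    framework_gb=0.4, os_gb=1.8)
# Batch size 4: only the KV cache line changes (second table).
batch4 = replace(base, kv_cache_gb=2.0)

for name, budget in [("batch=1", base), ("batch=4", batch4)]:
    print(f"{name}: {budget.steady_state_gb():.1f} GB steady-state, "
          f"{budget.with_headroom_gb():.1f} GB with headroom, "
          f"fits 8 GB: {budget.fits(8.0)}, fits 16 GB: {budget.fits(16.0)}")
```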
9. A practical sizing checklist
Before ordering hardware or committing to a deployment target, run through these calculations:
- Weights: `num_params × bytes_per_param` at your target precision
- KV cache: `2 × layers × kv_heads × head_dim × max_ctx × batch_size × bytes`, calculated for your maximum batch and context
- Activations: use the attention score formula for the worst case; if flash attention is available, the full seq_len × seq_len matrix is never materialised and the footprint drops from O(seq_len²) to O(seq_len)
- Framework baseline: check the runtime you plan to use; benchmark on the actual device
- OS footprint: measure on the actual OS image, not a generic Linux estimate
- Peak headroom: add 25–30% to the total
- Validate on device: run inference with your target batch size and context length, then profile with `tegrastats` (Jetson) or `nvidia-smi` (discrete GPU) to compare against your estimate
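For the OS footprint and on-device validation steps, one lightweight option on any Linux target is to read `/proc/meminfo` before and after loading the model. On a unified-memory device, system RAM is the same pool the GPU draws from, so this gives a rough cross-check alongside `tegrastats`. The sketch below is an assumption-laden illustration: the helper names are made up, and `SAMPLE` is invented data standing in for a real reading.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value}; sizes are in kB."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def used_gib(fields: dict) -> float:
    """Memory in use, taken as MemTotal minus MemAvailable, in GiB."""
    return (fields["MemTotal"] - fields["MemAvailable"]) / 2**20


# Invented sample; on a device use: text = open("/proc/meminfo").read()
SAMPLE = """MemTotal:       16000000 kB
MemFree:         5000000 kB
MemAvailable:    9000000 kB
"""

measured = used_gib(parse_meminfo(SAMPLE))
estimate = 6.8  # steady-state estimate from the worked example, in GB
print(f"measured {measured:.2f} GiB vs estimated {estimate:.1f} GB")
```

Taking one reading on the idle device and another while inference is running separates the OS footprint from the process's own usage.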
10. Summary
| Component | Scales with | Controllable? |
|---|---|---|
| Weights | Model size, precision | Yes: quantization |
| KV cache | Batch size × context length × layers | Yes: GQA, KV quant, context limit |
| Activations | Batch size × seq_len² (or seq_len with FA) | Yes: flash attention |
| Framework overhead | Runtime choice | Partially: choose a lightweight runtime |
| OS footprint | Target OS image | Partially: minimal images help |
| Peak spike | Fragmentation, warmup | Add 25–30% headroom |
The model card gives you one number. The memory budget has six: the five consumers above plus headroom for peak spikes. Sizing edge inference hardware without calculating all six leads to the most common failure in edge AI deployments: a model that fits in theory and OOMs in practice.
