Inference

2026

Writing a Minimal Inference Server in Go Jun 10

Python is the default language for inference servers, and for good reason: PyTorch, Hugging Face, and most ML tooling are Python-first. But if the rest of your stack is Go, you end up with a Python sidecar just to call model.forward(). That sidecar needs its own container, its own health checks, its own deployment pipeline, and its own debugging story.

2025

Running Inference at the Edge: The Memory Budget May 3

Your quantized model is listed at 4 GB on the model card. You load it onto an NVIDIA Jetson Orin with 8 GB of unified memory. It OOMs with only 3.2 GB loaded. You reduce batch size to 1. Still OOM. You switch to INT4 weights. Finally it runs, but the throughput is half of what you expected.