Edge-Ai

2026

Writing a Minimal Inference Server in Go Jun 10

Python is the default language for inference servers, and for good reason: PyTorch, HuggingFace, and most ML tooling are Python-first. But if the rest of your stack is Go, you end up with a Python sidecar just to call model.forward(). That sidecar needs its own container, its own health checks, its own deployment pipeline, and its own debugging story.

MXFP4 Quantization Jan 11

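MXFP4 quantization is a microscaling, 4-bit floating-point compression scheme: small blocks of values share a single power-of-two scale, and each value is stored as a 4-bit float. It is designed to shrink the memory footprint of deep-learning models while keeping accuracy loss small.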
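To make "microscaling" concrete, here is a minimal sketch in Go of quantizing one block. It assumes the layout described in the OCP Microscaling (MX) formats specification: a block shares one power-of-two scale, and each element is an FP4 (E2M1) value whose magnitude is one of 0, 0.5, 1, 1.5, 2, 3, 4, or 6. The names `quantizeBlock` and `nearestE2M1` are hypothetical, the demo block has 4 elements rather than the spec's 32, and ties are broken toward the smaller magnitude for simplicity rather than round-to-nearest-even.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1Values lists the non-negative magnitudes representable in FP4 (E2M1).
var e2m1Values = []float64{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// nearestE2M1 returns the representable magnitude closest to x (x >= 0).
// Values above 6 saturate to 6. Ties go to the smaller magnitude, which is
// a simplification of this sketch.
func nearestE2M1(x float64) float64 {
	best := e2m1Values[0]
	for _, c := range e2m1Values[1:] {
		if math.Abs(x-c) < math.Abs(x-best) {
			best = c
		}
	}
	return best
}

// quantizeBlock picks one shared power-of-two exponent for the block and
// returns it together with the dequantized values (element * 2^exp).
func quantizeBlock(block []float64) (int, []float64) {
	maxAbs := 0.0
	for _, v := range block {
		maxAbs = math.Max(maxAbs, math.Abs(v))
	}
	out := make([]float64, len(block))
	if maxAbs == 0 {
		return 0, out
	}
	// The largest E2M1 value is 6 = 1.5 * 2^2, so subtract 2 to place the
	// block maximum near the top of the element range.
	exp := math.Ilogb(maxAbs) - 2
	scale := math.Ldexp(1, exp)
	for i, v := range block {
		q := nearestE2M1(math.Abs(v) / scale)
		out[i] = math.Copysign(q, v) * scale
	}
	return exp, out
}

func main() {
	// A real MX block holds 32 elements; 4 keeps the demo readable.
	exp, out := quantizeBlock([]float64{0.1, -0.7, 3.0, 6.5})
	fmt.Println(exp, out) // prints: 0 [0 -0.5 3 6]
}
```

In the demo, 0.1 rounds to 0 and 6.5 saturates to 6: the shared scale is set by the largest value in the block, so small values next to a large one lose the most precision.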

2025

INT8 Quantization: What the Numbers Actually Mean May 20

You run a model in INT8 and it is 30% faster than FP32. You run the same model in INT8 with a different framework — same hardware, same batch size — and it is 2% faster. You run it again with a different calibration dataset and the accuracy drops by 4 points. Quantization is not one thing.

Running Inference at the Edge: The Memory Budget May 3

Your quantized model is listed at 4 GB on the model card. You load it onto an NVIDIA Jetson Orin with 8 GB of unified memory. It OOMs with only 3.2 GB loaded. You reduce batch size to 1. Still OOM. You switch to INT4 weights. Finally it runs, but the throughput is half of what you expected.