Blog

This section shares practical insights from my work with AI, development, and tooling. You'll find articles on implementing RAG systems, deploying LLMs, applying optimization techniques, and lessons learned from building production systems.

Each post documents a specific problem solved or technique discovered—written as if teaching a colleague who will benefit from the solution.

Run LLMs Anywhere as a Single File with Docker

The Problem

Deploying large language models (LLMs) on local machines can be a hassle: Python environments, CUDA drivers, model downloads, and platform-specific quirks all get in the way. As a developer who often switches between machines and operating systems (Ubuntu, Windows, and macOS), I needed something portable, something that would run the same way everywhere.

The Solution: A Single Dockerfile

The answer lies in combining Mozilla's llamafile with Docker. As the llamafile project describes it: "Each llamafile contains both server code and model weights, making the deployment of an LLM as easy as downloading and executing a single file. It also leverages the popular llama.cpp project for fast model inference." Wrapping that in Docker gives us a universally portable container that runs the API server with a single command.
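As a rough sketch of the idea, a Dockerfile along these lines can fetch a llamafile at build time and start its built-in HTTP server. The `MODEL_URL` here is a placeholder, not a real download link; substitute the URL of any llamafile release you want to run, and note that the server flags follow llama.cpp's server options, so check them against the llamafile version you download:

```dockerfile
# Minimal sketch: slim Debian base, curl to fetch a llamafile at build time.
FROM debian:bookworm-slim

# Placeholder URL -- point this at an actual llamafile release.
ARG MODEL_URL=https://example.com/model.llamafile

RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Download the llamafile and make it executable.
RUN curl -fL -o /model.llamafile "$MODEL_URL" \
    && chmod +x /model.llamafile

EXPOSE 8080

# Bind to 0.0.0.0 so the server is reachable from outside the container.
CMD ["sh", "-c", "/model.llamafile --server --host 0.0.0.0 --port 8080"]
```

With that in place, the whole deployment really is one build and one run:

```shell
docker build -t llamafile-server .
docker run -p 8080:8080 llamafile-server
```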