
Docker for AI Engineers: From Dockerfile to Production Containers

Usama Nawaz · 6 min read · AI Engineer's Field Guide — Part 6

The phrase "it works on my machine" has killed more AI projects than bad prompts ever will. Your RAG pipeline runs perfectly on your MacBook with Python 3.12, a locally installed FAISS library, and environment variables sourced from your shell profile. Your colleague clones the repo on their Windows machine, installs a different FAISS version, and gets retrieval results that are subtly wrong. The staging server uses Python 3.11, and the embedding dimensions silently mismatch. Nobody notices until a customer reports garbage answers.

Containerization is the bridge between "works on my machine" and "works in production." Docker packages your application, its dependencies, its runtime, and its configuration into a single, reproducible artifact that runs identically everywhere. For AI services with heavy native dependencies (PyTorch, FAISS, OpenCV, Tesseract), this is not a convenience. It is a requirement.

Docker Engine v29, the latest stable release as of 2026, defaults to the containerd image store, improving ecosystem alignment with Kubernetes and enabling faster innovation in image layer handling. Docker Desktop 4.61 introduced Gordon, an AI agent purpose-built for Docker that can debug container issues, generate Dockerfiles, and execute fixes with approval.

Writing Dockerfiles for Python AI Services

AI service Dockerfiles differ from standard Python application Dockerfiles in three important ways: the images are larger (ML libraries are heavy), the build time is longer (compiling native extensions), and GPU support adds configuration complexity. Multi-stage builds address all three.

The pattern separates the build stage (where you install dependencies and compile native extensions) from the runtime stage (which contains only the final application and its installed packages). This can reduce image size by 40 to 60 percent, because build tools like gcc, cmake, and development headers are not included in the production image.

Here is a production Dockerfile for a FastAPI service that serves a RAG pipeline:

# ============================================================
# Stage 1: Build stage with full toolchain
# ============================================================
FROM python:3.12-slim AS builder
 
WORKDIR /app
 
# Install build dependencies for native ML packages
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
 
# Install uv for faster dependency resolution
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
 
# Copy dependency files first (cache layer optimization)
COPY pyproject.toml uv.lock ./
 
# Install dependencies into a virtual environment
# Using --frozen ensures the lock file is respected exactly
RUN uv sync --frozen --no-dev --no-install-project
 
# Copy application code
COPY src/ src/
 
# ============================================================
# Stage 2: Production runtime (minimal)
# ============================================================
FROM python:3.12-slim AS runtime
 
WORKDIR /app
 
# Copy only the virtual environment and application code
COPY --from=builder /app/.venv /app/.venv
COPY --from=builder /app/src /app/src
 
# Set the virtual environment PATH
ENV PATH="/app/.venv/bin:$PATH"
 
# Non-root user for security
RUN useradd --create-home appuser
USER appuser
 
# Health check for orchestrator integration
# (httpx must be a runtime dependency for this probe to work)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8000/health', timeout=5).raise_for_status()" || exit 1
 
EXPOSE 8000
 
# Use uvicorn directly (no gunicorn needed for async FastAPI)
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

The key decisions in this Dockerfile are intentional. The uv sync --frozen command respects the lock file exactly, ensuring deterministic builds. The --no-dev flag excludes development dependencies. The dependency installation happens before the application code copy, so Docker's layer cache skips reinstallation when only code changes. Running as a non-root user limits the damage an attacker can do if the container is compromised. The health check endpoint allows Kubernetes and Cloud Run to verify the service is ready for traffic — and note that it must raise on a non-200 response, or the orchestrator will keep routing traffic to a failing service.
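To see the layer-cache benefit in practice, build the image twice — once cold, once after touching only application code — and compare the stages. The image name rag-api here is illustrative:

```shell
# First build: resolves and installs all dependencies (slow)
docker build -t rag-api .

# Edit a file under src/, then rebuild. Only the "COPY src/ src/"
# layer and everything after it re-runs; the uv sync layer is
# served from cache, so the rebuild takes seconds
docker build -t rag-api .

# Build just the builder stage to compare sizes and see what the
# multi-stage split leaves out of the production image
docker build --target builder -t rag-api:builder .
docker images rag-api
```

The second build finishing almost instantly is the cache optimization paying off; if it re-runs dependency installation, the COPY ordering in the Dockerfile is wrong.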


GPU-Aware Containers for ML Workloads

For services that require GPU access (model inference with PyTorch, local LLM hosting with vLLM, or embedding generation with sentence-transformers), the base image and runtime configuration change significantly.

NVIDIA provides base images (nvidia/cuda:12.x-runtime-ubuntu22.04) that include the CUDA runtime libraries. Your Dockerfile inherits from this image instead of python:3.12-slim, and you install Python and your dependencies on top. The host machine needs the NVIDIA Container Toolkit installed, and your docker run command (or docker-compose.yml) needs the --gpus flag.
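A minimal sketch of that inheritance pattern follows. The CUDA tag, package names, and versions are illustrative, not a drop-in recipe — match the CUDA version to your PyTorch wheel, and note that Ubuntu 22.04 ships Python 3.10 by default, so pin a newer interpreter yourself if your service requires it:

```dockerfile
# Sketch: GPU runtime image (tag is illustrative; check NVIDIA's
# registry for current CUDA versions)
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

WORKDIR /app

# Install Python on top of the CUDA base image
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# Install the CUDA-enabled PyTorch wheel that matches the base
# image's CUDA runtime version
RUN pip3 install --no-cache-dir -r requirements.txt

COPY src/ src/

CMD ["python3", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

With the NVIDIA Container Toolkit installed on the host, run it with docker run --gpus all -p 8000:8000 my-gpu-api. In Compose, the equivalent is a deploy.resources.reservations.devices entry with capabilities: [gpu].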

Docker Desktop 4.61 introduced Docker Model Runner with vLLM Metal support, allowing developers to run LLM inference locally on Apple Silicon. For teams developing on Macs and deploying to NVIDIA-equipped servers, this means you can test locally with CPU inference and deploy with GPU acceleration using the same application code, switching only the Docker configuration.

Docker Compose for Local Multi-Service Development

Production AI services rarely run in isolation. A typical local development environment includes the FastAPI application, a PostgreSQL database with pgvector, a Redis cache for LLM response caching, and possibly a ChromaDB instance for vector storage. Docker Compose orchestrates all of these with a single docker compose up command.

services:
  api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    env_file:
      - .env
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./src:/app/src  # Hot reload for development
 
  db:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: ai_service
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 5s
      timeout: 3s
      retries: 5
 
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
 
volumes:
  pgdata:

The pgvector/pgvector:pg16 image provides PostgreSQL 16 with the pgvector extension pre-installed, giving you relational data and vector search in one database. The health check on the database ensures the API service does not start until PostgreSQL is ready to accept connections. The volume mount on ./src enables hot-reloading during development without rebuilding the container.
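Bringing the stack up and verifying the vector extension is a two-command check (service names match the Compose file above; DB_USER comes from your .env file):

```shell
# Build and start everything in the background; the api service
# waits for db to pass its healthcheck before starting
docker compose up --build -d

# Confirm pgvector is available inside the running database
docker compose exec db psql -U "$DB_USER" -d ai_service \
    -c "CREATE EXTENSION IF NOT EXISTS vector; SELECT extversion FROM pg_extension WHERE extname = 'vector';"
```

If the second command prints a version number, vector columns and similarity search are ready to use from the API service.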

What Breaks Without Containerization

The failure modes are predictable. Native library version mismatches between developer machines and production servers. Python version differences that change string handling, async behavior, or type system features. Missing system dependencies that were installed manually on one machine but never documented. Environment variable differences that produce silent behavior changes.

Docker eliminates all of these by making the environment part of the artifact. When you deploy a Docker image, you are deploying the exact environment that passed your tests. There is no "works on my machine" because your machine's environment is irrelevant. Only the container's environment matters.

Key Takeaways

Docker is the bridge between development and production for AI services. Multi-stage builds keep images lean while supporting the heavy native dependencies that ML libraries require. GPU-aware containers with NVIDIA base images extend this to inference workloads. Docker Compose provides the local multi-service environment that mirrors production architecture. Docker Engine v29's containerd default and Docker Desktop's Gordon AI agent represent the maturation of container tooling. Every AI service you build should be containerized from day one, not as an afterthought before deployment.

Version note: This guide covers Docker Engine v29 and Docker Desktop 4.61+ (February 2026). The Docker ecosystem evolves rapidly. Always check the official Docker documentation for the latest base images and runtime features.

