AI Development Environment

Linting, Formatting, and Type Checking for AI Codebases with Ruff and ty

Usama Nawaz · 7 min read · AI Engineer's Field Guide — Part 8

The most insidious bugs in AI codebases are not the ones that crash your application. They are the ones where an LLM returns a response with a slightly different structure than you expected, your code silently handles it by falling through to a default case, and the user gets a plausible but wrong answer. No error in the logs. No exception in the trace. Just a quiet failure that erodes trust.

Type hints and static analysis are your first line of defense against this class of bug. When you declare that a function returns list[Document] and your LLM parsing code actually returns Optional[list[Document]], a type checker flags the mismatch before it ever reaches production. When your Pydantic model expects a confidence_score: float but the API sometimes returns it as a string, the linter catches the missing validation.

In 2026, the Python static analysis landscape has consolidated around Astral's toolchain. Ruff v0.15 (released February 2026 with the 2026 style guide) replaces Flake8, Black, isort, pydocstyle, and pyupgrade in a single tool that runs 10 to 100 times faster. Astral's ty (the new type checker, currently in preview) is positioned as an alternative to mypy and Pyright, also written in Rust for performance. Together with pre-commit hooks, they form a quality gate that catches bugs before they enter your repository.

Why AI Codebases Need Stricter Static Analysis

AI codebases have characteristics that make them more vulnerable to the bugs that static analysis catches. LLM responses are inherently unpredictable. Even with structured output modes and JSON schemas, edge cases slip through: unexpected null values, missing fields, additional fields you did not request, and type coercions that silently change meaning.

The pattern is consistent across every AI project I have worked on. A developer writes a function that processes an LLM response. They test it with the ten examples they have at hand. It works perfectly. In production, the 1,000th response has a slightly different structure (maybe the model added a preamble, maybe a field name has different casing), and the parsing code produces a wrong result instead of raising an error.

Type hints with strict checking catch this at development time. When you annotate your parsing function with precise return types and your Pydantic models with required fields and validators, the type checker verifies that every code path handles the types correctly. Branches that could produce None where a caller expects a value are flagged. Dict accesses without .get() defaults on potentially missing keys are caught.

Ruff: One Tool to Replace Them All

Ruff v0.15 (the current release as of March 2026) implements over 800 built-in rules, covering not just style and formatting but logical errors, security issues, and Python anti-patterns. The 2026 style guide update introduced block suppression comments (# ruff: disable[RULE] / # ruff: enable[RULE]), which are cleaner than the old # noqa inline comments.

For AI codebases, the most valuable Ruff rule categories are:

F (Pyflakes): Catches undefined names, unused imports, and unused variables. In AI code where you frequently experiment with different LangChain imports, unused imports accumulate fast.

E/W (pycodestyle): Enforces consistent style. When three developers each format their prompt templates differently, pull request diffs become unreadable.

I (isort): Organizes imports consistently. AI files often have 15 or more imports from different packages. Consistent ordering makes it immediately clear what dependencies a module uses.

S (Bandit): Catches security issues. Particularly relevant for AI codebases where developers sometimes use eval() to parse LLM-generated code or subprocess to run generated commands.

UP (pyupgrade): Suggests modern Python syntax. Replaces old-style string formatting with f-strings, updates type annotations to modern syntax, and removes compatibility shims for Python versions you no longer support.

The configuration lives in pyproject.toml, keeping everything in one place:

[tool.ruff]
target-version = "py312"
line-length = 100  # Slightly wider for AI code with long chain definitions
 
[tool.ruff.lint]
select = ["E", "F", "I", "S", "UP", "W", "B", "SIM", "RUF"]
ignore = ["E501"]  # Line length handled by formatter
 
[tool.ruff.format]
quote-style = "double"
indent-style = "space"

Type Checking for LLM Response Handling

Type checking in AI codebases serves a different purpose than in traditional applications. The primary value is not catching typos or wrong argument orders (though it does that too). The primary value is forcing you to handle the uncertainty inherent in LLM outputs.

Consider a function that extracts structured data from an LLM response. Without type checking, you might write code that assumes the response always contains a results key with a list of dictionaries. With strict type checking, you are forced to handle the cases where results is missing, where it is an empty list, or where the dictionaries have unexpected shapes. Each of these cases maps to a real production scenario that your users will encounter.
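A hedged sketch of that extraction function, using only the standard library (the `results` key and error messages are illustrative). Each `raise` corresponds to one of the production scenarios above; the precise return type means a type checker can verify that no code path leaks an unchecked shape:

```python
import json
from typing import Any


def extract_results(raw: str) -> list[dict[str, Any]]:
    """Extract the 'results' list from a raw LLM response string.

    Every failure mode -- invalid JSON, a missing 'results' key, a
    non-list value, non-dict items -- raises ValueError instead of
    silently returning a plausible but wrong answer.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("top-level JSON value is not an object")
    results = payload.get("results")
    if results is None:
        raise ValueError("missing 'results' key")
    if not isinstance(results, list):
        raise ValueError("'results' is not a list")
    for i, item in enumerate(results):
        if not isinstance(item, dict):
            raise ValueError(f"results[{i}] is not an object")
    return results
```

An empty list is deliberately allowed through: "the model found nothing" is a valid answer, while "the response does not have the shape we asked for" is an error.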

Astral's ty type checker (currently in preview) is the newest entrant in this space. Written in Rust like Ruff, it promises significant speed improvements over mypy while maintaining compatibility with the existing type hint ecosystem. For teams already using the Astral toolchain (uv for environments, Ruff for linting and formatting), ty completes the picture with a consistent, fast developer experience.

For teams not ready to adopt ty, mypy with the --strict flag (or at minimum --disallow-untyped-defs) remains the established choice. Pyright (which powers Pylance in VS Code) offers the fastest feedback loop in the editor. The choice matters less than the commitment to running a type checker at all.
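For the mypy route, a minimal `pyproject.toml` sketch might look like the following. The options shown are real mypy settings; the `langchain.*` override is an illustrative example of silencing missing stubs for a third-party AI SDK, not a recommendation for any specific package:

```toml
[tool.mypy]
python_version = "3.12"
strict = true            # enables disallow_untyped_defs and friends
warn_return_any = true   # flag functions that leak Any from untyped SDKs

# Many AI SDKs ship without type stubs; silence only those modules,
# not the whole codebase.
[[tool.mypy.overrides]]
module = ["langchain.*"]
ignore_missing_imports = true
```

Scoping `ignore_missing_imports` to specific modules keeps strictness intact for your own code, which is where the LLM-handling bugs live.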

Pre-Commit Hooks: The Automated Quality Gate

Pre-commit hooks run automatically before every commit, catching issues before they enter the repository. For AI codebases, the essential hook configuration includes:

Ruff check and format: Running both the linter and formatter ensures every commit meets code quality standards. Ruff's speed (sub-second for most projects) means developers do not skip hooks out of impatience.

Type checking on changed files: Running the type checker on every commit can be slow for large codebases. The practical pattern is to run it on only the changed files in pre-commit and run the full check in CI.

Secret detection: The detect-secrets hook scans for patterns that look like API keys. For AI projects with multiple provider keys, this is a non-negotiable safety net.

Notebook output stripping: If your project includes Jupyter notebooks, nbstripout removes cell outputs to prevent data leakage and keep diffs clean.

The pre-commit configuration lives in .pre-commit-config.yaml at the project root. When a new developer clones the repository and runs pre-commit install, they immediately have the same quality gates as the rest of the team. No manual setup. No "I forgot to run the linter" in code reviews.
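A sketch of what that `.pre-commit-config.yaml` might contain, covering the four hooks above. The repository URLs are the commonly used community hooks for these tools; the `rev` values are placeholders you should pin to the releases you actually use:

```yaml
# .pre-commit-config.yaml -- a sketch; pin rev values to real releases.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.15.0          # placeholder
    hooks:
      - id: ruff          # lint, with autofix
        args: [--fix]
      - id: ruff-format   # format
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0           # placeholder
    hooks:
      - id: detect-secrets
  - repo: https://github.com/kynan/nbstripout
    rev: 0.8.1            # placeholder
    hooks:
      - id: nbstripout    # strip notebook outputs before commit
```

Per-commit type checking on changed files is typically added as a local hook or a CI job rather than a published hook, since the right invocation depends on whether you run ty, mypy, or Pyright.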

What Breaks Without Static Analysis

The pattern repeats across every AI project that skips static analysis. The codebase starts clean because it is small. As it grows, formatting diverges between developers. Import orders become random. Type hints are omitted "for speed." Unused imports accumulate. Security issues (hardcoded paths, eval() calls, broad exception handlers) hide in utility functions.

By the time someone introduces a static analysis tool six months later, it reports hundreds of violations. The team either fixes them all in a massive, risky refactoring PR, or they disable most rules and lose the benefit. Neither outcome is good.

The fix is simple: configure Ruff, enable type checking, and set up pre-commit hooks in the first hour of the project. The tool runs in milliseconds. The cost is zero. The value compounds with every commit.

Key Takeaways

Ruff v0.15 has consolidated the Python linting and formatting landscape into a single, blazingly fast tool that replaces five or more separate tools. For AI codebases, its security rules (Bandit) and modern syntax enforcement (pyupgrade) are particularly valuable. Type checking with ty, mypy, or Pyright catches the class of bugs most dangerous in AI applications: silent failures in LLM response handling. Pre-commit hooks automate the quality gate so developers cannot bypass it. Configure all three on day one. The sub-second runtime means there is no performance excuse, and the bugs they catch are the ones that reach production undetected.

Version note: This guide covers Ruff v0.15.5 (March 2026) and ty (Astral, preview). Both tools are actively developed. Always check the official Astral documentation for the latest features and rule sets.


Follow Usama Nawaz for weekly deep dives on building production-grade AI systems.

Usama Nawaz

AI Engineer

AI Engineer with 5+ years building production AI/ML systems — from multi-agent architectures and RAG pipelines to document intelligence and data platforms.

Ruff Python linter · Python type checking AI · ty type checker · pre-commit hooks Python · AI codebase quality