Understanding LLM APIs and Model Families: A Production Guide for April 2026
It is April 17, 2026. Yesterday, Anthropic released Claude Opus 4.7 with a new tokenizer, a high-resolution vision pipeline, and task budgets for agentic loops. Last month, OpenAI shipped GPT-5.4. Three months ago, Meta previewed Muse Spark as a closed-weights successor to the open Llama line. The LLM landscape has changed more in the past six months than in the previous two years combined.
If you integrated an LLM into your product eighteen months ago and have not revisited the decision since, you are almost certainly using the wrong model for your workload today. This is not your fault. This is the field.
This post is the practitioner's map. It walks through the four model families that actually matter for production AI engineering right now, maps each to the workloads it serves well, and shows you the first piece of production integration code you should write on any new LLM project: a provider abstraction that lets you swap vendors without rewriting your application.
This is Section 2.1 of the AI Engineer's Field Guide. If you followed Chapter 1, you have a development environment that is ready to ship production AI. Now we start using it.
Why Model Selection Is an Architectural Decision
Most teams treat model selection like a shopping decision. They pick one model, wire it into their codebase, and only revisit the choice when something breaks or a bill arrives that makes someone unhappy. This framing is wrong and the consequences compound.
Consider what a model choice actually commits you to. It commits you to a specific tokenizer, which means your token counts and cost estimates are tied to that vendor's counting rules. It commits you to a specific context window, which shapes how much conversational history or document content you can feed per request. It commits you to a specific tool calling schema, a specific streaming protocol, a specific set of safety filters, a specific pricing curve, and a specific rate limit pool. Change any of these and you are migrating, not just reconfiguring.
Worse, the tradeoffs are workload-dependent. A model that is cheap and fast for short classification tasks may be prohibitively expensive for long-document summarization because its output pricing is high. A model with a 1M context window is wasted on 400-token customer support prompts. A model that excels on coding benchmarks may underperform on creative writing. You cannot pick one model that is optimal for all your production workloads, which means your architecture needs to accommodate routing between multiple models from day one.
This is the lens we use in the rest of this book. Models are infrastructure components, not products you shop for. You pick the one that fits a workload, measure its behavior, and swap it when something better appears. The provider abstraction we build later in this post is how you buy that flexibility cheaply.

The Production-Relevant LLM Landscape in April 2026
Four families of models matter for production AI engineering in April 2026. Anthropic's Claude 4.x family for complex reasoning and agentic work, OpenAI's GPT-5.4 family for multimodal and broad knowledge work, Meta's Llama 4 family for self-hosted and compliance-sensitive deployments, and the cloud-native offerings on AWS Bedrock, Google Vertex AI, and Microsoft Foundry that wrap these same models with enterprise guarantees. Everything else is either a variant, a wrapper, or a specialized tool.
Anthropic's Claude Family
Claude Opus 4.7 shipped on April 16, 2026 as Anthropic's most capable generally available model. The API identifier is claude-opus-4-7. Pricing is unchanged from Opus 4.6 at five dollars per million input tokens and twenty-five dollars per million output tokens. The context window remains one million tokens at flat standard pricing, with up to 128K output tokens per response. The underlying model now interprets prompts more literally, verifies its own outputs before returning, and accepts images up to 3.75 megapixels. That is roughly three times the resolution ceiling of Opus 4.6.
Two things about Opus 4.7 need to register with you as an engineer. First, the new tokenizer can produce up to thirty-five percent more tokens for the same input text compared to Opus 4.6. Your rate card did not change, but your effective cost per request can climb noticeably if your workloads are verbose. Second, several sampling parameters that worked in Opus 4.6, including explicit thinking budgets and certain temperature controls, have been removed or changed. Migration is not a drop-in replacement. Replay a representative traffic sample through 4.7 and measure token counts and response quality before cutting over.
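To make the tokenizer point concrete, here is a back-of-envelope sketch of how a larger tokenizer footprint moves effective cost even when the rate card is unchanged. The 35 percent figure and the $5/MTok rate come from the text above; in practice you would replay real traffic and read token counts off the API's usage fields rather than assume a flat growth factor.

```python
# Illustrative only: same rate card, larger tokenizer footprint, higher cost.
def effective_input_cost(
    tokens: int, rate_per_mtok: float, tokenizer_growth: float = 0.0
) -> float:
    """Dollar cost of one request's input, with an optional tokenizer growth factor."""
    return tokens * (1.0 + tokenizer_growth) * rate_per_mtok / 1_000_000

old = effective_input_cost(10_000, 5.00)        # a 10K-token prompt on the old tokenizer
new = effective_input_cost(10_000, 5.00, 0.35)  # the same text at +35% tokens
print(f"per-request input cost: ${old:.4f} -> ${new:.4f}")
```

The delta looks small per request; multiplied across millions of requests per month it is the difference between a flat bill and a surprise one.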
Claude Sonnet 4.6 at claude-sonnet-4-6 remains the recommended default for the majority of production workloads. At three dollars per million input tokens and fifteen per million output, it is forty percent cheaper than Opus on both sides. It supports the same one million token context window at flat pricing. Unless you have measured a meaningful quality gap on your specific workload, Sonnet 4.6 is where you start. It is also where most production traffic should live even in organizations that do have Opus-tier workloads, because not every request in a system needs the flagship.
Claude Haiku 4.5 at claude-haiku-4-5 handles the high-volume, latency-sensitive end of your portfolio. One dollar per million input tokens, five per million output, and a 200K context window. Classification, routing, extraction, summarization, moderation. If a task can be done in under a thousand output tokens and does not need Opus-tier reasoning, Haiku is almost always the right answer. A well-architected production system often sends 60 to 70 percent of its requests to Haiku, 20 to 30 percent to Sonnet, and reserves the remainder for Opus on the hard problems. That routing strategy alone can cut total LLM spend by more than half versus a lazy all-Sonnet deployment.
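The routing math above is easy to sanity-check yourself. The sketch below compares a 70/20/10 Haiku/Sonnet/Opus split against all-Sonnet using the rates from this section; the per-request token counts are illustrative assumptions, and the real savings depend heavily on which requests you can push down-tier.

```python
# (input $/MTok, output $/MTok) from the April 2026 rate card in this post.
RATES = {
    "haiku": (1.00, 5.00),
    "sonnet": (3.00, 15.00),
    "opus": (5.00, 25.00),
}

def cost_per_request(model: str, in_tok: int, out_tok: int) -> float:
    in_rate, out_rate = RATES[model]
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

def blended_cost(split: dict[str, float], in_tok: int = 1_000, out_tok: int = 500) -> float:
    """Average cost per request for a traffic split (shares must sum to 1)."""
    return sum(share * cost_per_request(m, in_tok, out_tok) for m, share in split.items())

routed = blended_cost({"haiku": 0.7, "sonnet": 0.2, "opus": 0.1})
all_sonnet = blended_cost({"sonnet": 1.0})
print(f"routed: ${routed:.5f}/req  all-sonnet: ${all_sonnet:.5f}/req")
```

With identical token profiles on every tier this already cuts spend by 40 percent; in real systems the Haiku-bound requests also tend to be the shortest, which is where the bigger savings come from.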
OpenAI's GPT-5.4 Family
OpenAI shipped GPT-5.4 in early March 2026 as the successor to the GPT-5.3 Codex line. Base pricing lands at $2.50 per million input tokens and $15 per million output, with GPT-5.4-pro at six times that rate for deep reasoning workloads. GPT-5.4-mini at $0.25 per million input tokens and $2 per million output is the current price-performance sweet spot for lightweight production tasks. GPT-5.4-nano at $0.05 per million input tokens and $0.40 per million output pushes cost even lower for classification or extraction at scale.
The GPT-5.4 family ships with cached input pricing at ten percent of the base rate automatically. Any repeated prompt prefix becomes dramatically cheaper after the first call. No API flag, no code change. If your application sends the same system prompt, few-shot examples, or document prefix with every request, you hit this discount without trying. Most production chat applications see 30 to 50 percent cache-hit rates from their system prompt alone, which effectively halves input cost for the cached portion.
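A quick sketch of what that caching discount does to your effective input rate, assuming cached tokens bill at 10 percent of base as described above. The cache-hit fractions are the illustrative 30 to 50 percent range from the text, not measured values.

```python
def effective_input_rate(base_rate: float, cache_hit_fraction: float) -> float:
    """Blended $/MTok when some fraction of input tokens hit the prefix cache
    and bill at 10% of the base rate."""
    return base_rate * (1 - cache_hit_fraction) + base_rate * 0.10 * cache_hit_fraction

for hit in (0.0, 0.3, 0.5):
    print(f"cache hit {hit:.0%}: effective ${effective_input_rate(2.50, hit):.3f}/MTok")
```

At a 50 percent hit rate the effective input rate drops from $2.50 to $1.375 per million tokens, which is why long static system prompts are cheaper than they look on the rate card.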
The Responses API has replaced the legacy Chat Completions API as the primary interface. Context windows top out at 1.05 million tokens on GPT-5.4 and GPT-5.4-pro, with prompts above 272K tokens triggering a 2x multiplier on input and a 1.5x multiplier on output for the entire session. This is a hidden cost of long-context GPT workloads that the headline rate card does not show.
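Because the multiplier applies once the threshold is crossed, the cost curve is discontinuous, which is worth modeling before you ship long-context workloads. The sketch below applies the 2x/1.5x multipliers from the text to the whole request above 272K input tokens; treat it as an illustration of the shape, not a billing-accurate calculator.

```python
def long_context_cost(
    in_tok: int,
    out_tok: int,
    in_rate: float = 2.50,
    out_rate: float = 15.00,
    threshold: int = 272_000,
) -> float:
    """Request cost in dollars, with the long-context surcharge applied to the
    entire request once the input exceeds the threshold."""
    if in_tok > threshold:
        in_rate, out_rate = in_rate * 2.0, out_rate * 1.5
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

print(f"100K-in request: ${long_context_cost(100_000, 2_000):.3f}")
print(f"300K-in request: ${long_context_cost(300_000, 2_000):.3f}")
```

Note the jump: a 300K-token prompt costs roughly twice what the headline rates suggest, which is the hidden cost the paragraph above warns about.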
Where GPT-5.4 shines: multimodal reasoning, code execution through the Responses API, and workloads that benefit from OpenAI's aggressive input caching. Where it does not shine: long-horizon autonomous agent work, where Claude Opus 4.7 measurably outperforms, and any workload with strict data residency requirements that OpenAI's regional endpoints charge a ten percent premium to accommodate.

Meta's Llama 4 Family
Llama 4 Maverick and Llama 4 Scout are the open-weight options you reach for when you cannot, or will not, send data to a third-party API. Scout is a seventeen billion active parameter model with sixteen experts and a ten million token context window that fits on a single H100 GPU with INT4 quantization. Maverick is a seventeen billion active parameter model with 128 experts and roughly 400 billion parameters total, requiring a full H100 host. Both are natively multimodal for text and image input, and both were released in April 2025.
The pricing picture for Llama 4 is different because you run it yourself. Compute costs depend on your GPU strategy, your quantization choices, and your batch sizes. Meta's own estimate lands Maverick at around 19 cents per million tokens for distributed inference, climbing to 30 to 49 cents on a single host. You pay in engineering complexity, GPU quota, and operational burden rather than per-token fees. For high-volume workloads where the operational cost is spread across billions of tokens per month, this math can favor Llama. For low-volume workloads, the managed APIs almost always win.
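A rough break-even calculation makes the self-hosting decision concrete. All numbers in the example below are assumptions for illustration; plug in your own GPU quotes and blended API rates.

```python
def breakeven_mtok_per_month(
    monthly_fixed_cost: float,
    api_rate_per_mtok: float,
    selfhost_marginal_rate_per_mtok: float,
) -> float:
    """Millions of tokens per month at which self-hosting's fixed cost is
    covered by the per-token saving versus a managed API."""
    saving_per_mtok = api_rate_per_mtok - selfhost_marginal_rate_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # self-hosting never pays off at these rates
    return monthly_fixed_cost / saving_per_mtok

# Hypothetical: $20K/month of GPUs and ops, $3/MTok blended API rate,
# $0.30/MTok marginal self-host cost (in the range of Meta's estimates above).
print(f"break-even: {breakeven_mtok_per_month(20_000, 3.00, 0.30):,.0f} MTok/month")
```

Under those assumptions you need on the order of 7,400 million tokens, roughly 7.4 billion tokens per month, before self-hosting wins. That is why the managed APIs almost always win at low volume.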
One caveat worth flagging. Earlier this year, Meta announced Muse Spark as a closed-weights successor from Meta Superintelligence Labs, and the open-weight future of the Llama line is no longer guaranteed. Zuckerberg mentioned plans to release "increasingly advanced models that push the frontier of intelligence and capabilities, including new open source models," but provided no timeline and no specific commitment about which models or when. If you are building on Llama 4 today, you are building on a snapshot, not a roadmap. Plan accordingly.
Claude on AWS Bedrock, Vertex AI, and Microsoft Foundry
Claude Opus 4.7 became available on Amazon Bedrock on April 16, 2026 alongside the direct Anthropic API release. The Bedrock model ID is anthropic.claude-opus-4-7-v1:0. Default quotas start at 10M tokens per minute on bedrock-mantle and 15M on bedrock-runtime for each supported region. Google Vertex AI publishes the same model as claude-opus-4-7@20260416. Microsoft Foundry lists it in the Azure AI Foundry catalog.
Bedrock is where you route when your data must not leave AWS, when you need enterprise contractual guarantees, or when your existing IAM story is too valuable to bypass. The per-token cost is the same as direct Anthropic API access, though the cloud providers may add a margin for enterprise features. Global endpoints route dynamically across regions for availability. Regional endpoints pin a workload to one region at a ten percent premium.
For teams that already have their AWS or GCP security posture locked down, routing Claude through Bedrock or Vertex is often simpler than setting up a parallel vendor relationship with Anthropic. For teams without that existing investment, direct API access is both cheaper and faster to get working.
The Landscape at a Glance
Before we write any code, absorb the shape of the field as of April 2026.
| Model | Provider | Input $/MTok | Output $/MTok | Context | Best For |
|---|---|---|---|---|---|
| Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 1M | Long-horizon agents, complex coding |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 1M | Default production workload |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | High-volume classification, routing |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1.05M | Multimodal, broad knowledge work |
| GPT-5.4-mini | OpenAI | $0.25 | $2.00 | 400K | Low-cost Q&A at scale |
| GPT-5.4-nano | OpenAI | $0.05 | $0.40 | 400K | Extraction, classification |
| Llama 4 Maverick | Meta (OSS) | Self-host | Self-host | 1M | Self-hosted, compliance-sensitive |
| Claude on Bedrock | AWS | $5.00* | $25.00* | 1M | AWS-resident enterprise data |
*Bedrock adds cloud margin in some configurations; direct Anthropic API is typically the cheapest route for Opus 4.7. Self-hosted Llama 4 pricing depends entirely on your GPU strategy and batch sizes.
The Four Axes of Model Selection
Every production model choice is a walk across four axes: quality, cost, latency, and operational fit. Treat them separately.
Quality
Quality is not a scalar. A model can be excellent at code reasoning and mediocre at creative writing. Opus 4.7 currently posts 87.6 percent on SWE-bench Verified, 64.3 percent on SWE-bench Pro, and 70 percent on CursorBench. GPT-5.4 leads on certain multimodal benchmarks. None of these numbers tell you how well a model will do on your specific task.
Build a workload-specific evaluation set the moment you commit to a model, and rerun it on every provider you consider. Twenty representative prompts with known-good answers is enough to start. The point is not benchmark supremacy. The point is measured evidence that a model does your job well enough. I have seen teams spend weeks debating which model scores better on SWE-bench, only to discover their actual workload is customer-support classification where the benchmark leader is 2x more expensive for zero measurable quality gain.
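A minimal eval loop of the kind described above fits in a dozen lines. The `complete` callable below stands in for any provider's completion function, and the substring check is a deliberately crude scorer you would replace with task-appropriate scoring; the stub model in the usage example is obviously hypothetical.

```python
from typing import Callable

def run_eval(complete: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose output contains the expected answer."""
    passed = 0
    for prompt, expected in cases:
        output = complete(prompt)
        if expected.lower() in output.lower():
            passed += 1
    return passed / len(cases)

# Usage with a stub model; in practice, pass lambda p: provider.complete(p).text.
cases = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]
score = run_eval(lambda p: "The answer is 4" if "2+2" in p else "Paris.", cases)
print(f"pass rate: {score:.0%}")
```

The value is not the harness; it is that every model you consider gets scored against the same twenty prompts, so the comparison is apples to apples.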
Cost
Cost has three components the rate card does not show directly: tokens per request, caching effectiveness, and output verbosity. A model at fifty percent of the headline rate can end up more expensive if its default output is twice as long. A model with a thirty-five percent larger tokenizer footprint spends thirty-five percent more per request even at identical listed prices. Always compute your real cost on a representative sample, and always factor in prompt caching and batch discounts, which can cumulatively cut effective costs by 75 to 90 percent on well-architected workloads.
The single biggest cost lever most teams ignore is routing. A 70/20/10 split between Haiku, Sonnet, and Opus instead of all-Sonnet cuts total Claude spend by more than half on typical production traffic. Most requests do not need Opus-tier reasoning. Fighting that instinct is where cost discipline begins.
Latency
Claude Haiku 4.5 returns first tokens within a few hundred milliseconds. Claude Opus 4.7 can take several seconds on complex reasoning workloads with higher effort levels. GPT-5.4-nano is built for speed. This matters less than teams expect for asynchronous workloads and matters much more than teams expect for user-facing chat UX. If your users are watching a streaming response arrive, a two-second first-token latency feels sluggish no matter how good the final output is. Measure time-to-first-token, not just total latency.
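Measuring time-to-first-token is mechanical once you are consuming a streaming response. The sketch below works against any iterable of text chunks, so it applies to SDK streaming responses and to fakes in tests alike; the demo iterator is a stand-in, not a real API call.

```python
import time
from typing import Iterable

def measure_ttft(stream: Iterable[str]) -> tuple[float, str]:
    """Consume a token stream; return (time-to-first-token in ms, full text).

    Accepts any iterable of text chunks, e.g. the chunks yielded by a
    provider SDK's streaming response."""
    start = time.perf_counter()
    ttft_ms: float | None = None
    chunks: list[str] = []
    for chunk in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)
    return ttft_ms if ttft_ms is not None else 0.0, "".join(chunks)

# Demo against an in-memory iterator; swap in a real streaming response.
ttft_ms, text = measure_ttft(iter(["Hello, ", "world."]))
print(f"ttft={ttft_ms:.1f}ms text={text!r}")
```

Log the first element of that tuple alongside total latency and you can tell the difference between a model that is slow to start and one that is slow throughout, which demand different fixes.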
Operational Fit
Operational fit covers the boring stuff that kills weekends. Does your compliance team allow data to leave us-east-1? Does your vendor offer PII redaction at the API layer or do you have to build it yourself? Does the provider's outage history match your SLA promises to customers? Is the SDK maintained in the languages your team actually uses? Will the model identifier you wire in today still work in eighteen months? These questions are not exciting, but they determine whether your AI feature ships and stays shipped.

The Production Pattern: A Provider Abstraction
Here is the first production code we will write together. This is a provider abstraction that normalizes Claude and GPT-5.4 behind a single interface, so the rest of your application never branches on which vendor is behind a given call. The goal is not to write the ultimate LLM framework. LangChain and LlamaIndex do that. The goal is to show the shape of the minimum viable contract every production LLM integration needs: typed responses, structured logging, real error handling, and token accounting by default.
```python
# providers/base.py
from __future__ import annotations

import abc
import logging
from dataclasses import dataclass, field
from typing import Any

logger = logging.getLogger(__name__)


@dataclass
class LLMResponse:
    """Normalized response across providers. Callers never touch raw SDK objects."""

    text: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    stop_reason: str
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


class LLMProvider(abc.ABC):
    """Every vendor implements this contract so callers never branch on provider."""

    name: str
    default_model: str

    @abc.abstractmethod
    def complete(
        self,
        prompt: str,
        system: str | None = None,
        model: str | None = None,
        max_tokens: int = 1024,
        temperature: float = 0.2,
    ) -> LLMResponse:
        ...
```

```python
# providers/anthropic_provider.py
from __future__ import annotations

import logging
import os
import time
from typing import Any

import anthropic
from anthropic import APIConnectionError, APIError, RateLimitError

from providers.base import LLMProvider, LLMResponse

logger = logging.getLogger(__name__)


class ClaudeProvider(LLMProvider):
    """Claude Opus 4.7 as the default. Sonnet 4.6 for most production workloads."""

    name = "anthropic"
    default_model = "claude-opus-4-7"  # Released April 16, 2026

    def __init__(self) -> None:
        # Never hardcode API keys. Reads ANTHROPIC_API_KEY from the environment.
        self._client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    def complete(
        self,
        prompt: str,
        system: str | None = None,
        model: str | None = None,
        max_tokens: int = 1024,
        temperature: float = 0.2,
    ) -> LLMResponse:
        start = time.perf_counter()
        try:
            kwargs: dict[str, Any] = {
                "model": model or self.default_model,
                "max_tokens": max_tokens,
                "temperature": temperature,
                "messages": [{"role": "user", "content": prompt}],
            }
            if system is not None:
                kwargs["system"] = system
            msg = self._client.messages.create(**kwargs)
        except RateLimitError:
            # Caller layer decides whether to back off, queue, or fall back.
            logger.warning("anthropic_rate_limited", extra={"model": model})
            raise
        except (APIConnectionError, APIError) as e:
            logger.error("anthropic_api_error", extra={"err": str(e)}, exc_info=True)
            raise
        latency = int((time.perf_counter() - start) * 1000)
        text = "".join(block.text for block in msg.content if block.type == "text")
        return LLMResponse(
            text=text,
            model=msg.model,
            provider=self.name,
            input_tokens=msg.usage.input_tokens,
            output_tokens=msg.usage.output_tokens,
            latency_ms=latency,
            stop_reason=msg.stop_reason or "end_turn",
            metadata={"id": msg.id},
        )
```

```python
# providers/openai_provider.py
from __future__ import annotations

import logging
import os
import time
from typing import Any

from openai import OpenAI, APIError as OAIAPIError, RateLimitError as OAIRateLimitError

from providers.base import LLMProvider, LLMResponse

logger = logging.getLogger(__name__)


class OpenAIProvider(LLMProvider):
    """GPT-5.4 family through the Responses API (the current primary)."""

    name = "openai"
    default_model = "gpt-5.4"

    def __init__(self) -> None:
        # Never hardcode API keys. Reads OPENAI_API_KEY from the environment.
        self._client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def complete(
        self,
        prompt: str,
        system: str | None = None,
        model: str | None = None,
        max_tokens: int = 1024,
        temperature: float = 0.2,
    ) -> LLMResponse:
        start = time.perf_counter()
        inputs: list[dict[str, Any]] = []
        if system:
            inputs.append({"role": "system", "content": system})
        inputs.append({"role": "user", "content": prompt})
        try:
            resp = self._client.responses.create(
                model=model or self.default_model,
                input=inputs,
                max_output_tokens=max_tokens,
                temperature=temperature,
            )
        except OAIRateLimitError:
            logger.warning("openai_rate_limited", extra={"model": model})
            raise
        except OAIAPIError as e:
            logger.error("openai_api_error", extra={"err": str(e)}, exc_info=True)
            raise
        latency = int((time.perf_counter() - start) * 1000)
        return LLMResponse(
            text=resp.output_text,
            model=resp.model,
            provider=self.name,
            input_tokens=resp.usage.input_tokens,
            output_tokens=resp.usage.output_tokens,
            latency_ms=latency,
            stop_reason=getattr(resp, "status", "completed"),
            metadata={"id": resp.id},
        )
```

```python
# usage.py
import logging

from providers.anthropic_provider import ClaudeProvider

logger = logging.getLogger(__name__)


def run_example() -> None:
    logging.basicConfig(level=logging.INFO)
    # Same caller contract regardless of vendor. Swap providers without changing code.
    claude = ClaudeProvider()
    r = claude.complete(
        prompt="Summarize transformer self-attention in two sentences.",
        system="You are a precise ML educator. Answer tersely.",
        max_tokens=256,
        temperature=0.1,
    )
    # Every call is measurable: tokens, latency, cost attribution.
    logger.info(
        "llm_call_complete",
        extra={
            "provider": r.provider,
            "model": r.model,
            "input_tokens": r.input_tokens,
            "output_tokens": r.output_tokens,
            "latency_ms": r.latency_ms,
        },
    )
    print(r.text)
```

Read that code with a production eye. Every design decision has a reason.
The LLMResponse dataclass exists so your downstream code never touches vendor-specific objects. Token counts are surfaced as first-class fields because without them you cannot track cost. Latency is measured on the caller side rather than trusted from vendor metadata because vendor clocks are not your clocks. The provider interface uses an abc.ABC contract so the type checker flags any missing implementation before you ship. Rate limit exceptions are re-raised rather than swallowed because the caller needs to decide whether to back off, queue, or fail over, and that is not a provider-level decision.
Notice what the code does not do. It does not implement retries. It does not implement automatic failover between providers. It does not cache. Those are the next layers of the stack and we will build them in subsequent sections. What this gives you is the foundational abstraction: a single function signature, a single response shape, a single place where every LLM call is logged with the data you need to debug production.

What Breaks in Production
Here is what most teams write the first time they call an LLM API. This is the pattern I find in almost every first-month production incident review.
```python
import anthropic

# Hardcoded key checked into git. Breaks the moment the repo is shared.
client = anthropic.Anthropic(api_key="sk-ant-api03-abc123-LEAKED")


def ask(prompt):
    # Using a model identifier that no longer exists.
    r = client.messages.create(
        model="claude-3-opus",
        messages=[{"role": "user", "content": prompt}],
    )
    # No error handling, no token accounting, no latency tracking.
    # If Anthropic throttles us at 3 a.m. the whole service goes down.
    return r.content[0].text
```

Every single thing about that snippet will hurt you in production.
The API key is committed in source code and will leak the instant the repo is shared, forked, or indexed by any secret-scanning tool. I have watched a team spend twenty-four hours rotating keys and invalidating cached credentials in three clouds because one junior developer pushed a hardcoded key to a public mirror by accident. The cost of that incident was real engineer days plus real customer-trust damage.
The model identifier claude-3-opus no longer exists. The call will fail and your users will see 500s in the UI. Your on-call gets paged. You discover the issue the same way everyone does: in production.
There is no timeout, no retry, and no circuit breaker, so a single Anthropic rate limit event during a traffic spike brings down your entire service for the length of the window. There is no token accounting, so you will not know your API bill is doubling until the invoice arrives. There is no structured logging, so when a response is wrong you have no way to reproduce it. "It works on my machine" is now a production debugging technique.
The production version earlier in this post enforces a checklist by design: API keys read from environment variables, current model identifiers verified at call time, errors separated into rate limit vs transport vs API vs unknown classes, every call tagged with provider / model / tokens / latency, vendor-specific response objects never leaking past the provider boundary, and a response shape that contains every field needed for cost attribution. This is the floor, not the ceiling. Retries, circuit breakers, caching, and failover all sit on top of this foundation. We build them in the sections that follow.
Version Note
⚡ Version note: This section uses anthropic Python SDK v0.96.0 (released April 16, 2026) and openai Python SDK v2.32.0 (released April 15, 2026). Both SDKs are moving quickly. The Anthropic SDK cadence has been roughly one minor release per week through Q1 2026. The OpenAI SDK moves even faster. Pin exact versions in your `requirements.txt` or `pyproject.toml` and upgrade deliberately rather than automatically. Model identifiers in this section reflect the April 17, 2026 state of the world.
Key Takeaways
Five things to carry forward from this section.
Model selection is an architectural decision, not a shopping one. It commits you to a tokenizer, context window, tool schema, and pricing curve. Design for routing between multiple models from day one.
Default to Claude Sonnet 4.6 for most production workloads. Reach for Opus 4.7 when measured quality gaps justify 1.67x the cost. Reach for Haiku 4.5 for classification, routing, and high-volume tasks. A 70/20/10 routing split between Haiku, Sonnet, and Opus often cuts total spend by more than half versus an all-Sonnet deployment.
A provider abstraction is the cheapest insurance policy you can buy. Thirty lines of code on day one saves a week of migration pain six months later when the landscape shifts, and it will shift.
Measure quality on your own workload. Public benchmarks are not your workload. Twenty representative prompts with known-good answers is the minimum viable eval set, and it beats every leaderboard for your specific decision.
Never hardcode API keys, never use stale model identifiers, never ship without structured logging. These three sins cause ninety percent of first-month LLM production incidents. Refuse all three in code review.
What's Next
The next section, 2.2, walks through making your first structured LLM API call end-to-end: proper error handling with exponential backoff, structured logging of every request and response, timeout configuration, and the retry patterns that turn this provider abstraction from a clean interface into a production-hardened component. We also look at how to test LLM-integrating code when every call is non-deterministic and expensive.
If you are reading this in order, that is your next stop. If you are dropping in to find an answer to a specific question, you can skim the section map in the book outline and jump to what you need.
Follow Usama Nawaz for weekly deep dives on building production-grade AI systems. The full AI Engineer's Field Guide is published as a chapter-by-chapter series across Substack, Medium, and LinkedIn. This is section 2.1.