
Getting structured output from language models

11 min read · AI

The biggest frustration with using LLMs in applications is getting structured output. You ask for JSON and get JSON wrapped in markdown code blocks. You ask for specific fields and get extra commentary. Building reliable pipelines on top of LLM output requires techniques that go beyond "please return JSON."

I have gone through every stage of this: prompt hacking, regex parsing, retry loops, and finally the constrained decoding approaches that actually work. Here is what I have learned.

The problem

LLMs generate text token by token. They do not inherently understand schemas, types, or structure. When you ask for JSON, the model is pattern-matching based on training data, not validating against a schema. This means output can be almost-right: a missing comma, an extra field, a string where you expected a number.

For a chatbot, almost-right is fine. For an application that parses the output, almost-right is broken.

Prompt engineering (the baseline)

The simplest approach is explicit formatting instructions in your prompt:

Extract the following fields from this text and return ONLY valid JSON with no other text:
{
  "name": string,
  "email": string,
  "company": string or null
}

Text: "Hi, I'm Sarah Chen from Acme Corp. Reach me at sarah@acme.com"

This works maybe 90% of the time with good models. The other 10% is what breaks your pipeline. The model wraps it in markdown code fences, adds a helpful explanation before the JSON, or returns the right shape with a wrong type. You end up writing fragile parsing code that strips backticks and extracts JSON from surrounding text.
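That fragile parsing code tends to look like this sketch: strip markdown fences if the model added them, then grab the outermost braces and hope JSON.parse succeeds. It handles the two most common failure modes and still breaks on truncated output.

```typescript
// A sketch of the fallback parser the prompt-only approach forces you to write.
function extractJson(raw: string): unknown {
  // Strip ```json ... ``` fences if the model wrapped its answer in them
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : raw;

  // Grab everything from the first "{" to the last "}"
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model output");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}
```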

For prototyping and one-off scripts, prompt engineering is fine. For production, you need something better.

How constrained decoding works

Before looking at specific APIs, it helps to understand what is happening under the hood. Every major provider now offers some form of "structured output" that guarantees valid JSON matching a schema. They all work on the same principle: constrained decoding.

When an LLM generates text, it produces a probability distribution over all possible next tokens at each step. Normally, the model samples from this distribution freely. With constrained decoding, a grammar engine sits between the model output and the sampling step. It checks which tokens are valid continuations given the current position in the schema, and masks out everything else. If the model just produced {"name": "Sarah, the only valid next tokens are more string characters or a closing quote. An opening brace or a number would be masked to zero probability.
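A toy version of that masking step, assuming some grammar oracle that knows which tokens the schema allows next (real engines replace the oracle with precomputed bitmasks over the whole vocabulary):

```typescript
// Toy constrained-decoding step. `logits` maps token -> raw score; the grammar
// oracle says which tokens are legal continuations of the output so far.
type GrammarOracle = (outputSoFar: string, token: string) => boolean;

function maskedArgmax(
  logits: Map<string, number>,
  outputSoFar: string,
  isValid: GrammarOracle
): string {
  let best: string | null = null;
  let bestScore = -Infinity;
  for (const [token, score] of logits) {
    if (!isValid(outputSoFar, token)) continue; // mask: probability -> 0
    if (score > bestScore) {
      best = token;
      bestScore = score;
    }
  }
  if (best === null) throw new Error("Grammar dead end: no valid token");
  return best;
}
```

Even if the model's highest-probability token is an opening brace, the mask forces it to pick the best token that keeps the output inside the schema.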

The interesting engineering is in making this fast. Naive grammar checking at every token is expensive. Modern engines like XGrammar (used by several providers) precompute validity masks for 99% of the vocabulary, achieving under 40 microseconds of overhead per token. Microsoft's llguidance engine (which OpenAI credited as foundational to their implementation) uses a lazy automata approach that achieves similar performance. For practical purposes, constrained decoding adds negligible latency to generation.

For local models, Ollama passes a GBNF grammar (llama.cpp's grammar format) to the inference engine, which applies the same masking approach at the token level.
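For a flavor of what that looks like, here is a hand-written GBNF sketch for a single-field object (illustrative only; real grammars are generated from your schema):

```
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
```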

Provider-specific approaches

OpenAI: Structured Outputs

OpenAI shipped Structured Outputs in August 2024. There are two surfaces:

Response format with JSON schema. Pass response_format with a JSON schema, and with strict: true, the model is guaranteed to produce valid JSON matching that schema.

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Extract contact info from: ..." }],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "contact",
      strict: true,
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" },
          company: { type: ["string", "null"] },
        },
        required: ["name", "email", "company"],
        additionalProperties: false,
      },
    },
  },
});

Function calling with strict mode. Set strict: true inside a tool definition, and the model's tool call arguments are guaranteed to match the schema. This is useful when you are already using tool calling for other reasons.
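The tool definition carries the same schema restrictions. Sketched as the payload you would pass in the tools array of the same chat.completions.create call (the tool name and description here are made up for illustration):

```typescript
// A strict-mode tool definition, shaped for OpenAI's chat.completions.create.
// With strict: true, the arguments of any call to this tool are guaranteed to
// match `parameters`. Same rules as response_format: additionalProperties
// false on every object, every property listed in required.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "save_contact",
      description: "Record an extracted contact",
      strict: true,
      parameters: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" },
          company: { type: ["string", "null"] },
        },
        required: ["name", "email", "company"],
        additionalProperties: false,
      },
    },
  },
];
```

You then JSON.parse the tool call's arguments string, confident it matches the schema.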

There are some important restrictions on OpenAI's JSON Schema support: additionalProperties must be false on every object (including nested ones), all fields must be listed in required, and recursive schemas with $ref are not supported. These limitations exist because the constrained decoding engine needs to compile the schema into a finite grammar.

Anthropic: Structured Outputs

Claude added native structured outputs in late 2025, now GA for Opus, Sonnet, and Haiku. Before that, the workaround was defining a "tool" whose input schema described the JSON you wanted, then having the model "call" that tool. It worked but felt hacky.

The native approach uses the same constrained decoding principle. You can use it with direct JSON schema output or with strict tool use where tool parameters are guaranteed via schema constraints. The first request for a given schema has extra latency while the grammar is compiled, but Anthropic caches compiled grammars for 24 hours.
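For reference, the tool-based workaround looked roughly like this: a tool whose input_schema is the shape you want, with tool_choice forcing the model to "call" it. Sketched as a Messages API payload (SDK call omitted; the model name is a placeholder):

```typescript
// The tool-calling workaround for structured output on Claude, as a payload
// sketch for Anthropic's Messages API. tool_choice forces the "call", and the
// tool's input IS your structured output.
const request = {
  model: "claude-sonnet-4-5", // placeholder; use whatever model you run
  max_tokens: 1024,
  tools: [
    {
      name: "record_contact",
      description: "Record the contact info found in the text",
      input_schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" },
        },
        required: ["name", "email"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "record_contact" },
  messages: [
    { role: "user", content: "Extract: Sarah Chen, sarah@acme.com" },
  ],
};
```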

Google Gemini

Gemini supports structured output via response_schema in the generation config. Recent updates added full JSON Schema support including anyOf, $ref, and property ordering. It works with Pydantic and Zod schemas out of the box.
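As a payload sketch (camelCase keys as in the @google/genai JS SDK; the REST API spells them snake_case, e.g. response_schema):

```typescript
// Gemini structured-output generation config, sketched as the config object
// you would pass alongside your prompt.
const config = {
  responseMimeType: "application/json",
  responseSchema: {
    type: "object",
    properties: {
      name: { type: "string" },
      email: { type: "string" },
    },
    required: ["name", "email"],
  },
};
```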

Ollama (local models)

When you set the format parameter, Ollama passes a grammar to the llama.cpp engine:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Extract name and email from: Hi, I am Sarah Chen, sarah@acme.com",
  "format": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "email": { "type": "string" }
    },
    "required": ["name", "email"]
  }
}'

The quality depends on the model, but the structure is guaranteed. One important detail: the model does not see the format schema as context. The grammar only constrains; it does not guide intent. You should still instruct the model to output JSON in your prompt so it generates meaningful values rather than structurally valid gibberish.

Zod schemas with the Vercel AI SDK

The Vercel AI SDK has the cleanest pattern for typed structured output in TypeScript. You define a Zod schema, and the SDK handles conversion to JSON Schema, API calls, and response validation:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
 
const { object } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({
    name: z.string(),
    email: z.string().email(),
    company: z.string().nullable(),
  }),
  prompt: "Extract contact info from: ...",
});
 
// object is fully typed: { name: string, email: string, company: string | null }

The schema is defined once (in Zod), used for API validation, converted to JSON Schema for the provider, and gives you TypeScript types. The .describe() method on Zod fields adds context that helps the model understand what each field should contain:

const schema = z.object({
  sentiment: z
    .enum(["positive", "negative", "neutral"])
    .describe("The overall sentiment of the text"),
  confidence: z
    .number()
    .min(0)
    .max(1)
    .describe("Confidence score between 0 and 1"),
  topics: z
    .array(z.string())
    .describe("Key topics mentioned in the text"),
});

For streaming structured output, streamObject yields partial objects as tokens arrive, so you can update a UI progressively:

import { streamObject } from "ai";
 
const { partialObjectStream } = streamObject({
  model: openai("gpt-4o"),
  schema,
  prompt: "Analyze this article: ...",
});
 
for await (const partial of partialObjectStream) {
  // partial might be { sentiment: "positive" } then
  // { sentiment: "positive", confidence: 0.92 } etc.
  updateUI(partial);
}

Why Zod dominates the TypeScript AI ecosystem

Zod and JSON Schema are complementary, not competing. JSON Schema is the wire format that every provider's API consumes. Zod is a TypeScript-first schema library that provides runtime validation, TypeScript type inference, and a developer-friendly API for defining schemas.

Libraries like zod-to-json-schema bridge the two. The AI SDK, OpenAI's Node SDK, and Instructor (TypeScript) all use this under the hood. You write Zod, the library converts to JSON Schema for the API call, and validates the response with Zod on the way back. Define once, get three things: the TypeScript type, the runtime validator, and the LLM schema.
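Concretely, a conversion along these lines happens on every call. The output below is approximate: converters differ on details such as how nullable fields are encoded (a type array here, an anyOf in some converters).

```typescript
// What a Zod schema like
//   z.object({ name: z.string(), company: z.string().nullable() })
// roughly becomes on the wire after conversion to JSON Schema.
const wireSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    company: { type: ["string", "null"] },
  },
  required: ["name", "company"],
  additionalProperties: false,
};
```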

The Instructor pattern (Python)

In Python, Instructor by Jason Liu is the equivalent. You define Pydantic models and get validated, typed output:

import instructor
from openai import OpenAI
from pydantic import BaseModel
 
client = instructor.from_openai(OpenAI())
 
class Contact(BaseModel):
    name: str
    email: str
    company: str | None
 
contact = client.chat.completions.create(
    model="gpt-4o",
    response_model=Contact,
    messages=[{"role": "user", "content": "Extract: Sarah Chen, sarah@acme.com, Acme Corp"}],
)
# contact is a validated Contact instance

Instructor wraps your provider's SDK, converts the Pydantic model to a JSON Schema, calls the API, validates the response, and automatically retries with the validation error in the prompt if validation fails. It works with 15+ providers including OpenAI, Anthropic, Google, Ollama, and DeepSeek.

Common patterns

Classification

The simplest use case. Define an enum schema, and the model picks a category. This works extremely reliably because the output space is tiny.

const { object } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({
    category: z.enum(["bug", "feature_request", "question", "complaint"]),
    priority: z.enum(["low", "medium", "high", "critical"]),
  }),
  prompt: `Classify this support ticket: "${ticketText}"`,
});

Entity extraction

Pass a document and a schema describing the entities you want. The model fills in the fields.

const { object } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({
    people: z.array(
      z.object({
        name: z.string(),
        role: z.string().nullable(),
        organization: z.string().nullable(),
      })
    ),
    dates: z.array(z.string().describe("ISO 8601 format")),
    amounts: z.array(
      z.object({
        value: z.number(),
        currency: z.string(),
        context: z.string().describe("What the amount refers to"),
      })
    ),
  }),
  prompt: `Extract all entities from this document:\n\n${document}`,
});

Multi-step chains

Output of one structured call feeds the next. For example: extract entities, classify each one, then generate a summary. Each step has its own schema. This is where structured output really shines, because each step's output is guaranteed to be parseable as input to the next step.
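A sketch of that pipeline shape, with the model call abstracted behind a callback so the plumbing is visible (in practice each `generate` would be a structured-output call such as generateObject, with its own schema):

```typescript
// Pipeline sketch: each step has its own output type, and structured output
// guarantees each step's result parses cleanly as the next step's input.
type Generate = <T>(prompt: string) => Promise<T>;

interface Entity { name: string; kind: string }
interface Classified { name: string; kind: string; relevant: boolean }

async function analyzeDocument(doc: string, generate: Generate) {
  // Step 1: extract entities (schema: array of { name, kind })
  const entities = await generate<Entity[]>(`Extract entities: ${doc}`);

  // Step 2: classify each entity (schema: { relevant: boolean })
  const classified: Classified[] = [];
  for (const e of entities) {
    const { relevant } = await generate<{ relevant: boolean }>(
      `Is "${e.name}" relevant to the document's main topic?`
    );
    classified.push({ ...e, relevant });
  }

  // Step 3: summarize only the relevant entities (schema: { summary: string })
  const relevant = classified.filter((c) => c.relevant);
  return generate<{ summary: string }>(
    `Summarize the roles of: ${relevant.map((c) => c.name).join(", ")}`
  );
}
```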

When structured output still fails

Constrained decoding guarantees valid JSON matching your schema. It does not guarantee correct values. This is the most important thing to understand, and the source of false confidence that trips people up.

Hallucinated values that fit the type. A field typed as string will always get a string, but it might be a made-up name, date, or ID. The model will try to fill every field, even when the input does not contain relevant information. You get clean, well-typed data that is completely fabricated.

Reluctance to return empty arrays. LLMs struggle to return []. They tend to fabricate entries rather than leaving an array empty. If your schema has an array of extracted entities and the input has none, you will likely get phantom entities.

Semantic validation still matters. The model might put the person's company in the email field. The JSON is valid, the types are correct, but the values are in the wrong places. Always validate semantically, not just structurally.

const schema = z.object({
  name: z.string().min(1),
  email: z.string().email(),
  company: z.string().nullable(),
});
 
const result = schema.safeParse(object); // object from a prior generateObject call
if (!result.success) {
  // Retry with error context
  const retry = await generateObject({
    model: openai("gpt-4o"),
    schema,
    prompt: `Extract contact info. Previous attempt had errors: ${result.error.message}`,
  });
}

Zod's .email(), .url(), .min(), .max() validators catch some of these issues. For more complex validation (does this email domain actually exist? is this date in the future?), add application-level checks after the LLM call.

Schema complexity limits. OpenAI caps at 200 fields across all tools in a single API call and does not support recursive schemas. Very large or deeply nested schemas can cause compilation delays or be rejected entirely. If your schema is complex, consider breaking it into multiple calls.

Alternatives to JSON

JSON is not always the best output format for LLMs. It is verbose, with keys repeated for every object in an array, and the structural tokens (braces, quotes, colons, commas) add up.

For simple outputs, skip JSON entirely. A classification task can return a single word. A list of names can be newline-separated. An enum field is one token. Do not add JSON overhead when you do not need structure.

YAML uses about 10% fewer tokens than JSON for equivalent data and is arguably more readable. But it is harder to decode under constraints (indentation-sensitivity complicates the grammar) and has less tooling support for grammar-based generation.

Markdown is the cheapest token-wise and is essentially the model's native language. For flat key-value output where you control the parsing, a simple format like Name: Sarah\nEmail: sarah@acme.com works fine and costs fewer tokens than JSON.
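Parsing such output is only a few lines, sketched here assuming one Key: value pair per line:

```typescript
// Parser for flat "Key: value" output, one pair per line. Far fewer output
// tokens than JSON, at the cost of writing (and owning) the parsing yourself.
function parseKeyValue(text: string): Record<string, string> {
  const result: Record<string, string> = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip blank or malformed lines
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (key) result[key] = value;
  }
  return result;
}
```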

The practical rule: use JSON with constrained decoding for anything that feeds into application code. Use simpler formats for human-readable output or when token costs matter and the structure is flat.

The takeaway

For any application that consumes LLM output programmatically, use structured output APIs rather than parsing free text. The reliability difference is night and day. But remember that structural validity is not semantic validity. Always validate, always handle the case where the model fills in plausible-looking nonsense, and keep your schemas as simple as the task allows.
