Skip to main content

AI Tip of the Week #15: Make AI checks testable with Structured Outputs (JSON Schema)

  • January 22, 2026
  • 1 reply
  • 10 views
Mustafa
Forum|alt.badge.img+9

One of the biggest hurdles when testing AI-powered features is the oracle problem—it’s hard to state what the “expected result” should be, especially for language outputs that vary from run to run. Formal guidance for testing AI systems even calls this out explicitly.

The tip: whenever your product uses an LLM, force its response into a strict JSON Schema and assert on fields—just like any other API. OpenAI, for example, supports Structured Outputs that guarantee responses conform to a schema; Microsoft’s Semantic Kernel shows the same approach in app code. This turns fuzzy text into deterministic fields your tests can validate.


Why this works (and what to watch for)

  • From prose to contracts. A JSON Schema acts as a “contract” (types, enums, required fields). When the model adheres to your schema, your test can assert on booleans, enums, and numbers—not on paragraphs.
  • Vendor‑agnostic pattern. Independent evaluations show multiple vendors can produce structured JSON at scale, though APIs differ (e.g., tool-calling vs. direct schema). Account for these differences in your harness.
  • Determinism is better, not perfect. Using temperature=0 and fixing a seed improves repeatability, but cloud stacks can still show run‑to‑run drift (model snapshots, batching, MoE routing). Pin exact model versions and tolerate tiny deltas where needed.

Minimal example: schema-first assertion (Python)

!--scriptorstartfragment-->


# Purpose: Turn LLM output into testable fields with JSON Schema
import json, jsonschema, pytest

# 1) Define the contract your model must satisfy
product_issue_schema = {
"type": "object",
"required": ["category", "severity", "repro_steps"],
"properties": {
"category": {"type": "string", "enum": ["bug", "accessibility", "security", "perf", "other"]},
"severity": {"type": "string", "enum": ["blocker", "high", "medium", "low"]},
"repro_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}
},
"additionalProperties": False
}

# 2) Call your LLM with Structured Outputs (pseudo-call — replace with your SDK)
# Ensure temperature=0 and pin the exact model version in production.
# resp = openai.responses.parse(..., text_format=YourPydanticModel / json_schema=product_issue_schema, temperature=0, seed=42)
resp_json = {
"category": "bug",
"severity": "high",
"repro_steps": ["Open settings", "Click Save", "Observe 500 error"]
}

# 3) Validate structure, then assert semantics
jsonschema.validate(instance=resp_json, schema=product_issue_schema)
assert resp_json["severity"] in {"blocker", "high"} # example business rule
assert "Observe" in " ".join(resp_json["repro_steps"])
  • OpenAI’s Structured Outputs enforce JSON Schema conformance; SK’s blog shows the same pattern with Pydantic/Zod types. Use these to guarantee the testable structure.
  • Treat temperature=0/seed as stability aids, not absolute guarantees, and pin exact model identifiers where your provider allows.

Add two safety guardrails (QA-ready)

  1. Prompt‑injection checks in your test data & fixtures. When your app lets models read user or external content (e.g., bug reports, logs), include red‑team strings in tests to ensure your schema + filters hold up (e.g., “Ignore earlier instructions…”). OWASP flags prompt injection as a top LLM risk; test for it early. 

  2. Operationalize evaluations for AI behavior. For subjective checks (e.g., “is the summary accurate?”), run batch evals in your pipeline using an eval framework (Microsoft Foundry / Prompt flow; or your own code‑graded tests). Combine code‑graded metrics with periodic human spot‑checks. 

Note: “LLM‑as‑judge” graders are useful at scale but can exhibit bias; use them alongside code‑graded or human‑graded tests.


Quick checklist you can drop into your PR template

  • Response schema defined (JSON Schema / Pydantic / Zod) and enforced by the model API. 
  • Tests validate both structure and business rules.
  • Model version pinned; temperature=0 (+ seed if available).
  • Negative tests include prompt‑injection payloads.
  • Batch evals run in CI for subjective criteria; spot‑check results.
  • Risks logged against your org’s AI risk framework (e.g., NIST AI RMF profile for GenAI). 

Why this belongs in QA (not just “AI engineering”)

Standards bodies and risk frameworks explicitly call out non‑determinism and oracle challenges in AI systems, which is why converting model output into verifiable structures—and then continuously evaluating behavior—fits naturally into a tester’s toolkit.

1 reply

Bharat2609
Forum|alt.badge.img+3
  • Ensign
  • January 24, 2026

One of the biggest hurdles when testing AI-powered features is the oracle problem—it’s hard to state what the “expected result” should be, especially for language outputs that vary from run to run. Formal guidance for testing AI systems even calls this out explicitly.

The tip: whenever your product uses an LLM, force its response into a strict JSON Schema and assert on fields—just like any other API. OpenAI, for example, supports Structured Outputs that guarantee responses conform to a schema; Microsoft’s Semantic Kernel shows the same approach in app code. This turns fuzzy text into deterministic fields your tests can validate.

Why this works (and what to watch for)

  • From prose to contracts. A JSON Schema acts as a “contract” (types, enums, required fields). When the model adheres to your schema, your test can assert on booleans, enums, and numbers—not on paragraphs.
  • Vendor‑agnostic pattern. Independent evaluations show multiple vendors can produce structured JSON at scale, though APIs differ (e.g., tool-calling vs. direct schema). Account for these differences in your harness.
  • Determinism is better, not perfect. Using temperature=0 and fixing a seed improves repeatability, but cloud stacks can still show run‑to‑run drift (model snapshots, batching, MoE routing). Pin exact model versions and tolerate tiny deltas where needed.

Minimal example: schema-first assertion (Python)

!--scriptorstartfragment-->


# Purpose: Turn LLM output into testable fields with JSON Schema
import json, jsonschema, pytest

# 1) Define the contract your model must satisfy
product_issue_schema = {
"type": "object",
"required": ["category", "severity", "repro_steps"],
"properties": {
"category": {"type": "string", "enum": ["bug", "accessibility", "security", "perf", "other"]},
"severity": {"type": "string", "enum": ["blocker", "high", "medium", "low"]},
"repro_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}
},
"additionalProperties": False
}

# 2) Call your LLM with Structured Outputs (pseudo-call — replace with your SDK)
# Ensure temperature=0 and pin the exact model version in production.
# resp = openai.responses.parse(..., text_format=YourPydanticModel / json_schema=product_issue_schema, temperature=0, seed=42)
resp_json = {
"category": "bug",
"severity": "high",
"repro_steps": ["Open settings", "Click Save", "Observe 500 error"]
}

# 3) Validate structure, then assert semantics
jsonschema.validate(instance=resp_json, schema=product_issue_schema)
assert resp_json["severity"] in {"blocker", "high"} # example business rule
assert "Observe" in " ".join(resp_json["repro_steps"])
  • OpenAI’s Structured Outputs enforce JSON Schema conformance; SK’s blog shows the same pattern with Pydantic/Zod types. Use these to guarantee the testable structure.
  • Treat temperature=0/seed as stability aids, not absolute guarantees, and pin exact model identifiers where your provider allows.

Add two safety guardrails (QA-ready)

  1. Prompt‑injection checks in your test data & fixtures. When your app lets models read user or external content (e.g., bug reports, logs), include red‑team strings in tests to ensure your schema + filters hold up (e.g., “Ignore earlier instructions…”). OWASP flags prompt injection as a top LLM risk; test for it early. 

  2. Operationalize evaluations for AI behavior. For subjective checks (e.g., “is the summary accurate?”), run batch evals in your pipeline using an eval framework (Microsoft Foundry / Prompt flow; or your own code‑graded tests). Combine code‑graded metrics with periodic human spot‑checks. 

Note: “LLM‑as‑judge” graders are useful at scale but can exhibit bias; use them alongside code‑graded or human‑graded tests.

Quick checklist you can drop into your PR template

  • Response schema defined (JSON Schema / Pydantic / Zod) and enforced by the model API. 
  • Tests validate both structure and business rules.
  • Model version pinned; temperature=0 (+ seed if available).
  • Negative tests include prompt‑injection payloads.
  • Batch evals run in CI for subjective criteria; spot‑check results.
  • Risks logged against your org’s AI risk framework (e.g., NIST AI RMF profile for GenAI). 

Why this belongs in QA (not just “AI engineering”)

Standards bodies and risk frameworks explicitly call out non‑determinism and oracle challenges in AI systems, which is why converting model output into verifiable structures—and then continuously evaluating behavior—fits naturally into a tester’s toolkit.

 

@Mustafa 

The oracle problem is very real with LLMs, especially because outputs are probabilistic, not deterministic. Treating responses as contracts using JSON Schema is the right mental shift.

Structured outputs, model pinning, and temperature control bring LLM behavior closer to something QA can reason about and validate.

Also important callout on prompt injection and evals. LLM testing isn’t about checking text anymore, it’s about testing behavior under constraints.

This is exactly where QA expertise adds value in AI systems.