AI Tip of the Week #15: Make AI checks testable with Structured Outputs (JSON Schema)

Forum|Forum|1 month ago
January 22, 2026
2 replies
41 views

+9

Mustafa
Technical Community Manager

One of the biggest hurdles when testing AI-powered features is the oracle problem—it’s hard to state what the “expected result” should be, especially for language outputs that vary from run to run. Formal guidance for testing AI systems even calls this out explicitly.

The tip: whenever your product uses an LLM, force its response into a strict JSON Schema and assert on fields—just like any other API. OpenAI, for example, supports Structured Outputs that guarantee responses conform to a schema; Microsoft’s Semantic Kernel shows the same approach in app code. This turns fuzzy text into deterministic fields your tests can validate.

Why this works (and what to watch for)

From prose to contracts. A JSON Schema acts as a “contract” (types, enums, required fields). When the model adheres to your schema, your test can assert on booleans, enums, and numbers—not on paragraphs.
Vendor‑agnostic pattern. Independent evaluations show multiple vendors can produce structured JSON at scale, though APIs differ (e.g., tool-calling vs. direct schema). Account for these differences in your harness.
Determinism is better, not perfect. Using temperature=0 and fixing a seed improves repeatability, but cloud stacks can still show run‑to‑run drift (model snapshots, batching, MoE routing). Pin exact model versions and tolerate tiny deltas where needed.

Minimal example: schema-first assertion (Python)

!--scriptorstartfragment-->


# Purpose: Turn LLM output into testable fields with JSON Schema
import json, jsonschema, pytest

# 1) Define the contract your model must satisfy
product_issue_schema = {
  "type": "object",
  "required": ["category", "severity", "repro_steps"],
  "properties": {
    "category": {"type": "string", "enum": ["bug", "accessibility", "security", "perf", "other"]},
    "severity": {"type": "string", "enum": ["blocker", "high", "medium", "low"]},
    "repro_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}
  },
  "additionalProperties": False
}

# 2) Call your LLM with Structured Outputs (pseudo-call — replace with your SDK)
#    Ensure temperature=0 and pin the exact model version in production.
#    resp = openai.responses.parse(..., text_format=YourPydanticModel / json_schema=product_issue_schema, temperature=0, seed=42)
resp_json = {
  "category": "bug",
  "severity": "high",
  "repro_steps": ["Open settings", "Click Save", "Observe 500 error"]
}

# 3) Validate structure, then assert semantics
jsonschema.validate(instance=resp_json, schema=product_issue_schema)
assert resp_json["severity"] in {"blocker", "high"}  # example business rule
assert "Observe" in " ".join(resp_json["repro_steps"])

OpenAI’s Structured Outputs enforce JSON Schema conformance; SK’s blog shows the same pattern with Pydantic/Zod types. Use these to guarantee the testable structure.
Treat temperature=0/seed as stability aids, not absolute guarantees, and pin exact model identifiers where your provider allows.

Add two safety guardrails (QA-ready)

Prompt‑injection checks in your test data & fixtures. When your app lets models read user or external content (e.g., bug reports, logs), include red‑team strings in tests to ensure your schema + filters hold up (e.g., “Ignore earlier instructions…”). OWASP flags prompt injection as a top LLM risk; test for it early.
Operationalize evaluations for AI behavior. For subjective checks (e.g., “is the summary accurate?”), run batch evals in your pipeline using an eval framework (Microsoft Foundry / Prompt flow; or your own code‑graded tests). Combine code‑graded metrics with periodic human spot‑checks.

Note: “LLM‑as‑judge” graders are useful at scale but can exhibit bias; use them alongside code‑graded or human‑graded tests.

Quick checklist you can drop into your PR template

Response schema defined (JSON Schema / Pydantic / Zod) and enforced by the model API.
Tests validate both structure and business rules.
Model version pinned; temperature=0 (+ seed if available).
Negative tests include prompt‑injection payloads.
Batch evals run in CI for subjective criteria; spot‑check results.
Risks logged against your org’s AI risk framework (e.g., NIST AI RMF profile for GenAI).

Why this belongs in QA (not just “AI engineering”)

Standards bodies and risk frameworks explicitly call out non‑determinism and oracle challenges in AI systems, which is why converting model output into verifiable structures—and then continuously evaluating behavior—fits naturally into a tester’s toolkit.

+3

Bharat2609
Ensign
Forum|Forum|1 month ago
January 24, 2026

One of the biggest hurdles when testing AI-powered features is the oracle problem—it’s hard to state what the “expected result” should be, especially for language outputs that vary from run to run. Formal guidance for testing AI systems even calls this out explicitly.

The tip: whenever your product uses an LLM, force its response into a strict JSON Schema and assert on fields—just like any other API. OpenAI, for example, supports Structured Outputs that guarantee responses conform to a schema; Microsoft’s Semantic Kernel shows the same approach in app code. This turns fuzzy text into deterministic fields your tests can validate.

Why this works (and what to watch for)

From prose to contracts. A JSON Schema acts as a “contract” (types, enums, required fields). When the model adheres to your schema, your test can assert on booleans, enums, and numbers—not on paragraphs.
Vendor‑agnostic pattern. Independent evaluations show multiple vendors can produce structured JSON at scale, though APIs differ (e.g., tool-calling vs. direct schema). Account for these differences in your harness.
Determinism is better, not perfect. Using temperature=0 and fixing a seed improves repeatability, but cloud stacks can still show run‑to‑run drift (model snapshots, batching, MoE routing). Pin exact model versions and tolerate tiny deltas where needed.

Minimal example: schema-first assertion (Python)

!--scriptorstartfragment-->


# Purpose: Turn LLM output into testable fields with JSON Schema
import json, jsonschema, pytest

# 1) Define the contract your model must satisfy
product_issue_schema = {
  "type": "object",
  "required": ["category", "severity", "repro_steps"],
  "properties": {
    "category": {"type": "string", "enum": ["bug", "accessibility", "security", "perf", "other"]},
    "severity": {"type": "string", "enum": ["blocker", "high", "medium", "low"]},
    "repro_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}
  },
  "additionalProperties": False
}

# 2) Call your LLM with Structured Outputs (pseudo-call — replace with your SDK)
#    Ensure temperature=0 and pin the exact model version in production.
#    resp = openai.responses.parse(..., text_format=YourPydanticModel / json_schema=product_issue_schema, temperature=0, seed=42)
resp_json = {
  "category": "bug",
  "severity": "high",
  "repro_steps": ["Open settings", "Click Save", "Observe 500 error"]
}

# 3) Validate structure, then assert semantics
jsonschema.validate(instance=resp_json, schema=product_issue_schema)
assert resp_json["severity"] in {"blocker", "high"}  # example business rule
assert "Observe" in " ".join(resp_json["repro_steps"])

OpenAI’s Structured Outputs enforce JSON Schema conformance; SK’s blog shows the same pattern with Pydantic/Zod types. Use these to guarantee the testable structure.
Treat temperature=0/seed as stability aids, not absolute guarantees, and pin exact model identifiers where your provider allows.

Add two safety guardrails (QA-ready)

Prompt‑injection checks in your test data & fixtures. When your app lets models read user or external content (e.g., bug reports, logs), include red‑team strings in tests to ensure your schema + filters hold up (e.g., “Ignore earlier instructions…”). OWASP flags prompt injection as a top LLM risk; test for it early.
Operationalize evaluations for AI behavior. For subjective checks (e.g., “is the summary accurate?”), run batch evals in your pipeline using an eval framework (Microsoft Foundry / Prompt flow; or your own code‑graded tests). Combine code‑graded metrics with periodic human spot‑checks.

Note: “LLM‑as‑judge” graders are useful at scale but can exhibit bias; use them alongside code‑graded or human‑graded tests.

Quick checklist you can drop into your PR template

Response schema defined (JSON Schema / Pydantic / Zod) and enforced by the model API.
Tests validate both structure and business rules.
Model version pinned; temperature=0 (+ seed if available).
Negative tests include prompt‑injection payloads.
Batch evals run in CI for subjective criteria; spot‑check results.
Risks logged against your org’s AI risk framework (e.g., NIST AI RMF profile for GenAI).

Why this belongs in QA (not just “AI engineering”)

Standards bodies and risk frameworks explicitly call out non‑determinism and oracle challenges in AI systems, which is why converting model output into verifiable structures—and then continuously evaluating behavior—fits naturally into a tester’s toolkit.

@Mustafa

The oracle problem is very real with LLMs, especially because outputs are probabilistic, not deterministic. Treating responses as contracts using JSON Schema is the right mental shift.

Structured outputs, model pinning, and temperature control bring LLM behavior closer to something QA can reason about and validate.

Also important callout on prompt injection and evals. LLM testing isn’t about checking text anymore, it’s about testing behavior under constraints.

This is exactly where QA expertise adds value in AI systems.

Bharat

Like

+2

ujjwal.kumar.singh
Specialist
Forum|Forum|1 month ago
January 28, 2026

One of the biggest hurdles when testing AI-powered features is the oracle problem—it’s hard to state what the “expected result” should be, especially for language outputs that vary from run to run. Formal guidance for testing AI systems even calls this out explicitly.

The tip: whenever your product uses an LLM, force its response into a strict JSON Schema and assert on fields—just like any other API. OpenAI, for example, supports Structured Outputs that guarantee responses conform to a schema; Microsoft’s Semantic Kernel shows the same approach in app code. This turns fuzzy text into deterministic fields your tests can validate.

Why this works (and what to watch for)

From prose to contracts. A JSON Schema acts as a “contract” (types, enums, required fields). When the model adheres to your schema, your test can assert on booleans, enums, and numbers—not on paragraphs.
Vendor‑agnostic pattern. Independent evaluations show multiple vendors can produce structured JSON at scale, though APIs differ (e.g., tool-calling vs. direct schema). Account for these differences in your harness.
Determinism is better, not perfect. Using temperature=0 and fixing a seed improves repeatability, but cloud stacks can still show run‑to‑run drift (model snapshots, batching, MoE routing). Pin exact model versions and tolerate tiny deltas where needed.

Minimal example: schema-first assertion (Python)

!--scriptorstartfragment-->


# Purpose: Turn LLM output into testable fields with JSON Schema
import json, jsonschema, pytest

# 1) Define the contract your model must satisfy
product_issue_schema = {
  "type": "object",
  "required": ["category", "severity", "repro_steps"],
  "properties": {
    "category": {"type": "string", "enum": ["bug", "accessibility", "security", "perf", "other"]},
    "severity": {"type": "string", "enum": ["blocker", "high", "medium", "low"]},
    "repro_steps": {"type": "array", "items": {"type": "string"}, "minItems": 1}
  },
  "additionalProperties": False
}

# 2) Call your LLM with Structured Outputs (pseudo-call — replace with your SDK)
#    Ensure temperature=0 and pin the exact model version in production.
#    resp = openai.responses.parse(..., text_format=YourPydanticModel / json_schema=product_issue_schema, temperature=0, seed=42)
resp_json = {
  "category": "bug",
  "severity": "high",
  "repro_steps": ["Open settings", "Click Save", "Observe 500 error"]
}

# 3) Validate structure, then assert semantics
jsonschema.validate(instance=resp_json, schema=product_issue_schema)
assert resp_json["severity"] in {"blocker", "high"}  # example business rule
assert "Observe" in " ".join(resp_json["repro_steps"])

OpenAI’s Structured Outputs enforce JSON Schema conformance; SK’s blog shows the same pattern with Pydantic/Zod types. Use these to guarantee the testable structure.
Treat temperature=0/seed as stability aids, not absolute guarantees, and pin exact model identifiers where your provider allows.

Add two safety guardrails (QA-ready)

Prompt‑injection checks in your test data & fixtures. When your app lets models read user or external content (e.g., bug reports, logs), include red‑team strings in tests to ensure your schema + filters hold up (e.g., “Ignore earlier instructions…”). OWASP flags prompt injection as a top LLM risk; test for it early.
Operationalize evaluations for AI behavior. For subjective checks (e.g., “is the summary accurate?”), run batch evals in your pipeline using an eval framework (Microsoft Foundry / Prompt flow; or your own code‑graded tests). Combine code‑graded metrics with periodic human spot‑checks.

Note: “LLM‑as‑judge” graders are useful at scale but can exhibit bias; use them alongside code‑graded or human‑graded tests.

Quick checklist you can drop into your PR template

Response schema defined (JSON Schema / Pydantic / Zod) and enforced by the model API.
Tests validate both structure and business rules.
Model version pinned; temperature=0 (+ seed if available).
Negative tests include prompt‑injection payloads.
Batch evals run in CI for subjective criteria; spot‑check results.
Risks logged against your org’s AI risk framework (e.g., NIST AI RMF profile for GenAI).

Why this belongs in QA (not just “AI engineering”)

Standards bodies and risk frameworks explicitly call out non‑determinism and oracle challenges in AI systems, which is why converting model output into verifiable structures—and then continuously evaluating behavior—fits naturally into a tester’s toolkit.

This is very much relatable. The oracle problem is very real with LLMs, and forcing outputs into a JSON Schema is a practical way to make things testable instead of arguing over text. One thing I would add from a QA point of view is that schema helps, but it’s not the finish line. It makes the output structured, not automatically correct. You are basically shifting the oracle from free-text interpretation to schema design and business rules which is still a big improvement, but comes with its own risks. A few things I have noticed in general:

A response can fully match the schema and still be wrong or misleading. So schema validation needs to sit alongside real assertions and negative cases, not replace them.
The schema itself becomes a test artifact. If it’s too loose or poorly thought out, you just move the ambiguity to a different layer.
Temperature=0 and model pinning definitely help, but drift still happens. I treat these more like stability guards than guarantees.
Prompt injection is often underestimated. If user or external content feeds the model, those attack strings should be part of regular test data, not something tested later. I agree with your core point though, testing LLMs isn’t about checking text anymore, it’s about checking behavior under constraints. That’s exactly where QA thinking fits naturally, not as an afterthought but as part of the design.
It’s a good tip for teams trying to test AI features realistically.

https://beinghumantester.github.io/

Like

Why this works (and what to watch for)

Minimal example: schema-first assertion (Python)

Add two safety guardrails (QA-ready)

Quick checklist you can drop into your PR template

Why this belongs in QA (not just “AI engineering”)

Why this works (and what to watch for)

Minimal example: schema-first assertion (Python)

Add two safety guardrails (QA-ready)

Quick checklist you can drop into your PR template

Why this belongs in QA (not just “AI engineering”)

Why this works (and what to watch for)

Minimal example: schema-first assertion (Python)

Add two safety guardrails (QA-ready)

Quick checklist you can drop into your PR template

Why this belongs in QA (not just “AI engineering”)

Sign up

Login to the community

Scanning file for viruses.

This file cannot be downloaded