From “cool demo” to actionable playbook.
Here’s what the community asked, and what you can put into practice right away.
In the webinar, Nikolay showcased a multi‑agent PR review system (code quality, test quality, security, prioritization) that analyzes every GitHub PR/commit and pushes results to a live dashboard. It was stitched together with v0/Vercel for the UI, Claude Code for implementation, GitHub Actions for CI, CodeRabbit & GitHub Copilot for reviews, and Temporal for orchestration.
TL;DR (What you’ll learn in this post)
- How to keep tests stable when AI suggestions are nondeterministic (without over‑mocking)
- How testing evolves when most of the app is AI‑generated (think contracts + runtime monitoring + evals)
- The four metrics that keep reliability from quietly degrading as shipping speed accelerates
- A pragmatic way to triage AI‑amplified PR noise in open source and internal repos.
- Why Temporal‑style orchestration is the hidden superpower for durable, long‑lived agent workflows.
What We Built (and Why It Matters)
Nikolay demoed an end‑to‑end system that runs on every pull request:
- Agents: code quality, test quality, security, complexity, and a prioritization agent that ranks the findings.
- Dashboard: a v0/Vercel front end that aggregates metrics (e.g., PRs analyzed, issue counts by category, time saved) and makes a “block/allow” recommendation.
- Workflow: GitHub Actions triggers the analysis; Temporal orchestrates the agent steps and makes failures recoverable (resume exactly where the system left off).
This isn’t just “prompting”—it’s a durable system where AI is one component among CI/CD, storage, and orchestration. As Nikolay put it, we’re moving from “building prompts” to building systems that coordinate many agents into reliable outcomes.
“AI didn’t remove judgment. It amplified the cost of bad judgment.”
The Q&A: Your Biggest Questions, Answered
1) How do we write stable test assertions when AI tools generate code nondeterministically—without over‑mocking?
Bottom line: Treat tests as requirements that constrain the system. Have the AI implement code to pass your tests, not the other way around. Prioritize integration/API tests where AI tends to stumble (auth, database, external services), and do not allow the AI to modify tests just to make a build green.
“If you decide the tests are correct, don’t allow the AI to change them.”
Practical moves
- Start with a thin UI smoke layer; invest heavier in integration and contract tests.
- When using an AI builder (e.g., v0), do quick manual verification on UI, then drive coverage deeper where regressions hide.
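To make "tests as requirements" concrete, here is a minimal contract-style integration test sketch. The framework (Vitest), the /api/analysis route, and the response shape are assumptions for illustration, not the webinar's actual code; the point is that the assertion pins the interface, and the AI has to write code that satisfies it rather than rewriting the test.

```typescript
// analysis-contract.test.ts: a minimal sketch; framework (Vitest), route, and response shape are assumptions
import { describe, it, expect } from "vitest";

const API_URL = process.env.API_URL ?? "http://localhost:3000";

describe("PR analysis API contract", () => {
  it("returns findings grouped by agent with a block/allow recommendation", async () => {
    const res = await fetch(`${API_URL}/api/analysis/42`); // hypothetical endpoint
    expect(res.status).toBe(200);

    const body = await res.json();
    // Pin the interface the dashboard depends on; AI-generated code must conform to it,
    // and the AI is not allowed to edit this file to make the build green.
    expect(body).toMatchObject({
      prNumber: 42,
      recommendation: expect.stringMatching(/^(block|allow)$/),
    });
    expect(Array.isArray(body.findings)).toBe(true);
    for (const finding of body.findings) {
      expect(["code quality", "test quality", "security", "complexity"]).toContain(finding.agent);
    }
  });
});
```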
2) When most of the application is AI‑generated, how should testing evolve—contracts, runtime monitoring, BDD?
Short answer: All of the above. If code is cheap now, tests are cheap too. Combine contract tests for interfaces, behavior checks for flows, and runtime monitoring to detect drift. Also, treat prompts as first‑class system artifacts—use LLM evals to verify that prompt tweaks or model swaps don’t degrade behavior.
Why evals? Even with identical prompts, model runs can return slightly different outputs; evals give you a regression harness for your AI layers.
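A minimal sketch of such an eval harness, assuming the OpenAI Node SDK and a couple of hand-written gold cases (the prompt, model name, and cases below are illustrative):

```typescript
// prompt-eval.ts: a tiny regression harness for a review prompt; model, prompt, and cases are illustrative
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const evalCases = [
  { name: "obvious SQL injection", diff: "query(`SELECT * FROM users WHERE id = ${req.params.id}`)", mustFlag: "security" },
  { name: "harmless rename", diff: "const total = sum; // was: const t = sum;", mustFlag: null },
];

async function runEvals(model: string) {
  let passed = 0;
  for (const c of evalCases) {
    const completion = await client.chat.completions.create({
      model,
      temperature: 0,
      response_format: { type: "json_object" },
      messages: [
        { role: "system", content: 'You review PR diffs. Reply with JSON: {"category": "security" | null}' },
        { role: "user", content: c.diff },
      ],
    });
    const result = JSON.parse(completion.choices[0].message.content ?? "{}");
    if (result.category === c.mustFlag) passed++;
    else console.error(`FAIL ${c.name}: got ${result.category}, expected ${c.mustFlag}`);
  }
  console.log(`${passed}/${evalCases.length} eval cases passed for ${model}`);
  if (passed < evalCases.length) process.exit(1); // fail CI when a prompt edit or model swap regresses
}

runEvals(process.env.EVAL_MODEL ?? "gpt-4o-mini").catch((err) => {
  console.error(err);
  process.exit(1);
});
```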
3) With cycles shrinking from months to minutes, what quality metrics keep reliability from decaying?
Track these four relentlessly:
1) Code quality (e.g., cyclomatic complexity trending)
2) Defect leakage to production
3) Automated test pass rates (resist bulldozing through reds)
4) Uptime/availability, especially for agentic systems with multiple external dependencies
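A minimal sketch of how a CI job could gate on these four; the thresholds and field names below are assumptions for illustration, not numbers from the webinar:

```typescript
// quality-gate.ts: illustrative thresholds; field names and limits are assumptions, not from the webinar
interface QualitySnapshot {
  avgCyclomaticComplexity: number; // code quality trend
  defectLeakageRate: number;       // escaped defects / total defects found
  testPassRate: number;            // automated test pass rate (0 to 1)
  uptime: number;                  // availability over the window (0 to 1)
}

function gate(s: QualitySnapshot): string[] {
  const violations: string[] = [];
  if (s.avgCyclomaticComplexity > 10) violations.push("complexity trending up");
  if (s.defectLeakageRate > 0.05) violations.push("defect leakage above 5%");
  if (s.testPassRate < 0.98) violations.push("test pass rate below 98%");
  if (s.uptime < 0.995) violations.push("availability below 99.5%");
  return violations;
}

const violations = gate({
  avgCyclomaticComplexity: 8.2,
  defectLeakageRate: 0.02,
  testPassRate: 0.99,
  uptime: 0.997,
});
if (violations.length > 0) {
  console.error("Quality gate failed:", violations.join("; "));
  process.exit(1);
}
console.log("Quality gate passed");
```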
4) Open‑source maintainers are flooded with AI‑generated low‑quality issues/PRs. How do we manage the noise?
There’s no silver bullet yet. The direction is to automate first‑pass triage (AI PR reviews + a prioritization agent) and standardize the inputs you accept—but that creates new prioritization/testing challenges that teams must keep iterating on. The demo’s prioritization agent reduces attention tax per PR; a future “product owner” agent could prioritize across the entire repo.
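As a rough illustration of first-pass triage, here is a toy prioritization scorer. The weights and fields are invented; a real prioritization agent (like the one in the demo) would lean on an LLM plus repo context rather than fixed weights.

```typescript
// triage.ts: a toy first-pass scorer; the weights and fields are invented for illustration
interface Finding {
  agent: "code quality" | "test quality" | "security" | "complexity";
  severity: "low" | "medium" | "high";
  confidence: number; // 0 to 1: how sure the reviewing agent is
}

const agentWeight: Record<Finding["agent"], number> = {
  security: 3, "test quality": 2, "code quality": 1, complexity: 1,
};
const severityWeight: Record<Finding["severity"], number> = { high: 3, medium: 2, low: 1 };

// Rank findings so maintainers spend attention on the riskiest, most credible ones first.
function prioritize(findings: Finding[]): Finding[] {
  const score = (f: Finding) => agentWeight[f.agent] * severityWeight[f.severity] * f.confidence;
  return [...findings].sort((a, b) => score(b) - score(a));
}

console.log(
  prioritize([
    { agent: "complexity", severity: "medium", confidence: 0.9 },
    { agent: "security", severity: "high", confidence: 0.7 },
  ])
);
```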
5) How do we shortlist AI tools for QA when the market changes daily?
Start with low‑friction value: e.g., CodeRabbit for PR reviews (free tier), and inexpensive models like GPT‑4o mini via SDKs that you can call inside your test harness. Then scale toward enterprise options that fit your stack and governance needs (e.g., Tosca, Testim, NeoLoad, qTest—each covering a different testing area).
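For the "call a cheap model from your test harness" idea, one pattern is a small LLM-backed assertion helper. This sketch assumes the OpenAI Node SDK; the prompt and the usage example are illustrative.

```typescript
// llm-assert.ts: a sketch of a cheap LLM-backed check for fuzzy outputs; prompt and model are illustrative
import OpenAI from "openai";

const client = new OpenAI();

// Ask an inexpensive model whether a piece of text satisfies a plain-English criterion.
// Useful where exact string matches are brittle (summaries, review comments, error copy).
export async function llmSatisfies(text: string, criterion: string): Promise<boolean> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: 'Answer strictly with JSON: {"pass": true} or {"pass": false}.' },
      { role: "user", content: `Criterion: ${criterion}\n\nText:\n${text}` },
    ],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}").pass === true;
}

// Example use inside a test:
//   expect(await llmSatisfies(reviewComment, "mentions the missing null check")).toBe(true);
```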
6) Will “the next big model” replace today’s LLMs? What happens to projects built on current models/MCP/Copilot?
Treat the model as configuration. Point your system at the new model via config, then use your test/eval suites to ensure the new model meets or beats the old baseline on your tasks before you promote to production.
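One way to operationalize "model as configuration" is to keep the model name in a config file and gate any change on your eval suite. The file paths and report shape below are assumptions:

```typescript
// promote-model.ts: a sketch of "model as configuration"; file names and report shape are assumptions
import { readFileSync, writeFileSync } from "node:fs";

interface EvalReport { model: string; passRate: number; }

const baseline: EvalReport = JSON.parse(readFileSync("evals/baseline.json", "utf8"));
const candidate: EvalReport = JSON.parse(readFileSync("evals/candidate.json", "utf8"));

// Only promote if the candidate meets or beats the baseline on your own eval suite.
if (candidate.passRate >= baseline.passRate) {
  writeFileSync("config/model.json", JSON.stringify({ model: candidate.model }, null, 2));
  console.log(`Promoted ${candidate.model} (pass rate ${candidate.passRate} >= ${baseline.passRate})`);
} else {
  console.error(`Keeping ${baseline.model}: ${candidate.model} regressed to ${candidate.passRate}`);
  process.exit(1);
}
```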
7) How do we test an AI system that makes decisions (e.g., JD/CV analysis → interviews → suitability scores)?
Build a domain benchmark/eval set: a library of standard inputs (synthetic + representative real‑world cases) with expected outcomes. Continuously run the system on this corpus to track decision quality, run‑to‑run consistency, latency, and fairness. Think SWE‑bench for your workflow, with domain‑specific gold cases and edge cases.
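A minimal harness for such a corpus might look like the sketch below; the decision type and metrics are illustrative, and latency/fairness tracking would need additional instrumentation on top of this.

```typescript
// benchmark.ts: a minimal domain eval harness; the decision type and metrics are illustrative
interface GoldCase {
  id: string;
  jd: string;       // job description
  cv: string;       // candidate CV
  expected: "interview" | "reject";
}

type Decide = (jd: string, cv: string) => Promise<"interview" | "reject">;

// Run the system under test over the gold corpus several times per case to measure
// both decision quality (vs. expected outcomes) and run-to-run consistency.
export async function runBenchmark(cases: GoldCase[], decide: Decide, runsPerCase = 3) {
  let correct = 0;
  let consistent = 0;
  for (const c of cases) {
    const decisions: string[] = [];
    for (let i = 0; i < runsPerCase; i++) decisions.push(await decide(c.jd, c.cv));
    if (decisions[0] === c.expected) correct++;
    if (decisions.every((d) => d === decisions[0])) consistent++;
  }
  return {
    accuracy: correct / cases.length,       // decision quality on the first run
    consistency: consistent / cases.length, // identical decisions across repeated runs
  };
}
```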
8) How do we coordinate multiple agents on the same codebase?
Start simple: call one agent at a time via an SDK (e.g., OpenAI), pass the same PR context (PR number, title, description, author, diff, test summary), constrain outputs to JSON, and collect results. For scale and reliability, adopt an orchestrator like Temporal to parallelize, retry with backoff, set timeouts, and—crucially—resume exactly at the failure point after outages or code fixes.
In the demo, Nikolay intentionally killed the worker mid‑run; Temporal kept the workflow waiting, then resumed from the exact step once the worker restarted—no reruns, no wasted tokens/time.
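For orientation, here is a condensed workflow sketch using the Temporal TypeScript SDK. The activity names, PR context shape, and timeout/retry values are illustrative, not the webinar's actual code.

```typescript
// analysis.workflow.ts: a condensed sketch with the Temporal TypeScript SDK; activity names are illustrative
import { proxyActivities } from "@temporalio/workflow";
import type * as activities from "./activities"; // assumed to export the agent activities below

const { runCodeQuality, runTestQuality, runSecurity, prioritize } = proxyActivities<typeof activities>({
  startToCloseTimeout: "5 minutes",                      // bound each agent call
  retry: { maximumAttempts: 3, backoffCoefficient: 2 },  // retry with backoff on rate limits / flaky networks
});

export async function analyzePullRequest(prContext: {
  prNumber: number;
  title: string;
  diff: string;
}): Promise<unknown> {
  // Run the review agents in parallel; Temporal records each completed result,
  // so a crashed worker resumes here without re-running finished agents.
  const [codeQuality, testQuality, security] = await Promise.all([
    runCodeQuality(prContext),
    runTestQuality(prContext),
    runSecurity(prContext),
  ]);

  // The prioritization agent ranks the combined findings for the dashboard.
  return prioritize({ codeQuality, testQuality, security });
}
```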
9) Career question (student): “With AI building front ends/back ends so fast, should I still learn to code? AI/ML or software engineering?”
Begin with your objective: what kinds of systems do you want to build, and what problems do you want to solve? The discussion emphasized aligning the path (AI/ML vs. SWE) with those goals and following up with specifics.
5 Moves You Can Make This Week
- Lock your tests. If the AI breaks them, fix the code, not the tests. Add a linter/check that prevents PRs from modifying “golden” test files without human approval (see the sketch after this list).
- Shift coverage down the stack. Create or expand API/integration tests for auth, DB, and external calls; keep UI tests light but meaningful.
- Add a prompt eval suite. Store prompts as markdown, version them, and add evals to your CI to catch regressions after prompt edits or model switches.
- Plug in a PR reviewer. Start with CodeRabbit and compare its comments to Copilot’s; track which catches more issues relevant to your codebase.
- Pilot orchestration. Wrap one multi‑step agent workflow in Temporal and simulate failures (timeouts, worker restarts) to validate durability before expanding.
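For the "lock your tests" check referenced above, a lightweight version can run as a CI step. The protected paths, environment variables, and approval mechanism below are assumptions:

```typescript
// guard-golden-tests.ts: a sketch of the "lock your tests" check; paths and env vars are assumptions
import { execSync } from "node:child_process";

const GOLDEN_PATTERNS = [/^tests\/contracts\//, /\.golden\.test\.ts$/]; // protected test files
const baseRef = process.env.GITHUB_BASE_REF ?? "main";

// List files changed in this PR relative to the base branch.
const changed = execSync(`git diff --name-only origin/${baseRef}...HEAD`, { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const touchedGolden = changed.filter((f) => GOLDEN_PATTERNS.some((re) => re.test(f)));

if (touchedGolden.length > 0 && process.env.GOLDEN_CHANGE_APPROVED !== "true") {
  console.error("Golden test files modified without human approval:\n" + touchedGolden.join("\n"));
  process.exit(1);
}
console.log("Golden test files untouched (or change explicitly approved).");
```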
Under the Hood: Why Orchestration Is the Real Multiplier
Agentic systems are distributed systems. Latencies spike, models rate‑limit, networks drop. Without orchestration, your “AI pipeline” is fragile and hard to debug. In the demo, the Temporal workflow modeled each agent as a step, captured inputs/outputs, handled long waits gracefully, and recovered deterministically at the exact point of failure—no manual replaying or token waste. That makes experiments safer and production more predictable.
Closing Thought
The takeaway from “Zero to Prod” isn’t “AI writes the app for you.” It’s that AI makes code cheap—so your job is to make systems reliable: guardrails in prompts and tests, metrics that surface drift, triage that scales, and orchestration that turns a handful of smart agents into a resilient production line.
If you’d like to see the full webinar in action, check out “Webinar Recording: Zero to Prod - Building & testing web apps with AI”.
