
QA true crime challenge

  • June 17, 2026
  • 5 replies
  • 0 views

PolinaKr

Solve the challenge by Christine Pinto and get a chance to win our gift box 🎁

 

What's a test in YOUR system that everyone agrees is 'probably fine' but nobody's ever actually run? Tell us what it is and what you think would happen if it broke.

5 replies

Hello, I have been working on a legacy application migration to the AWS platform. The application was set up in AWS with a disaster recovery site, but nobody ever validated that it actually worked! No one bothered to test whether traffic was being routed properly to the recovery site. I kept pointing out this gap and asking to prioritize some kind of test for this feature, but everyone brushed it off, saying we trust AWS to do that seamlessly. Several months after going to production, AWS’s US eastern zone1 had an outage of several hours, and the application failed to route traffic to the disaster recovery zone. We lost several million dollars that day! Later, when we tested this feature in the test environment, we found that the critical config that enables traffic to route to the other site was missing. Lesson learned!


  • Space Cadet
  • June 17, 2026

We were migrating a hotel app from Heroku to AWS. One day, no rollback, no backup option. All payments through Stripe.

We tested everything we could on staging with the Stripe sandbox before the migration. But I pushed that we needed to test with production tokens. Management kept saying no for a while, but eventually agreed. When we did, it turned out the developers hadn't accounted for the config being different in production. The webhook secret was set in prod but not on staging, and the code was silently skipping webhook signature validation when the secret wasn't there. Someone had added that as a local dev convenience, and it made it all the way to the AWS config. So the entire webhook flow (payment confirmations, cancellations, refunds) would have been silently returning 400s. Stripe retries a few times, then just gives up. Money gets charged, the booking stays pending forever, and the guest gets no confirmation.

And finding it would have been a nightmare. Stripe shows successful payments on their side, and the frontend shows the request went through. The only trace is quiet 400s on the webhook endpoint, sitting somewhere in AWS logs that nobody would think to check, because on staging they never existed.

That one argument with management probably saved the whole migration.
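For anyone wanting to guard against the "skip validation when the secret is missing" pattern described above, here is a minimal sketch of a check that fails closed instead. It uses a generic HMAC-SHA256 comparison from the Python standard library, not Stripe's actual signature format, and the names `verify_webhook` and `WebhookConfigError` are hypothetical. A real integration would more likely rely on the payment provider's official library.

```python
import hashlib
import hmac
from typing import Optional


class WebhookConfigError(RuntimeError):
    """Raised when the signing secret is missing, instead of skipping the check."""


def verify_webhook(payload: bytes, signature_hex: str, secret: Optional[str]) -> bool:
    # Fail closed: a missing secret is a configuration error,
    # never a reason to accept the request unverified.
    if not secret:
        raise WebhookConfigError("webhook signing secret is not configured")
    # Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body.
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

With this shape, an environment that lacks the secret fails loudly on the first webhook rather than behaving differently from production without anyone noticing.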


Ramanan
  • Ace Pilot
  • June 17, 2026


@PolinaKr , ​@Mustafa 

"The Ghost Test" 👻


In our system, there's a test called validate_legacy_payment_fallback() written in 2019 by someone who left the company before I joined. It's been green ever since. Nobody touches it. Nobody questions it. It has a comment that says "DO NOT MODIFY - critical path" with no further explanation.
What does it actually test? Allegedly, the failover logic when the primary payment gateway times out. But here's the thing: the mock it uses points to a sandbox URL that was decommissioned in 2021. It's been testing... nothing. A ghost handshake with a server that no longer exists.
What would happen if it broke?

Honestly? We'd probably only find out during a real production outage at 2am on a Friday when a payment gateway actually goes down and our "tested" fallback silently fails, taking the checkout flow with it. Revenue stops. Alerts fire. Someone screenshots the test suite showing "ALL GREEN ✅" and posts it in Slack as a joke. It would not be funny.


The real crime? The false confidence. Green tests feel like safety. But an untested test isn't a safety net; it's a painted floor drain.
Now I'm going to go run it. Wish me luck. 🫡
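One way to keep a fallback test from quietly testing nothing is to stub the gateways locally and assert that the fallback path was actually taken. The sketch below is an illustration only: `charge_with_fallback` and both gateway objects are hypothetical stand-ins, not the real `validate_legacy_payment_fallback()` code, and it assumes the primary signals a timeout by raising `TimeoutError`.

```python
from unittest.mock import Mock


def charge_with_fallback(primary, fallback, amount):
    """Hypothetical failover: try the primary gateway, use the fallback on timeout."""
    try:
        return primary.charge(amount)
    except TimeoutError:
        return fallback.charge(amount)


def test_fallback_runs_when_primary_times_out():
    # Simulate the outage locally instead of depending on a remote sandbox URL.
    primary = Mock()
    primary.charge.side_effect = TimeoutError("primary gateway timed out")
    fallback = Mock()
    fallback.charge.return_value = "paid-via-fallback"

    result = charge_with_fallback(primary, fallback, 100)

    # Assert on the interactions, so the test cannot pass without
    # exercising both the timeout and the fallback call.
    assert result == "paid-via-fallback"
    primary.charge.assert_called_once_with(100)
    fallback.charge.assert_called_once_with(100)
```

The interaction assertions are the point: if the failover branch is removed or never reached, the test goes red instead of staying green by accident.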


  • Space Cadet
  • June 17, 2026


There's a requirement to run H/W failure tests, but then people recall that there are only a few systems, that they are expensive, and that we have deadlines ahead. So people agree to run them, just “not this time”.


  • Space Cadet
  • June 17, 2026

The Test We All Agreed Was Important—But Never Ran
 

The test nobody had actually run in our system was a full rollback and data recovery drill.

Everyone agreed it was important.

Everyone agreed we should test it.

But release schedules, customer commitments, and time constraints always pushed it to the next sprint.

On January 25, 2026, we released a feature to production. As part of the release planning, we felt confident because we had a recovery strategy in place. If something went wrong, we could roll back the deployment. We also had a daily recovery script that processed logs every evening at 8 p.m. and was expected to help us restore customer data if needed.

On paper, the plan looked solid.

The uncomfortable truth was that neither the rollback process nor the end-to-end recovery procedure had been exercised in months. For nearly six months, we had trusted that everything would work when needed, but we had never validated it under realistic conditions.

Unfortunately, the release exposed a defect. A small implementation mistake resulted in critical data not being passed correctly into the generated JSON output. Instead of producing usable recovery data, the process created empty files.

At first, the issue seemed manageable because we believed our recovery mechanisms would protect us.

They didn't.

When we attempted recovery, we discovered that our assumptions were stronger than our evidence. The recovery path we trusted had not been tested thoroughly enough to reveal the gap. As a result, one of our major clients lost thousands of records, and we were unable to recover them as expected.

What stays with me is that the root cause wasn't only the defect itself.

The bigger problem was our confidence in a safety net we had never fully verified.

Since then, I've viewed backup, rollback, and recovery testing differently. A backup is not proof of recoverability. A rollback plan is not proof of recoverability. Documentation is not proof of recoverability.

The only proof is successfully performing the recovery before the emergency happens.

That's why my answer is simple:

The test everyone assumed was "probably fine" was our rollback and recovery drill.

And the day it failed was the day we learned the difference between having a recovery plan and having a proven recovery capability.
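As a small illustration of checking recovery output before it is needed, here is a sketch that flags recovery files that are empty, unparseable, or contain no data. The function name, the `*.json` layout, and the idea of a single recovery directory are assumptions for the example, not details of the system described above. A check like this complements a full restore drill; it does not replace one.

```python
import json
from pathlib import Path


def find_unusable_recovery_files(directory):
    """Return recovery files that are empty, invalid JSON, or hold no data."""
    unusable = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # A zero-byte or truncated file cannot be used for recovery.
            unusable.append(path)
            continue
        if not data:
            # Valid JSON that is empty ({} or []) is just as useless.
            unusable.append(path)
    return unusable
```

Running something like this right after the nightly script, and alerting when the returned list is not empty, would surface empty output on the first evening rather than on the day of the recovery.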