This article was written by Gary Parker and published on the TestProject Blog
As industries grow, so do the platforms supporting their scale, and with that growth we keep adding potential causes of failure. In an ideal world, we build our systems so that they’re broken down into smaller services or components that form part of a larger system.
But what if one part fails? And in a worst-case scenario, how does the legacy monolith that your team has been supporting recover from an outage? Can you get the system back up and running in minutes? Hours? Days?
For smaller companies, a few hours of downtime might not be a big deal – your customers are loyal and will return. But what about larger companies with millions of customers per hour? Will those customers wait for your system to come back online, or go to a competitor with a more robust offering?
What is Chaos Engineering?
We expect the services and platforms that we use daily to always be available and highly responsive. When there is an outage on your favorite platform, everyone hears about it – for the consumer, it’s an annoyance, for the provider, it’s expensive. “59% of Fortune 500 companies experience at least 1.6 hours of downtime per week, which can cost up to $46 million per year.”
In response to this, many enterprise companies have introduced some form of Chaos Engineering – the practice of intentionally causing infrastructure or system-related issues to analyze how the platform behaves.
At the start, the most common outcome will obviously be outages or failures in the system. The idea is that you learn from them to build a more resilient and robust platform. Over time you’ll make more refined improvements – for example, understanding how individual services perform under stress.
A Brief History and Background
The concept of Chaos Engineering has been around for over a decade now, originally popularized by the Netflix engineering team who created a tool called Chaos Monkey. It has gone by various names – Chaos Testing, Fault Injection Testing, and so on – all focusing on the same goal of experimenting on the system and measuring the impact.
In more recent times, Netflix has evolved the Chaos Monkey concept and introduced a variety of specialist monkeys as part of their Simian Army. Latency, Conformity, Doctor, and Janitor are a few examples of the monkeys they have branched out into.
They each have their own responsibilities, behaviors, and criteria for analysis. This approach makes it easy to introduce refined Chaos – Chaos that isn’t completely random and is focused on a specific type of learning.
More recently, many of the larger cloud platforms such as AWS and Azure have started to provide their own built-in Chaos/fault injection tools, allowing you to run and report on experiments easily.
This is great for broadening test coverage and knowledge of a testing type that has historically been quite difficult to access (unless you build your own tooling in-house).
Should I Get Involved in Chaos Engineering?
Yes. Even if you’re only responsible for a small-scale system with a few interconnecting components, there is value in testing the robustness of your system. You can catch potential issues before your user base does.
As mentioned previously, the learning curve is not as steep as it used to be, and there are plenty of tools on the market to help you get started. It’s also a very valuable initiative for the company you are working for, so creating a business case to fund the time and costs should be a fairly straightforward process.
Key Takeaways
Experiment often
Your platform, infrastructure, and the world around them are constantly evolving, so you need to experiment as often as you can. You may not get the results you’re expecting straight away, but over time the incremental findings will add up.
Break down your system into smaller components, analyze the potential breaking points, and experiment around them. Does high traffic cause an unresponsive experience for your users, despite having an auto-scaling setup? Simulate spikes in CPU or memory usage and analyze how your system behaves.
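To make that concrete, here is a minimal sketch in Python – purely an illustration, not tied to any particular chaos tool, and the 60-second duration is an arbitrary assumption – that pins every CPU core on a host for a fixed period so you can watch how your monitoring and auto-scaling respond:

```python
import multiprocessing
import time

def burn_cpu(seconds: int) -> None:
    """Busy-loop on one core for the given duration."""
    end = time.time() + seconds
    while time.time() < end:
        pass  # tight loop keeps the core at ~100%

def simulate_cpu_spike(seconds: int = 60) -> None:
    """Pin every available core for `seconds` to mimic a load spike."""
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    # Watch your dashboards and scaling rules while this runs.
    simulate_cpu_spike(seconds=60)
```

Run it on a test host (never straight in production on day one) and compare what your alerting and scaling actually did against what you expected.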
Be proactive, not reactive to failures
Don’t wait for something to go wrong. Assume that if it hasn’t gone wrong yet, it will at some point. Stay ahead of your audience and customers, and proactively make improvements to protect against such failures.
It’s much easier to recover from downtime or an outage if you have previously run through similar scenarios and have findings to support the recovery.
It’s not just about causing failures, it’s about learning from them
Breaking things is easy (and fun), but if you’re not learning anything from these situations, there is no benefit. Before you start any experiment, you should have some rough learning objectives and expectations. It should be controlled, measurable, and produce actionable learnings once completed.
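As a sketch of what “controlled and measurable” can look like in practice (Python, with a hypothetical health-check URL and latency threshold – swap in your own service’s values), the experiment below states a hypothesis up front, refuses to start if the system is already unhealthy, injects whatever fault you pass in, and records whether the hypothesis still holds afterwards:

```python
import time
import urllib.request

# Hypothetical health endpoint and threshold -- replace with your own.
HEALTH_URL = "http://localhost:8080/health"
MAX_ACCEPTABLE_LATENCY_S = 0.5

def steady_state_ok() -> bool:
    """Hypothesis: the service answers its health check within 500 ms."""
    start = time.time()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            healthy = resp.status == 200
    except OSError:
        return False
    return healthy and (time.time() - start) <= MAX_ACCEPTABLE_LATENCY_S

def run_experiment(inject_fault, duration_s: int = 60) -> dict:
    """Check the hypothesis, inject the fault, then re-check and report."""
    result = {"hypothesis_before": steady_state_ok()}
    if not result["hypothesis_before"]:
        result["aborted"] = True  # never start from an already-broken state
        return result
    inject_fault()           # e.g. kill a container, add latency, fill a disk
    time.sleep(duration_s)   # observation window
    result["hypothesis_after"] = steady_state_ok()
    result["aborted"] = False
    return result
```

Whatever tooling you end up using, the shape is the same: a hypothesis, a controlled fault, an observation window, and a recorded outcome you can act on.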
Robustness and resilience of your system should be SLAs that you monitor
We tend to measure the healthy things when everything is going great – how fast is our system, what’s the response time, how many users do we have at one time, etc. We don’t really measure when things go badly, usually because everyone is panicking and trying to get things back online.
But we should. How long does it take to get your platform back online from an outage? How quickly can your infrastructure scale when it sees a surge in customers?
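One simple way to start treating recovery time as something you measure rather than something you panic through is to time it during your experiments. The sketch below (Python, polling a hypothetical health endpoint – substitute your own) starts the clock when the platform stops responding and stops it on the first healthy response:

```python
import time
import urllib.request

def is_up(url: str = "http://localhost:8080/health") -> bool:
    """Hypothetical health check -- point this at your own endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

def measure_recovery_time(poll_interval_s: float = 5.0) -> float:
    """Return seconds of downtime for the next outage that occurs."""
    # Wait until the outage actually begins (e.g. right after injecting a fault).
    while is_up():
        time.sleep(poll_interval_s)
    outage_start = time.time()
    while not is_up():
        time.sleep(poll_interval_s)
    return time.time() - outage_start

if __name__ == "__main__":
    print(f"Recovered after {measure_recovery_time():.0f}s")
```

Track that number across experiments and it becomes a recovery-time SLA you can trend, rather than a guess made during the next real incident.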
Have fun causing Chaos!
I’ve personally carried out quite a few Chaos tests and experiments, and I love it. For something named ‘Chaos Engineering,’ there is a surprising amount of control and planning. It’s not just about breaking things, it’s about building more resilient systems.
Conclusion
Web. Mobile. Performance. Security. Chaos. Add Chaos to your engineering skillset – it will benefit both your personal development and the product you are supporting. There are massive technical and financial gains to be had, so don’t let it be an afterthought!