Thanks to Covid, vaccines are being discussed everywhere. What does a vaccine do? It exposes the body to a weakened or harmless form of a disease-causing microorganism so that the immune system learns to fight it. We build up immunity and are far less likely to fall seriously ill if we meet the real thing now or in the future.
Chaos testing works on a similar principle. Every computer system has its own limits and its own points of failure. If we deliberately inject faults that cause disruption, teams can identify the system's weaknesses and vulnerabilities and then put step-by-step fixes in place. The most important of these are the protocols that eventually make the system more resilient and fault tolerant.
How is Chaos testing different from normal testing?
Chaos testing differs from regular testing in several ways.
- Chaos testing factors in the many touchpoints that lie outside the control of IT teams, whereas regular testing considers only those within the testing purview.
- Regular testing usually takes place during the build/compile phase of the project, whereas chaos testing happens once the system is ready.
- Unlike chaos testing, regular testing does not usually cover varying configurations, behaviors, outages, and other disruptions caused by third parties.
- Regular testing rarely deals with resolving a negative impact on end users immediately. A failure typically leaves the system disabled until it is fixed and testing can resume. Chaos testing, on the other hand, deliberately introduces issues into the system to observe how it behaves.
- In regular testing, a blocking bug can leave the system hung. Chaos testing has a predetermined abort plan, which leaves room for error in case the anticipated reactions turn out to be incorrect.
Chaos testing test pyramid
A typical test pyramid for Chaos testing looks like the following:
- Unit Testing
Unit testing helps us evaluate the specific behavior of a single component or unit. The unit must be isolated from all its dependencies and tested as a standalone entity.
- Integration Testing
Integration testing focuses on the interactions between individual units and how they are connected to each other. Once the unit tests pass, engineers connect the units and test them together. This helps ascertain how stable the integrated units are as a group.
- System Testing
System tests proactively evaluate how the entire computer system reacts under the increased stress of a particular worst-case failure scenario. They are typically performed under real-world conditions in production environments.
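
The unit level of the pyramid can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PriceService`, `fetch_price`, and the fallback price are hypothetical names invented for the example and do not come from any real library. The unit is cut off from its real dependency by a stub, and a failure is injected into that stub to check that the unit degrades gracefully.

```python
from unittest import mock


class PriceService:
    """Hypothetical unit under test: looks up a price through a remote client."""

    def __init__(self, client, fallback_price=0.0):
        self.client = client
        self.fallback_price = fallback_price

    def price_for(self, item_id):
        try:
            return self.client.fetch_price(item_id)
        except ConnectionError:
            # Degrade gracefully instead of propagating the outage.
            return self.fallback_price


def test_price_service_survives_dependency_outage():
    # Replace the real dependency with a stub and inject a failure into it.
    failing_client = mock.Mock()
    failing_client.fetch_price.side_effect = ConnectionError("injected fault")

    service = PriceService(failing_client, fallback_price=9.99)

    assert service.price_for("sku-1") == 9.99
    failing_client.fetch_price.assert_called_once_with("sku-1")


test_price_service_survives_dependency_outage()
```

The same idea scales up the pyramid: at the integration and system levels the stub is replaced by real components, and the fault is injected into the running environment instead.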
Chaos Experiment
A set of predefined metrics should be in place before a chaos test is conducted. These metrics help in evaluating the results, and any irregular behavior should be trackable through them. If the chaos test begins to affect the system too much and has a negative impact on customers, a rollback plan should be ready. IT teams should be prepared to manage any complex situation, and for that, alert mechanisms should be in place.
Some examples of conditions that may need alert mechanisms include the following:
- CPU exhaustion
- Dependency failures
- Memory overload
- Network latency and failure
- Retry storms
- Race conditions
- Significant fluctuations in the input
- Failures in communication between services
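
The combination of a tracked metric, an alert condition, and a rollback plan described above can be sketched as follows. This is a minimal simulation, not a real chaos tool: `SimulatedService`, `run_latency_experiment`, and all the latency numbers are assumptions made up for the example. The experiment injects extra network latency, watches the latency metric, aborts as soon as a threshold is crossed, and always rolls the fault back.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimulatedService:
    """Hypothetical stand-in for a real system; it only models latency."""
    base_latency_ms: float = 50.0
    injected_latency_ms: float = 0.0

    def observe_latency_ms(self, rng):
        # One simulated request: base latency, the injected fault, small jitter.
        return self.base_latency_ms + self.injected_latency_ms + rng.uniform(0, 10)


@dataclass
class ExperimentResult:
    aborted: bool
    samples: list = field(default_factory=list)


def run_latency_experiment(service, extra_latency_ms, abort_threshold_ms,
                           requests=20, seed=0):
    rng = random.Random(seed)
    result = ExperimentResult(aborted=False)
    service.injected_latency_ms = extra_latency_ms  # inject the fault
    try:
        for _ in range(requests):
            latency = service.observe_latency_ms(rng)
            result.samples.append(latency)
            if latency > abort_threshold_ms:  # alert condition reached
                result.aborted = True
                break
    finally:
        service.injected_latency_ms = 0.0  # rollback, even after an abort
    return result


svc = SimulatedService()
mild = run_latency_experiment(svc, extra_latency_ms=100, abort_threshold_ms=500)
severe = run_latency_experiment(svc, extra_latency_ms=600, abort_threshold_ms=500)
print(mild.aborted, len(mild.samples))      # False 20
print(severe.aborted, len(severe.samples))  # True 1
```

Putting the rollback in a `finally` block is the design choice worth noting: the fault is removed whether the experiment finishes, aborts, or raises an unexpected error.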
Benefits of Chaos testing
Below are the key benefits of chaos testing:
Benefit | Description |
Five-nines availability | One of the key benefits of chaos engineering is very high availability of the system for its end users. Five-nines availability means the system is up 99.999% of the time, so outages are very rare. |
Financial profits | Even a very short outage can cause companies to lose millions of dollars. Because chaos testing helps keep the system up, companies see it as a way to protect and increase revenue. |
Better disaster recovery plan in place | Chaos testing is a way to proactively eliminate, or at least reduce, the frequency and severity of system disasters. Teams are better equipped to handle them and therefore have better plans in place. The plans improve with each disaster avoided or recovered from. |
Efficient coding | Since engineers know that their code will be subjected to chaos testing, they are challenged to write better code so that the final system is as resilient as possible. They start thinking outside the box and bring in innovative ideas. |
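
To make the five-nines figure concrete, the short calculation below converts an availability percentage into a yearly downtime budget. The function name `allowed_downtime_minutes` is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes per year")
# 99.9% -> 525.60 minutes per year
# 99.99% -> 52.56 minutes per year
# 99.999% -> 5.26 minutes per year
```

In other words, a five-nines target leaves a little over five minutes of downtime for the whole year.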
Conclusion:
The goal of chaos engineering is to inform organizations of previously unknown vulnerabilities and unanticipated behaviors in a computer system. This helps companies identify, before an outage occurs, the hidden problems these vulnerabilities might cause in production, including failures that originate outside the organization's control. Recovery teams can then address systemic weaknesses and put solutions in place to enhance the system's overall fault tolerance and resiliency.
Technology today is advancing fast, and more and more of it is Internet-based. No organization is safe from system failure. Infrastructure is becoming more and more volatile. Systems will break. And as the complexity of cloud-based technologies continues to grow, these systems will break even more frequently, in even more varied ways, and at the most inconvenient times. Chaos testing helps large and small organizations run tests that mimic these vulnerabilities and recreate real-life issues while the system is under test.
With the availability of modern tools, this new way of testing is fast becoming a popular way to give users safe and secure software.