Chaos Testing

We always want our production systems to be up and running, and when they go down, getting them back up is the first order of business for any organization. Eventually, though, you have to accept that systems will fail, and focus on making sure they can get back online as fast as possible. One way to do this is to fail your system in a controlled manner and check whether it can recover by itself when an outage hits. This is especially useful in Kubernetes, which is designed to be self-healing: you should be able to break things at the infrastructure level and watch the cluster heal automatically. You also need to check at the application level whether your applications support this self-healing behavior. This process is called chaos testing.

Chaos testing is a strategy for testing the resiliency of a system by intentionally introducing failures and observing how the system responds. It’s essential to conduct chaos testing in a controlled and systematic manner to avoid unexpected damage to production environments. Before starting, clarify what you want to achieve: are you testing system resilience, latency under load, or fault recovery mechanisms? State the goal as a hypothesis, for example, “If a service crashes, the system should recover within X seconds without affecting customer experience.” It’s best not to start in production. Begin in a non-production environment to experiment with chaos testing tools and techniques, which reduces the risk of significant outages during testing. If you do later test in production, ensure you have safeguards like traffic routing, feature toggles, and rate limits to limit the blast radius.
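A hypothesis like “the system should recover within X seconds” can be turned into an automated check. The following is a minimal sketch, not a definitive implementation: the function name `wait_for_recovery` and the `is_healthy` probe are hypothetical, and in a real test the probe would call whatever health endpoint or readiness signal your service exposes.

```python
import time
from typing import Callable, Optional


def wait_for_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health probe after a failure has been injected.

    Returns the number of seconds until the probe first reported healthy,
    or None if the system did not recover within timeout_s.
    `is_healthy` is a hypothetical probe supplied by the caller.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The `clock` and `sleep` parameters exist only so the check can be exercised without real waiting. In normal use you would pass just the probe and the X from your hypothesis as `timeout_s`, and treat a `None` result as a failed hypothesis.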

At first, it’s best to start small by simulating a single service failure or delay before moving on to more complex multi-service or system-wide failures. Good first experiments include killing pods, introducing network latency, and stressing CPU or memory. The idea is that chaos testing should replicate real-world scenarios like network disruptions, service crashes, or hardware failures. On the other hand, spending time simulating exceptional cases such as etcd having issues might not be a good idea, since they are pretty unlikely to happen.
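One way to keep early experiments small is to cap how many pods a single run may target. The sketch below shows only the selection logic, under stated assumptions: `pick_targets` and the `max_fraction` parameter are hypothetical names, the pod names are made up, and actually killing the chosen pods is left to whichever chaos tool you use.

```python
import random
from typing import List, Optional


def pick_targets(
    pods: List[str],
    max_fraction: float = 0.25,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose a random subset of pods to disrupt, bounded by max_fraction.

    At least one pod is chosen when any exist, so that even a small
    deployment gets a single-pod experiment.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in the range (0, 1]")
    if not pods:
        return []
    rng = rng or random.Random()
    limit = max(1, int(len(pods) * max_fraction))
    return rng.sample(pods, limit)


# Hypothetical pod names, for illustration only.
example_pods = [f"web-{i}" for i in range(8)]
example_targets = pick_targets(example_pods, max_fraction=0.25)
```

With eight pods and a limit of 25%, each run disrupts two pods. Raising `max_fraction` gradually is one way to move from single-service failures toward larger ones.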

Finally, ensure you log all chaos test activities, system metrics, and response times. Review logs and performance metrics to identify failures, and perform a detailed analysis of any unexpected outcomes. Was the hypothesis correct? Did the system recover as expected? What can be improved? Then, based on the results, refine your hypotheses and test scenarios, and add new failure scenarios based on what you’ve learned. Keep repeating this process so that the system remains resilient as new changes are deployed or new features are added. You can also integrate these tests into your CI/CD pipelines so that they run every time you make a change.
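The review step can be sketched as a small record of each experiment plus a pass/fail summary that a pipeline can act on. This is an illustrative sketch: `ExperimentResult`, `failed_experiments`, and `ci_exit_code` are hypothetical names, and the sample experiments and timings are invented for the example, not measured results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExperimentResult:
    name: str
    expected_recovery_s: float            # the X from the hypothesis
    observed_recovery_s: Optional[float]  # None means it never recovered

    @property
    def hypothesis_held(self) -> bool:
        """True if the system recovered within the expected time."""
        return (
            self.observed_recovery_s is not None
            and self.observed_recovery_s <= self.expected_recovery_s
        )


def failed_experiments(results: List[ExperimentResult]) -> List[str]:
    """Names of experiments whose hypothesis did not hold."""
    return [r.name for r in results if not r.hypothesis_held]


def ci_exit_code(results: List[ExperimentResult]) -> int:
    """Non-zero when any hypothesis failed, so a CI job can fail the build."""
    return 1 if failed_experiments(results) else 0


# Invented sample data, for illustration only.
sample_results = [
    ExperimentResult("pod-kill", expected_recovery_s=30, observed_recovery_s=12.5),
    ExperimentResult("network-delay", expected_recovery_s=10, observed_recovery_s=None),
]
```

For the sample data, `failed_experiments(sample_results)` returns `["network-delay"]`, and `ci_exit_code(sample_results)` returns `1`. The failed names are a natural starting point for refining the hypotheses in the next round.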

Now that you know what chaos testing is, its purpose, and what your objectives are, let’s jump into a Chaos Lab with Chaos Mesh.
