Chaos Engineering, or chaos management in infrastructure, began to gain popularity around 2010. Project workloads kept growing, and architectures became more complicated in response to that growth and to the increasing complexity of inter-service calls. Around this time, Netflix developers decided it would be a good idea to test how their infrastructure reacted to random but deliberately programmed failures at different levels.
Chaos Engineering was uncharted territory at that time, explored only by large companies and a few enthusiasts. Today, if we build a solution in the cloud, it becomes increasingly difficult to create a truly redundant and fault-tolerant infrastructure without Chaos Engineering. It is increasingly argued that Chaos Engineering is a mechanism that helps uncover “unknown unknowns” (the most dangerous issues for infrastructure and applications: the things we don’t know about and don’t understand).
As a result, a decent set of tools has appeared that lets you start working with the Chaos Engineering concept in about five minutes.
In this article, we will look at the challenges Chaos Engineering addresses and run an experiment, that is, a scenario in which a failure is created automatically, using Managed Kubernetes as an example. We will also discuss how to turn this from a one-off exercise into a process, and where that process fits in a modern company.
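To make the idea concrete before we get to the experiment itself, here is a minimal sketch of what the crudest form of fault injection against a Kubernetes cluster could look like, using the official kubernetes Python client. The demo namespace is a placeholder, and the experiment later in the article relies on dedicated tooling rather than an ad-hoc script like this.

```python
import random
from kubernetes import client, config  # pip install kubernetes

NAMESPACE = "demo"  # hypothetical target namespace, used here only for illustration

def kill_random_pod(namespace: str) -> None:
    """Delete one random pod in the namespace and let the cluster try to self-heal."""
    config.load_kube_config()            # use the local kubeconfig context
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace).items
    if not pods:
        print(f"no pods found in namespace {namespace!r}")
        return
    victim = random.choice(pods)
    print(f"deleting pod {victim.metadata.name}")
    v1.delete_namespaced_pod(name=victim.metadata.name, namespace=namespace)

if __name__ == "__main__":
    kill_random_pod(NAMESPACE)
```

If the workload behind those pods is healthy, Kubernetes should recreate the deleted pod within seconds; if it does not, even this primitive experiment has already told us something useful.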
A Modern IT System Must Respond To External Challenges
In business, as in life, you must constantly prioritize and choose areas of work that will, with some degree of certainty, reduce time to market or bring in money. At first glance, managing chaos does neither. But that is not the case. A misunderstanding of what chaos management is, and fear of its basic definition, which says that we “intentionally break systems,” has long kept companies from studying and applying it.
However, time passes, and the pressure on systems grows every year. In recent years, both the amount of traffic and the speed at which users expect a response from web systems have kept rising. Users expect an answer quickly, and they don’t care whether the system is under heavy load or not. If they don’t get a response quickly, they leave. Imagine that we announced a project on Habr and caught the “Habr effect”: no one will wait for a response under peak load. That is why people say a slow project is worse than one that is down. If the site simply doesn’t open, the user may come back later; if it is slow, they will blame the developers and walk away disappointed.
All this should keep architects and operations teams on their toes: they must constantly look for bottlenecks and test the system under the most uncomfortable and unusual conditions.
Since conditions are constantly changing, we cannot rely on classical solutions alone; we need flexible approaches and deliberate experiments. Cloud platforms help solve these problems through elasticity (flexible provisioning of resources on demand), reliability, and a pay-as-you-go model. However, when we migrate to the cloud, or design an application to use the full power of cloud tools (load balancing, autoscaling, and many others) as part of the Cloud Native approach, we run into the fact that the system may behave in unexpected ways.
Each component is thoroughly tested and debugged, both on its own and in combination with the others. We know what to expect from each microservice, from platforms such as Kubernetes, and from external services that handle routine tasks, such as a mail server. But with external conditions constantly changing, it is hard to predict how one component will behave when another starts behaving abnormally. Dependency hell also sets in: there is so much of everything that understanding what is happening and supporting it effectively becomes difficult and very expensive. In such conditions, the cause of a failure is hard to find, and finding it in production is costly.
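To illustrate why these interactions are so hard to reason about, here is a small, self-contained sketch (the function names and probabilities are invented for illustration) of a fault-injection wrapper that makes a call to a downstream dependency occasionally slow or failing, which is exactly the kind of non-standard behaviour a chaos experiment provokes on purpose.

```python
import functools
import random
import time

def inject_chaos(latency_s=2.0, error_rate=0.2, slow_rate=0.3):
    """Randomly delay or fail the wrapped call to simulate a misbehaving dependency."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise ConnectionError(f"injected failure in {func.__name__}")
            if roll < error_rate + slow_rate:
                time.sleep(latency_s)  # simulate a degraded, slow response
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_chaos()
def fetch_inventory():
    # Stand-in for a call to another microservice (e.g. over HTTP or gRPC).
    return {"items": 42}

if __name__ == "__main__":
    for attempt in range(5):
        try:
            print(attempt, fetch_inventory())
        except ConnectionError as err:
            print(attempt, "fallback used:", err)
```

Applied to a real service call, a wrapper like this quickly shows whether the caller has sensible timeouts and fallbacks, or whether a single slow dependency drags the whole request path down with it.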