Chaos Days are a practice within the field of chaos engineering, which is defined as:

> The discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Modern systems consist of a large number of complex components, with equally complex connections between them. Defects are always present, and failures will occur. For an IT system, turbulence comes in many forms, ranging from a single point of failure to multiple unrelated failures, often combined with sudden changes in external pressure (e.g., traffic spikes).
The complexity of IT systems makes it impossible to predict how they will respond to this turbulence. One example was a Google Cloud outage that led to reports of people being unable to operate their home air conditioning. The trigger was the combined impact of three separate, unrelated bugs, which severely impacted the Google Cloud US network for several hours.
Chaos Days allow teams to safely explore these turbulent conditions by designing and running controlled experiments in pre-production or production environments. Each experiment injects a failure into the system (e.g., terminate a compute instance, fill up a storage device) in order to analyse the impact and system response.
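The shape of such an experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Experiment` class, the in-memory `instances` pool and the helper functions are hypothetical stand-ins, not part of any chaos tooling. A real experiment would call your own infrastructure and monitoring instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical controlled experiment: check, inject, observe, roll back."""
    name: str
    steady_state: Callable[[], bool]  # returns True while the system looks healthy
    inject: Callable[[], None]        # introduces the failure
    rollback: Callable[[], None]      # restores the system afterwards
    observations: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Only start from a known-good state; otherwise the result tells us little.
        if not self.steady_state():
            self.observations.append("aborted: system unhealthy before injection")
            return False
        self.inject()
        try:
            survived = self.steady_state()
            self.observations.append(
                "steady state held" if survived else "steady state broken"
            )
            return survived
        finally:
            # Always undo the failure, even if the health check itself raises.
            self.rollback()


# Assumed toy system: two instances behind a load balancer.
instances = {"web-1": True, "web-2": True}


def healthy() -> bool:
    # Steady state here means at least one instance is still serving.
    return any(instances.values())


def terminate_web_1() -> None:
    instances["web-1"] = False


def restore_web_1() -> None:
    instances["web-1"] = True


experiment = Experiment("terminate one instance", healthy, terminate_web_1, restore_web_1)
result = experiment.run()
print(result, experiment.observations)
```

The rollback sits in a `finally` block so that the injected failure is removed however the observation step ends, which is one way to keep the experiment controlled.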
An IT system includes the people who develop and operate it and the knowledge, experience, tools and processes they use to respond to incidents. As John Allspaw puts it:
The analysis of a team’s response to incidents provides many lessons that can lead to improved system resilience. The Oxford Dictionary defines resilience as:
- the capacity to recover quickly from difficulties
- the ability of a substance or object to spring back into shape; elasticity.
Some learning may be technical, such as implementing retry mechanisms and circuit breakers; other types of learning include the processes a team uses to detect, triage and resolve an incident (such as communication channels, escalation routes, runbooks, etc.).
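As an illustration of the technical kind of learning, the following is a minimal sketch of a retry helper combined with a circuit breaker. The names `CircuitBreaker`, `CircuitOpenError` and `retry`, and the default thresholds and delays, are assumptions made for this example rather than the API of any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit again
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times with exponential backoff between tries."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller might wrap a dependency as `retry(lambda: breaker.call(fetch))`, where `fetch` is whatever function talks to the downstream service. The retry absorbs brief glitches, while the breaker stops callers from hammering a dependency that is clearly down.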
A key but often underestimated benefit is the sharing of internal knowledge that individual team members rely on when working through an incident. Spreading this knowledge across the team improves resilience simply because more people understand system behaviour and failure scenarios when they tackle production incidents or develop system enhancements.
Chaos Days improve system resilience by developing its:
- People, by giving them new:
- Processes, through guiding improvements in:
- Products/toolsets, by initiating changes that: