Chaos Engineering
Chaos engineering is a discipline that consists of deliberately causing failures in production (or in a realistic pre-production environment) to test the resilience of a distributed system and uncover weaknesses before…
Chaos engineering is a discipline that consists of deliberately causing failures in production (or in a realistic pre-production environment) to test the resilience of a distributed system and uncover weaknesses before they cause a user-facing incident.
The approach was popularised by Netflix with Chaos Monkey, a tool that randomly shuts down production instances. The practice spread thanks to platforms like Gremlin, AWS Fault Injection Simulator and LitmusChaos.
It is a natural complement to SRE culture and observability: you don't really know whether a system is resilient until you've tested it.
