SRE
The SRE (Site Reliability Engineer) is an engineer dedicated to the reliability of production systems: availability, performance, latency, capacity, incident management and operations automation.
The practice was formalised at Google and popularised by the book Site Reliability Engineering (2016). It introduced concepts now considered standard: SLIs (service level indicators, the metrics actually measured), SLOs (service level objectives, the targets set for those metrics), the error budget (the amount of unreliability the SLO tolerates, used to arbitrate between shipping new features and stabilising), toil (repetitive manual work that should be automated), and blameless post-mortems.
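The relationship between SLI, SLO and error budget can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a request-based availability SLI, and the function names (sli, error_budget, budget_remaining) and the sample figures are hypothetical, not taken from any particular tool or from the book.

```python
def sli(good_events: int, total_events: int) -> float:
    """Availability SLI: fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget(slo: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window.

    slo is a fraction strictly between 0 and 1, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total_events


def budget_remaining(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - bad_events / error_budget(slo, total_events)


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 requests in the window, 500 failed.
    total, bad = 1_000_000, 500
    slo = 0.999
    print(f"SLI: {sli(total - bad, total):.4%}")
    print(f"Error budget: {error_budget(slo, total):.0f} failed requests")
    print(f"Budget remaining: {budget_remaining(slo, total, bad):.0%}")
```

The same formula applies to time-based windows: a 99.9% SLO over 30 days (43,200 minutes) leaves roughly 43.2 minutes of tolerated unavailability. A common policy, assumed here rather than prescribed, is to slow or pause feature releases once the remaining budget reaches zero.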
The SRE role is close to DevOps, but places a stronger emphasis on software engineering applied to operations (Google's guideline is that at least 50% of an SRE's time goes to engineering work rather than operational tasks) and on quantified, measured reliability objectives.
