Ever since Google published the Site Reliability Engineering (SRE) book in 2016, the SRE movement has changed how organizations look at reliability, incident response, and incident management. Not unlike DevOps, adopting SRE results in an organizational cultural shift. It is a shift that changes how organizations are structured and how information flows within an organization that would allow for…
Tag: SRE
Understanding Observability
Last week my understanding of Observability went up astronomically. In fact, it was taken to an all-new dimension by a single tweet from Charity Majors (@mipsytipsy): Observability, short and sweet: – can you understand whatever internal state the system has gotten itself into? …just by inspecting and interrogating its output? …even if (especially if)…
Cloud Service Reliability – Part 3: ‘Antifragile’ – When DevOps met SRE
In part 2 of this blog series I introduced the concept of Antifragile systems (or services): systems that are neither fragile nor robust, but that thrive in chaos. They are architected, developed, deployed and run from the ground up in a manner to achieve the SLOs of availability and responsiveness expected from today’s…
Cloud Service Reliability (Part 2): Houston, we have an… outage!
Of all the phrases from movies that have become a part of pop culture, from ‘Luke, I am your Father’ to ‘Play it again, Sam’, none has made it into daily usage more than ‘Houston, we have a problem!’ from the movie Apollo 13. Reminding my son ‘Saransh, I am their father’ does not…