Cloud Service Reliability – Part 3: ‘Antifragile’ – When DevOps met SRE

In part 2 of this blog series I introduced the concept of Antifragile systems (or services): systems that are neither fragile nor robust, but that thrive in chaos. They are architected, developed, deployed and run from the ground up to achieve the SLOs of availability and responsiveness expected from today’s…

Cloud Service Reliability (Part I): Apollo 13 to Google SRE

Apollo 13, in my humble opinion, is the best movie ever made on engineering, and especially on reliability engineering in action. It is certainly one of my personal favorite movies of all time. But how I view the movie differs by a full 180 degrees from how the rest of my family views it. To them it is…