Understanding Observability

Last week my understanding of Observability went up astronomically. In fact, it was taken to an all-new dimension by a single tweet from Charity Majors (@mipsytipsy): Observability, short and sweet: – can you understand whatever internal state the system has gotten itself into? …just by inspecting and interrogating its output? …even if (especially if)…

Cloud Service Reliability (Part 2): Houston, we have an… outage!

Of all the phrases from movies that have become a part of pop culture – from ‘Luke, I am your Father’ to ‘Play it again, Sam’ – none has made it into daily usage more than ‘Houston, we have a problem!’ from the movie Apollo 13. Reminding my son ‘Saransh, I am their father’ does not…

Cloud Service Reliability (Part I): Apollo 13 to Google SRE

Apollo 13, in my humble opinion, is the best movie ever made on engineering, and especially on reliability engineering in action. It is certainly one of my personal favorite movies of all time. But how I view the movie differs by a full 180 degrees from how the rest of my family views it. To them it is…