Understanding Observability

Last week my understanding of Observability went up astronomically. In fact, it was taken to a whole new dimension, all thanks to one tweet from Charity Majors (@mipsytipsy):

Observability, short and sweet:
– can you understand whatever internal state the system has gotten itself into?
…just by inspecting and interrogating its output?
…even if (especially if) you have never seen it happen before? https://t.co/0HslcUKoOQ
— Charity Majors (@mipsytipsy) November 26, 2019

Let me begin by taking you back to a recent consulting engagement. I was helping a client determine how to improve the maturity of their SRE team. (Spoiler alert – they sucked. Renaming a team of DevOps engineers as ‘SREs’ does not make them SREs.) They had built what they called their ‘Observability’ platform, and it was pretty impressive. (Their concept of SRE, however, was a gross misunderstanding of what it means to adopt SRE in an enterprise that does not truly have resilient, atomic services like Google, but more on that in a later post.)

Their ‘Observability’ platform allowed them to monitor their services at the infrastructure, system and service levels. They continuously ran a slate of synthetic transactions at a fixed cadence from a server outside the firewall and measured key SLIs. Their alert automation was on steroids: any time any threshold of any SLI was breached, alerts were sent out and tickets were opened and assigned. Monitoring across the infrastructure, system and service levels let them pinpoint the areas of the stack potentially impacted, so the ‘SRE’ team could respond appropriately. I walked away from the engagement having delivered a set of recommendations to improve their processes and their SRE maturity, and with their monitoring and alerting setup as my working understanding of Observability.

But I was just scratching the surface. After dissecting Charity’s tweet, and others in her Twitter thread, I understood that what my client had was good Telemetry, not Observability. Let me explain by parsing Charity’s tweet and giving my interpretation of each part.

Statement 1:

– can you understand whatever internal state the system has gotten itself into?

Oh sure. If I have good telemetry at both the system level and the individual service level, I can understand what state my system is in.

Statement 2:

…just by inspecting and interrogating its output?

Now that’s interesting. Just by inspecting logs (yes, logs are output too) and interrogating the regular output data coming from the system. Maybe…

Statement 3:

…even if (especially if) you have never seen it happen before?

Houston, we are stumped. ‘If we have never seen it before’ is a problem. All monitoring is based on a prior determination of what needs to be monitored. Which SLIs are important to us and hence need to be measured? Which thresholds or SLOs, when breached, merit an alert or a ticket? What will we need to know about the system’s state at that point? These are all assumptions we have to make before we set up our monitoring regime, and they mean the monitoring cannot necessarily be used to determine the internal state of the system from ‘any’ perspective. Consider the statement in the original tweet Charity was responding to: ‘Observability is about being able to ask arbitrary questions about your environment without, and this is the key part, having to know ahead of time what you wanted to ask’. That part, as it states, is key: being able to ask arbitrary questions of the system without knowing beforehand what you may want to ask.
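To make the limitation concrete, here is a minimal Python sketch of threshold-based monitoring. It is not my client’s actual setup; the SLI names, thresholds and sample values are all made up for illustration. The point is that the checker can only ever flag what someone decided to measure ahead of time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    sli: str          # indicator name, chosen ahead of time
    max_value: float  # alert threshold, also chosen ahead of time


# Hypothetical rules: names and numbers are illustrative assumptions.
RULES = [
    ThresholdRule("checkin_latency_ms_p95", 800.0),
    ThresholdRule("seatmap_error_rate", 0.01),
]


def breached(measurements, rules=RULES):
    """Return the names of SLIs whose measured value exceeds its threshold."""
    return [
        rule.sli
        for rule in rules
        if measurements.get(rule.sli, 0.0) > rule.max_value
    ]


# A sample reading. The last key is a failure mode nobody wrote a rule for,
# so the checker never even looks at it.
sample = {
    "checkin_latency_ms_p95": 650.0,
    "seatmap_error_rate": 0.004,
    "ios13_seatmap_failures": 412.0,
}

alerts = breached(sample)
print(alerts or "No SLI outside its threshold, all must be healthy")
```

Every predefined SLI is within bounds, so no alert fires, even though the sample contains a problem that no rule anticipated.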

F35 cockpit. Image source: Internet

The cockpit of an F35 fighter jet has dozens of widgets providing the pilot with Telemetry about the aircraft and its operating environment. At any given moment the pilot can see what is happening in the jet and, with radar and other sensors, even around it. But she cannot ask the console questions that were not thought of beforehand. She can ask, ‘from my current position and heading, do I have enough fuel to get back to my carrier, which is moving at 15 knots on an SSE heading?’ She cannot, though, pull a Tony Stark and have a free-form conversation with J.A.R.V.I.S. in the Iron Man suit:

Tony: “How many (people who have fallen out of Air Force One) in the air?”
J.A.R.V.I.S.: “Thirteen, sir.”
Tony: “How many can I carry?”
J.A.R.V.I.S.: “Four, sir.”
– Source: Iron Man 3

What the F35 pilot has is Telemetry. Iron Man has Observability.

Going back to my client, they too have just Telemetry. In fact, great Telemetry. But not Observability. If they want to understand the state of the system at any given time, they can only make assumptions about that state based on the telemetry data they are gathering: ‘No SLI is outside its pre-determined threshold, so all must be healthy’. They can ask, ‘which service was performing slower than average when the last flight to Sydney from Dallas was boarded and closed?’ They can do root cause analysis of an outage by asking which services were impacted. They cannot, however, ask, ‘find me all the iOS Mobile App users on iOS version 13.x who were unable to load the seat map for Boeing 777-300ER aircraft while checking in for their flight to Sydney today’, similar to what Charity wants to ask in her seminal blog post on this topic. You cannot ask this question because you did not set up monitoring to capture the right data and, more importantly, you do not have the tracing capability to follow the series of events (failed seat map load while checking in) through the system across the multiple dimensions (user’s app type, OS version, aircraft type) needed to answer it. Because you did not know beforehand that you would need to ask this question. Hence, you have Telemetry, not Observability.
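Here is a rough Python sketch of what answering that seat map question could look like if the system emitted one wide, structured event per request and kept every dimension. The event fields, values and the `query` helper are hypothetical, not taken from any real tool, and the ‘today’ time filter is left out to keep it short.

```python
# Hypothetical wide events: one record per request, with all dimensions kept.
EVENTS = [
    {"trace_id": "t1", "user_id": "u1", "app": "ios", "os_version": "13.2",
     "action": "load_seat_map", "aircraft": "777-300ER",
     "destination": "SYD", "status": "error"},
    {"trace_id": "t2", "user_id": "u2", "app": "ios", "os_version": "13.1",
     "action": "load_seat_map", "aircraft": "777-300ER",
     "destination": "SYD", "status": "ok"},
    {"trace_id": "t3", "user_id": "u3", "app": "android", "os_version": "10",
     "action": "load_seat_map", "aircraft": "777-300ER",
     "destination": "SYD", "status": "error"},
    {"trace_id": "t4", "user_id": "u4", "app": "ios", "os_version": "13.3",
     "action": "load_seat_map", "aircraft": "787-9",
     "destination": "SYD", "status": "error"},
    {"trace_id": "t5", "user_id": "u5", "app": "ios", "os_version": "13.1",
     "action": "load_seat_map", "aircraft": "777-300ER",
     "destination": "SYD", "status": "error"},
    {"trace_id": "t6", "user_id": "u1", "app": "ios", "os_version": "13.2",
     "action": "load_seat_map", "aircraft": "777-300ER",
     "destination": "SYD", "status": "error"},
]


def query(events, **conditions):
    """Return events matching every condition.

    A condition value is either a literal to compare with ==, or a
    callable predicate applied to the field's value.
    """
    def matches(event):
        for field, expected in conditions.items():
            if field not in event:
                return False
            actual = event[field]
            if callable(expected):
                if not expected(actual):
                    return False
            elif actual != expected:
                return False
        return True

    return [event for event in events if matches(event)]


# The question nobody planned for, asked after the fact.
failed = query(
    EVENTS,
    app="ios",
    os_version=lambda v: v.startswith("13."),
    action="load_seat_map",
    aircraft="777-300ER",
    destination="SYD",
    status="error",
)
affected_users = sorted({event["user_id"] for event in failed})
print(affected_users)
```

The design choice that matters is not the filter function but the data: because each event carries its full context, a new combination of dimensions can be queried without redeploying any instrumentation.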

To quote Charity again:

Observability requires methodical, iterative exploration of the evidence. You can’t just use your gut and a dashboard and leap to a conclusion. The system is too complicated, too messy, too unpredictable. Your gut remembers what caused yesterday’s outages, it cannot predict the cause of tomorrow’s.

For more detail, I refer you back to her post.

I am on a much higher plane of understanding of Observability today thanks to @mipsytipsy. Keep sharing the knowledge, Charity!

I would love to talk more about Observability, both with people learning about it, and those implementing it (or planning to do so).

What are your use cases?
What tools are you using/evaluating?
What business value do you expect to extract? How are you building the business case?
What organizational and cultural change do you need to become effective?

Happy to share more of my thoughts and experiences if you want to have a chat. Leave a comment below or hit me up at sanjeev@sdarchitect.blog.

— Sanjeev Sharma
Principal Analyst, Accelerated Strategies

3 Comments Add yours

Niladri Choudhuri says:

January 9, 2020 at 5:34 am

Great Article. This brings a point to my mind. Observability has to be architected.

