Distributed System Complexity: Part 1 – Engineering Abstraction

Cognitive overload is a brute fact of modern life. It is not going to disappear. In almost every facet of our work life, and in more and more of our domestic life, the jobs we need to do and the activity spaces we have in which to perform those jobs are ecologies saturated with overload. As technology increases the omnipresence of information, both of the pushed and pulled sort, the consequence for the workplace, so far, is that we are more overwhelmed. There is little reason to suppose this trend to change.

A Few Thoughts on Cognitive Overload, Intellectica (2000), by David Kirsh, Dept. of Cognitive Science, Univ. of California, San Diego

This quote is from a paper published in the year 2000. Now, 20 years later, its statements are even more true, especially for those of us developing, delivering and operating complex, distributed systems. Furthermore, we are delivering these systems today on cloud platforms, leveraging myriad cloud services, which in turn are complex, distributed systems themselves. The cognitive overload has truly become overwhelming.

It has become impossible for an individual to fully comprehend a complete system. It takes large teams of experts, each owning individual components of the system and understanding how those components work and interact, to jointly comprehend system behavior. Addressing outages and managing SLOs creates toil for Ops teams that is unmanageable without significant intelligent automation.

This complexity is even more pronounced in large enterprises. While startups and other ‘newer’ organizations may be building extremely complex, distributed systems, their greenfield nature gives them a clean canvas to build that complexity upon. They have but one technology stack; no matter how complex the code, it’s a very well-defined set of technologies to deal with. Very cute… Large enterprises, on the other hand, have multiple technology stacks. We as an industry are great at adopting new technologies, but suck at retiring old ones. Enterprises that grew through mergers and acquisitions have it the worst. I always joke that if I get off at the wrong floor at some of my consulting clients, it’s like walking into another company. One floor is building modern cloud-native apps, and the other floor looks like a data processing shop from the 90s. And that’s because it is. COBOL and CICS running on the mainframe coexist with microservices deployed to Kubernetes, talking to SAP running on AIX systems, calling Java code deployed to an out-of-support version of WebLogic. All living in a hybrid-cloud, multi-cloud environment. That’s a typical enterprise stack for you.

But all is not lost. There is a new set of technologies, a new set of industry ‘trends’, advancing towards addressing the complexity that causes the cognitive overload. As David Kirsh said two decades ago, cognitive overload is a fact of life. It is here to stay. In fact, it is increasing at an accelerating rate. We need to step away from the keyboard, look at the systems we are developing, and develop tools to cope with the complexity in order to manage the cognitive overload. But we also need to ask how we truly ‘grok’ the complexity of the systems we deal with, and how we reduce that complexity by introducing the right abstractions. The trends I see gathering steam are doing exactly that. They fall broadly into two areas:

  1. Abstracting complexity
  2. Grok-ing complexity

In this post I will look at Abstracting Complexity, and look at Grok-ing Complexity in part 2.

Abstracting Complexity

Ever since the advent of virtualization, there have been attempts to manage complexity by raising the level of abstraction of the underlying infrastructure. First came the automation of provisioning and configuring large sets of servers, or clusters of servers. Technologies like VMware, OpenStack, and Mesos were the first salvos in this effort to abstract beyond the single server as the atomic construct, to massive clusters of servers.

Enter Containers

The current salvo is being driven by the move to containers. Docker the technology won the container war, but Docker the company lost the abstraction war. Abstracting to a construct higher than the server required the ability to orchestrate large, very large, clusters of containers. That capability came from Kubernetes, which allowed us to manage containers at scale and move away from the server as the abstraction of infrastructure. For those using containers and the environments delivered by Kubernetes, the world became elegant. A developer no longer needed to worry about server types and ‘it worked on my machine’ issues. She found a wonderful world where she could deploy any service as containers, in any environment, from her laptop all the way to production, and have it scale elastically as she desired. New versions of services could be deployed seamlessly from one environment to the next without complex deployment processes and scripts, and without having to wait until everyone else was ready to deploy their services too. That being said, Kubernetes itself, elegant as its architecture may be, has added even more complexity for those running, managing and securing it in production. For the teams operating Kubernetes, this was a new level of cognitive overload. Just google ‘Hitler uses Kubernetes’ and you will see what I mean.

For those operating Kubernetes in production, however, there was always one challenge with the ‘abstraction’ promise of Kubernetes, or of containers in general. The primitive construct for deploying Kubernetes Pods was still the server. There was still the need to provision, configure and manage infrastructure, that is, the underlying servers upon which the pods run. This managing, securing, patching, etc. of the servers hosting Kubernetes pods kept the server as the lowest atomic level of abstraction that operations teams needed to work with.

New microVM technologies have enabled compute offerings that provide a ‘serverless’ infrastructure construct upon which Kubernetes pods can be deployed. AWS Fargate is Amazon’s offering in this space. It allows pods on EKS (AWS’s managed Kubernetes service) to run with compute provisioned elastically, based on the resource needs of the pods deployed, without the operations teams having to operate or manage any underlying servers. The lowest atomic construct now becomes the Kubernetes Pod.
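To make ‘compute provisioned based on the resource needs of the pods’ a little more concrete, here is a minimal Python sketch of the idea: add up what the containers in a pod request, then pick the smallest compute size that covers it. The Container class, the SIZES catalog and the size_for_pod function are all my own hypothetical names, and the sizes are illustrative numbers, not AWS’s published Fargate configurations or its actual sizing algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    name: str
    cpu_millicores: int  # requested CPU, where 1000 means one vCPU
    memory_mib: int      # requested memory in MiB


# Hypothetical catalog of (CPU millicores, memory MiB) sizes, smallest first.
# Illustrative values only; these are NOT AWS's published Fargate sizes.
SIZES = [(250, 512), (500, 1024), (1000, 2048), (2000, 4096), (4000, 8192)]


def size_for_pod(containers):
    """Return the smallest catalog size that covers the pod's summed requests."""
    cpu = sum(c.cpu_millicores for c in containers)
    mem = sum(c.memory_mib for c in containers)
    for size_cpu, size_mem in SIZES:
        if cpu <= size_cpu and mem <= size_mem:
            return size_cpu, size_mem
    raise ValueError("pod requests exceed the largest available size")


if __name__ == "__main__":
    pod = [Container("api", 300, 256), Container("sidecar", 100, 128)]
    print(size_for_pod(pod))  # (500, 1024)
```

The point of the sketch is what is missing from it: nowhere does the operator name, provision or patch a server. The pod’s resource requests are the only input.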

Serverless aka Function as a Service

Last but not least, there has been the advent of Serverless. Yes, the term is overloaded and means two different things in two consecutive paragraphs of this very blog post. What is meant by Serverless in this context is really ‘Function as a Service’ (FaaS). AWS Lambda was the first to launch here, allowing application code to be executed with absolutely no provisioning of compute resources, not even containers, and with billing in 100-millisecond increments. All major cloud vendors today offer some form of serverless, or Functions as a Service. I was with IBM Cloud when we launched IBM Cloud Functions, based on Apache OpenWhisk. Developing and deploying using OpenWhisk was Nirvana for me as a developer. I just needed to write the code, deploy and run. It is as if the servers weren’t even there (pun intended). But this abstraction away from the infrastructure instances, and from the complexity of managing them, comes at a price. As one would expect, serverless/FaaS is not a fit for all types of applications. Event-driven applications tend to be the best fit, while more complex systems, such as low-latency transaction systems and real-time systems, sit at the other end of the fit spectrum.
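For a feel of what ‘just write the code, deploy and run’ looks like, here is a minimal sketch of an event-driven function in Python. The handler(event, context) signature is the one AWS Lambda uses for Python functions; the event shape is loosely modeled on a queue-triggered batch of records, and the order_id and amount fields are hypothetical names I made up for illustration.

```python
import json


def handler(event, context):
    """Entry point the platform invokes once per incoming event."""
    # 'Records' with a JSON string 'body' loosely mirrors a queue-triggered
    # event; 'order_id' and 'amount' are hypothetical payload fields.
    records = event.get("Records", [])
    total = 0.0
    for record in records:
        order = json.loads(record["body"])
        total += float(order["amount"])
    return {"processed": len(records), "total_amount": total}


if __name__ == "__main__":
    # Local invocation with a hand-built event; no server, container or
    # cluster appears anywhere in the function's code.
    sample_event = {
        "Records": [
            {"body": json.dumps({"order_id": "a-1", "amount": 19.5})},
            {"body": json.dumps({"order_id": "a-2", "amount": 5.5})},
        ]
    }
    print(handler(sample_event, None))
```

The function holds no state between invocations and does nothing until an event arrives, which is a large part of why event-driven workloads map onto FaaS so naturally.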

The calculus of fit between abstraction level and application type is fairly straightforward. The higher one goes up the abstraction stack, the narrower the spectrum of applications the abstracted environment is suitable for. Anything can run on bare metal with no abstraction whatsoever. Pretty much anything can run on VMs too, if you can afford the ‘hypervisor tax’. But as we move to containers, and then to serverless, and the infrastructure becomes more abstracted away from the underlying hardware, the spectrum of applications, or really architecture types, that are a good fit becomes narrower.
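One way to picture this narrowing is as a set of nested sets: every level of abstraction suits a subset of what the level beneath it suits. The Python sketch below encodes that. The Level enum, the FITS table and the highest_fit function are hypothetical, and the workload categories assigned to each level are placeholders to show the shape of the argument, not a real assessment of what containers or VMs can handle.

```python
from enum import IntEnum


class Level(IntEnum):
    """Infrastructure abstraction levels, lowest to highest."""
    BARE_METAL = 0
    VM = 1
    CONTAINER = 2
    FAAS = 3


# Hypothetical fit table; the assignments are illustrative placeholders.
FITS = {
    Level.BARE_METAL: {"event-driven", "stateless-web",
                       "low-latency-transactions", "hard-real-time"},
    Level.VM: {"event-driven", "stateless-web", "low-latency-transactions"},
    Level.CONTAINER: {"event-driven", "stateless-web"},
    Level.FAAS: {"event-driven"},
}


def is_narrowing(fits):
    """True if each level suits a subset of what the level below it suits."""
    levels = sorted(fits)
    return all(fits[hi] <= fits[lo] for lo, hi in zip(levels, levels[1:]))


def highest_fit(workload):
    """Return the highest abstraction level whose fit set has the workload."""
    for level in sorted(FITS, reverse=True):
        if workload in FITS[level]:
            return level
    raise ValueError(f"no level fits workload type {workload!r}")


if __name__ == "__main__":
    print(is_narrowing(FITS))
    print(highest_fit("event-driven").name)
    print(highest_fit("hard-real-time").name)
```

In practice the table would be filled in per component by the architects who know the workload, which is exactly the evaluation the next section argues for.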

Architecting Complexity

Complex, distributed systems tend to have a mix of multiple architectures. The architects designing these systems should hence evaluate each application, or even each component, to determine the infrastructure abstraction level best suited to it. The goal for architects is the same as it has always been: to make architectural decisions by weighing the trade-offs, the pros and cons, of each architecture type. What we thankfully have today is a broad spectrum of options, which will probably get even broader over time (we did not even discuss PaaS in this post), allowing us to address the complexity, and the cognitive overload, inherent in complex, distributed systems.

In part two of this post, I will explore how to understand the behavior of such systems. We will take a look at Chaos Engineering, the discipline focused on grok-ing system complexity. Stay tuned.
