Multi-Cloud Snake Oil

tl;dr Most multi-cloud management solutions are pure snake oil. Buyer beware. And you most likely don’t even need Multi-cloud.

Let’s take the scenario of an online Flower vendor. They sell bouquets and arrangements of Flowers via e-commerce and deliver them across the country. They have two big events a year which account for around 92% of their annual revenues: Valentine’s Day and Mother’s Day. Sorry dudes, Father’s Day does not make the list. These two events, especially the 3 days leading up to each of them, are critical. An outage of their website of even a few minutes is impactful enough to cripple the profitability of the entire year.

Let’s take another scenario. This company is developing cutting-edge AI. They have a complex model that searches for the next generation of antibiotics for resistant bugs. Their model consumes large amounts of data. It also consumes massive amounts of GPU cycles. Once they start the number crunching, the model can run for days before any results are produced (or not). Most results are not useful, and the model needs to be run again and again, with fine-tuning every iteration. Every run costs hundreds of thousands of dollars for the GPU cycles used.

Third Scenario. This company is developing a virtual Appliance. This Appliance, when deployed on the client’s infrastructure, uses machine learning to learn network traffic behaviors, communication patterns between deployed services, and usage habits of users. It uses these learned patterns to identify malicious behavior in the system: when a service connects to another which it has never connected to before, or when a server opens a connection to another using a protocol it has never used before. It is an intelligent cybersecurity system. It needs to run on whatever infrastructure the client is using.

Multi Cloud

The above three scenarios are examples where a case for Multi-Cloud deployment can be made. The Flower Vendor needs a fail-over to not just another region, but another cloud to ensure 99.999% availability during their critical event days. The AI needs to deploy where the GPUs are the cheapest and be able to move to another cloud between runs, if the GPU prices or capabilities are more favorable. The cybersecurity appliance vendor needs to be able to support and run on every cloud to be able to deliver what its clients need. Most companies expressing the desire and intent to be Multi-Cloud, do not have such needs. The need to be Multi-cloud in most cases is Snake Oil being sold to them by Multi-Cloud solution vendors.
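
To put the five-nines figure in perspective, here is a rough sketch of the downtime budget it implies. The function name and the three-day window are illustrative assumptions on my part, not anything the flower vendor actually runs.

```python
def downtime_budget_seconds(availability: float, window_days: float) -> float:
    """Seconds of allowed downtime in a window at a given availability target."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - availability) * window_seconds


# Five nines over a full year versus over the 3 days before a big event.
per_year = downtime_budget_seconds(0.99999, 365)
per_event = downtime_budget_seconds(0.99999, 3)

print(f"per year: {per_year / 60:.1f} minutes")
print(f"per 3-day event window: {per_event:.1f} seconds")
```

At that target the budget for a three-day window is under three seconds, which is why an outage of even a few minutes is already far outside it.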

The Case for and against Multi-Cloud

What are the promises of Multi-Cloud? Why is it the buzzword of the times, from consultant pitches, to board meeting strategy roadmaps, to conference presentations? The allure is enticing. The promise is, well, promising. To be able to go where the compute is cheapest, to avoid vendor lock-in, to not be dependent on one vendor’s ability to deliver the resilience promised. These are all valid reasons to adopt a multi-cloud strategy. The reality is unfortunately far from being able to deliver on any of these aspirations. And it may never do so.

Let me make one thing clear. My goal is not to dash hopes and aspirations. Not to insult the ones pontificating on the need for multi-cloud. Not to bash vendors pitching multi-cloud solutions. Well, maybe these are my goals. But the true intent is to call out the emperor’s lack of clothes. Achieving multi-cloud deployment is not easy, and in most cases not possible. And most importantly, for most organizations, it is not needed. Going down this path blindly can be painful and very expensive.

At this point let me make my case by examining the two key selling points of multi-cloud one by one:

Portability:

AWS us-east-1, right up the street from me here in Northern Virginia, goes down. The flower vendor fails over to us-east-2 in nearby Ohio. No luck, also down. According to Twitter, AWS is having major issues in us-east-1. (The AWS status page is, as always, showing all green.) Everyone is failing over to other regions, causing a cascading impact across all regions. The flower vendor fails over to the Azure East US region. It’s also in Northern Virginia, but it is up. They are back in business. Minimal loss of revenue. Just a few terrified last-minute-shopping dudes having to consider the option of buying something more original than flowers for their Valentines.

This promise of portability is enticing. It is not an unreasonable expectation for such a drastic case. Not everyone, though, needs such contingency plans. Such cascading failures are rare. But the value of portability is great irrespective. What about the AI company in my example above? For them the cost of GPUs outweighs everything else. Or the ability to deploy to the latest GPU/TPU chipset. What if their current cloud vendor is not the cheapest, or the best GPU provider? They need to be able to move to another vendor. This is again rare. And as we will discuss later in this post, the cost of moving data from one cloud to the next has to be outweighed by the compute cost savings to make the move worth it.
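
The trade-off for the AI company can be sketched as a simple break-even check. This is a minimal illustration, not a costing model: the function name and every dollar figure below are hypothetical, and the one-time move cost is taken as a given input.

```python
import math
from typing import Optional


def runs_to_break_even(move_cost: float,
                       cost_per_run_here: float,
                       cost_per_run_there: float) -> Optional[int]:
    """Number of runs on the other cloud needed to recoup a one-time move cost.

    Returns None when the other cloud is not actually cheaper per run.
    """
    saving_per_run = cost_per_run_here - cost_per_run_there
    if saving_per_run <= 0:
        return None
    return math.ceil(move_cost / saving_per_run)


# Hypothetical figures: a $60,000 one-time cost to move the data set,
# $300,000 per run on the current cloud, $280,000 per run on the other.
print(runs_to_break_even(60_000, 300_000, 280_000))
```

If prices shift again before that many runs are done, the move never pays for itself.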

The third example – the cybersecurity appliance provider – is actually not one of portability. They just need to make an appliance instance for each cloud. They do not need to move the same appliance software from one cloud to the next. They just need to support multiple clouds, which is not the same as multi-cloud deployments.

So why is portability so hard? Three main reasons:

Data Gravity
Cloud Services
Migration Costs

Data Gravity: Moving data is extremely expensive. Most (all) cloud vendors charge hefty network egress costs; that is, moving data out costs money. As your data footprint grows, these costs become more prohibitive, making moving from one cloud to another expensive by the GB. Moving data from one cloud to another also takes time. A lot of time. Network bandwidth, no matter how high, limits the speed at which one can move data across clouds. The further apart the source and destination cloud regions, the more the pesky laws of physics add to that time. Faster-than-light data transfer is yet to be invented.
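
A back-of-the-envelope sketch makes both costs concrete. The price per GB, the link speed, and the efficiency factor here are placeholder assumptions for illustration, not any vendor’s actual rates; plug in your own numbers.

```python
def egress_cost_usd(data_gb: float, price_per_gb: float) -> float:
    """Egress charge for moving data_gb out of a cloud at a flat per-GB price."""
    return data_gb * price_per_gb


def transfer_time_hours(data_gb: float,
                        bandwidth_gbps: float,
                        efficiency: float = 0.8) -> float:
    """Hours to move data_gb over a link rated at bandwidth_gbps (gigabits/s).

    efficiency is an assumed fraction of the rated bandwidth actually achieved.
    """
    gigabits = data_gb * 8  # 1 byte = 8 bits
    seconds = gigabits / (bandwidth_gbps * efficiency)
    return seconds / 3600


# Hypothetical: 500 TB (500,000 GB) at $0.05 per GB over a 10 Gbps link.
data_gb = 500_000
print(f"egress cost: ${egress_cost_usd(data_gb, 0.05):,.0f}")
print(f"transfer time: {transfer_time_hours(data_gb, 10):.1f} hours")
```

Both numbers scale linearly with the size of the data set, which is the whole point: the bigger your footprint, the harder it is to leave.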

The alternative is to keep the data at one place, even on-premises, and just move the compute to the cheapest or best compute. This plays into the issue of network latency. Is the network latency from having the data at a distance from your application acceptable? Your application wants to be close to the data to limit latency related issues. Hence the term Data Gravity. The secondary impact here is the data that gets created as a result of your application running. Do you move that data back to the location of your core data store and pay the network egress costs, or leave the newly minted data on said cloud? Leaving that data on the cloud results in an island of data being created which will need to be addressed as you move to another cloud down the road.
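
To see why the application wants to sit next to its data, consider a rough sketch of a chatty request. The query count and the round-trip times are made-up illustrative values, and the model assumes each query must finish before the next one starts.

```python
def added_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Network wait time when each query must finish before the next starts."""
    return sequential_queries * round_trip_ms


# Hypothetical request that makes 20 sequential queries to its data store.
same_region = added_latency_ms(20, 1.0)    # data next to the application
cross_cloud = added_latency_ms(20, 40.0)   # data in another cloud or on-premises

print(f"same region: {same_region:.0f} ms of network wait per request")
print(f"cross cloud: {cross_cloud:.0f} ms of network wait per request")
```

The application code is identical in both cases; only the distance to the data changed.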

Cloud Services: Do you see the cloud as a set of cloud services (Infrastructure and Platform as a service) or do you see it as someone else’s data center? The true value of cloud is not in using it as a data center for rent, just moving expenses from CapEx to OpEx, with applications continuing to run and operate as they did on-premises. The true value lies in leveraging all the IaaS and PaaS services available on the cloud to maximize the value your application can get by running in the cloud. Why run and manage your own database when you can get a database as a service? Why run and manage your own Identity and Access Management service when you can get one as a service in the cloud? Why run and manage your own load balancer when you can get one as a service from the cloud? The hundreds of services offered by cloud vendors allow for the delivery of applications in a more resilient and efficient manner, reducing overhead on the application developers and the operations teams. The catch, however, is that these are all tentacles that attach themselves to your workloads, making them more and more dependent on these services, which are all vendor specific. Even those based on open technologies with claims of ‘standard’ APIs vary significantly from one cloud vendor to the next. The more services you use, the more vendor lock-in you have. The less ‘portable’ your applications become.

I worked with a client a few years ago who made the strategic decision to not use any cloud vendor provided service to ensure portability. They were never really using the cloud. They were just renting servers from AWS. They never allowed their application to truly leverage the cloud. They never saw the benefits they could have. They also never moved the application off AWS which was the original intent. They could never make a business case to make the move for reason #3.

Migration Cost: Migration is not free. I don’t just mean the cost of moving your development and management stack, code, and data over to another cloud (yes, the movement of data can be expensive, as we discussed above). I mean the cost of changing how you run and manage the application on the new cloud. You spent all that time changing over from an Ops team that had deep expertise in deploying, running and managing workloads on-premises to a team that had deep cloud skills. You retrained people. You hired people. You have a leaderboard of all the cloud certifications your team has earned. Now you want to replace them with skills for another cloud? Will your deeply AWS-skilled guys and gals hang around to earn the level 1 GCP certification and work their way up that ladder, or will they go polish the resume and get a job at another AWS shop? Same story for your developers who developed skills and expertise for your current cloud. Will they want to learn the nuances of developing for and deploying to another cloud? You built an application delivery pipeline that runs in the cloud. You achieved CI/CD. Now you want to tear this down and move to another cloud? Will the same tools work? Do they have the same features, capabilities, and maturity on the other cloud? Didn’t the vendor just start supporting that cloud?

Moving to another cloud means rebuilding your environments and your cloud management stack. You have to rebuild virtual machines, containers, repositories, storage buckets, VPCs, AZ/region deployments, redundancy and fail over architectures, DR plans, Accounts and sub accounts, and charge back mechanisms. Everything. And then do it again when you are ready to move once more. Why only move once if you want true portability?

Multi-Cloud Deployment:

The other promise coming from the proponents of multi-cloud is multi-cloud deployment. Deploying an application or workload to the cloud that is best fit for its needs. In terms of cost, services available, efficiency, network latency etc. The goal is then to manage them all from a central control plane no matter which cloud a particular service or application is deployed to. To handle the portability concerns I outlined above there are vendors who are selling solutions that offer an abstraction and a management layer which runs on multiple clouds. They then allow you to move workloads from one cloud to another with their solution managing the complexities, and facilitating deployment to any cloud of choice. And then allowing you to manage it all from one spot, handling the migration cost issues I outlined above. Brilliant, right? Snake oil!

You buy their solution. You are running on two public clouds. Their solution deploys an abstraction layer to both clouds. They allow you to manage it from one single control plane running on one of the clouds. You are up and running. Applications are deployed and in Prod. Their sales rep takes you golfing to some warm place. Invites you to deliver the keynote at their next user conference along with their CEO. But where is your data? Which cloud? Both, right? Do you have applications running on one cloud that need data that happens to be on the other? Do you move data from one cloud to the other as it changes? Do you just have the applications query data across clouds constantly? You forgot all about my opening thesis on data gravity. You forgot about network egress costs. Wait for your bills, from both clouds now, and look for the data transfer line item.

Wait, it gets worse. You have an issue. One of your applications has degraded performance. All SLOs are being tripped. Where is the problem? Is it in your application? Is it network latency to get data from the other cloud? Is it the multi-cloud solution’s abstraction layer? Is it an underlying cloud service they use? It’s the network, right? It’s always the network! But where in the network, and whose network? Network internal to the cloud? Network at the multi-cloud solution layer? Network at the control plane level? Network connectivity to the internet? Who manages that again? The cloud vendor or the multi-cloud solution? Let’s open a ticket with the cloud vendor. But they are not responsible for anything that’s running ON their cloud platform. They are only responsible for the SLOs of the services they provide. YOU are responsible for what runs on it. But what’s running on the cloud platform? Not your application. Your application is running on a multi-cloud abstraction layer. THAT solution is running on the cloud platform. Its vendor needs to open a ticket with the cloud vendor, and wait in line with all the other commoners (customers) for the cloud vendor to service the request. Even then we may not know where the problem is if the cloud vendor has all its services at green status. Who owns the SLO now? Who do you escalate to? @Quinnypig? Can you talk about this at the keynote?

‘R’ as in RACI

Sure. I will get my comments and Twitter responses filled with counterarguments of why it is not so. But I have lived this. It is actually worse. This is what the multi-cloud reality looks like. A client I worked with, a large global enterprise, had deployed their application on two public clouds. They had a multi-cloud solution deployed on both, with their applications running on it. But when things went south there weren’t just two vendors trying to get to the root cause of the issue. There were many more. They had a managed service provider owning the infrastructure and application management. They had another vendor who actually developed the applications in question (luckily only one application development service provider). Another vendor owned the network. Their own team ran the SOC. When the service started degrading in performance, all these teams were on an emergency call. Let’s count them out:

Client application owner(s)
Cloud Vendor 1
Cloud Vendor 2
Multi-cloud solution vendor (The vendor was one of the cloud vendors itself, but a different business unit from the public cloud operations)
Application development vendor
Managed services vendor
Network vendor
Client Security team (different from the client application owner; included because there was the possibility the issue was due to a breach. A DDoS attack had already been eliminated)

I would rather get an emergency root canal done than live through that again. At least the dentist numbs you. One of my colleagues and I were managing (or really failing to manage) the mess. We brought out the RACI document that had been developed to see who owned what. At over 200 rows and 8 columns, it was totally inadequate. We were counting on the monitoring regime deployed by the multi-cloud solution to provide observability data. It gave us data that was limited to their part of the stack. Both cloud vendors blamed the multi-cloud solution. For one cloud vendor it was their own solution, but even they blamed it. They actually just walked out the moment they heard the application in question was running on the other cloud, despite it having a dependence on applications running on their cloud. The cloud vendor hosting the application declined to escalate any tickets opened (these were account team people, not an incident response team). They did not see this as their problem unless it was proven it was theirs, especially given it was a competitor’s multi-cloud solution running on their platform. The managed service provider was blaming everyone else. Their runbooks had already been walked through. But Root Cause Analysis (RCA) only works when there is visibility across the entire stack(s). The application development vendor had not made a deployment in over two weeks, so they were claiming ‘innocent bystander’ status. The network vendor was showing their status page, claiming all green too. Each vendor only had responsibility in their limited area and had no reason to take responsibility outside of that space. They all just blamed the multi-cloud solution, and its vendor could not prove otherwise, given their dependence on multiple underlying cloud services. It was an epic failure of the DevOps way.

(To end the mystery of the situation above: it WAS the network.) In conclusion, multi-cloud is hard. It is expensive. No one in the Enterprise world has teams with the breadth and depth of skills to do it. Building teams to support even two clouds really well is difficult. The solutions that promise multi-cloud management are immature at best; their PowerPoint is way ahead of their code. Most are pure snake oil.

And most importantly, you most likely don’t need to go down the Multi-Cloud path. If you think you do, step back and forget the presentation you saw at the last conference which had a killer case study on multi-cloud adoption. Take a deep breath and explain to me why you can’t do without multi-cloud. Yes, you will deploy independent applications to multiple clouds. But they will remain there. They will be run and managed there. They will be mostly isolated from each other. That is adopting multiple clouds, not multi-cloud. It’s like adopting both Linux and Windows in the old days. That is not the multi-cloud pitch you have been sold. You don’t have a need for that.

Sanjeev Sharma, Principal Analyst, Accelerated Strategies

Would you like to talk about your experiences with Multi-cloud adoption? Would you like to get some guidance on how to address your challenges? Would you like me to talk to your leadership team to walk them off the multi-cloud path? You can request a time to talk to me.
Sanjeev

Home: sdarchitect.blog
