Adopting DevOps – Part V: Addressing Data Friction

About 18 months ago (mid-2017), I started seeing a new trend in the DevOps adoption engagements my colleagues and I were driving. (I was with IBM at the time as their global CTO of DevOps Adoption, leading the DevOps adoption practice that spanned IBM's business units.) Clients we were working with were beginning to ask for help with the bottlenecks in their delivery pipelines caused by the difficulty of delivering data to non-production environments in a timely manner. Provisioning data was still taking days, creating severe delays in their otherwise highly automated Continuous Integration and Continuous Delivery (CI/CD) pipelines. ‘Data Friction’ had become the bottleneck they needed to address next. It is important to note that these clients were mature DevOps adopters. They had already addressed the first few key technology, process and organizational bottlenecks, especially those related to application delivery and infrastructure provisioning. They could provision any environment on demand and deploy their application or (micro)service to it, via self-service by the practitioner needing to deploy. They struggled, however, to get the right data to the right environment when they needed it.

This is not atypical. Most organizations start their CI/CD journey toward ‘flow’ in the delivery pipeline by automating the deployment of application code and the provisioning of environments. Some even adopt push-button ‘full-stack’ provisioning for every deployment (containers really help here, but that is a discussion for another post). Data provisioning as a part of the environment is typically the least automated set of steps. There is, of course, no need to provision a database with production-like test data within minutes if provisioning the infrastructure, configuring the middleware (including the database software or data service) or deploying the application takes days or weeks. Even when we speak of automated full-stack deployment, the ‘full stack’ typically includes only a blank database instance as part of the middleware; it does not include the automation to get data into that instance. Provisioning the test data is typically a separate step owned by the DBAs who control, manage, secure and provision data to the deployed databases. That is a problem. The DBA team cannot be left as a ‘silo’ apart from the rest of the teams. One cannot achieve true CI/CD ‘flow’ without addressing data provisioning: getting the right data to the right environment, accessible by the right practitioners, at the right time, in a secure and compliant manner.

Full disclosure – I currently work for Delphix, which provides industry-leading solutions in the DataOps space.

The Data ‘Silo’

There are several reasons the data ‘silo’ has been left alone to date in most organizations. Foremost among them are data security and compliance, and data storage costs. As developers adopt CI/CD, more and more builds get delivered, and testers and QA practitioners need to run more and more tests. They all need more dev and test environments, each needing its own instance of test data. The amount of data that needs to be stored and secured in these non-production environments becomes dramatically higher. For an organization that is high on the DevOps adoption curve, it is not unusual to see dozens of non-prod data instances for each production database. I recently visited a client who has adopted CI/CD and is actively addressing the data challenges we are discussing. For the 52 production databases in one of their business units, they had 3,092 non-prod data instances provisioned on that day. That is approximately 60 non-prod data instances for each production database. And they expect this to go up at least 10x as they further improve the automation in their application delivery pipeline, giving each dev-test practitioner their own data instance to work against, and then even allowing them to branch datasets in these data instances in line with their development branching. Think one test data instance for each Git branch, for each developer and tester. All these instances need to be stored and secured. These are exactly the reasons why organizations cannot achieve this without introducing the right technology, processes, and culture. If each of the production databases in my client’s scenario needed just a 100 GB subset of data for testing, the 3,000+ non-production data instances would require over 300 TB of storage. Obviously not an option. It gets even worse when you consider real database sizes and scale that across business units to hundreds of databases in the enterprise with hundreds of TB of data each.
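The arithmetic behind these figures is easy to reproduce. The short Python sketch below uses only the numbers quoted above (52 production databases, 3,092 non-prod instances, a 100 GB test subset per instance, and the expected 10x growth). It assumes every instance is stored as a full copy and that 1 TB = 1,000 GB; the function and constant names are illustrative.

```python
# Figures quoted in the post.
PROD_DATABASES = 52
NON_PROD_INSTANCES = 3_092
SUBSET_GB_PER_INSTANCE = 100   # the post's illustrative test-subset size
EXPECTED_GROWTH = 10           # "at least 10x"


def instances_per_prod_db(instances: int, prod_dbs: int) -> float:
    """Average number of non-prod instances per production database."""
    return instances / prod_dbs


def full_copy_storage_tb(instances: int, gb_per_instance: float) -> float:
    """Storage needed if every instance is a full copy (1 TB = 1,000 GB)."""
    return instances * gb_per_instance / 1000


ratio = instances_per_prod_db(NON_PROD_INSTANCES, PROD_DATABASES)
today_tb = full_copy_storage_tb(NON_PROD_INSTANCES, SUBSET_GB_PER_INSTANCE)
projected_tb = full_copy_storage_tb(
    NON_PROD_INSTANCES * EXPECTED_GROWTH, SUBSET_GB_PER_INSTANCE
)

print(f"{ratio:.1f} non-prod instances per production database")
print(f"{today_tb:.1f} TB of storage today with full copies")
print(f"{projected_tb:.1f} TB after {EXPECTED_GROWTH}x growth")
```

Under these assumptions the result is roughly 59.5 instances per production database and about 309 TB today, growing to about 3 PB at 10x.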

Then there is the time and DBA effort it would take to keep all these databases refreshed and up to date with current data to make tests viable. These data instances will also need to be regularly ‘rewound’ to their initial state after every destructive test executed against them. This is a far cry from most organizations, where multiple dev-test teams share a single test database that was last refreshed weeks ago, and keep tripping over each other when using it. In parallel, there is also the need to manage and maintain the various versions of the schemas, which need to be versioned as they evolve with the code.

The security and compliance challenges also amplify as the number of data instances grows, and the need to secure all this data in non-prod environments becomes a real blocker. Who can have access to which data set? Which environments can they provision the data to? Are there regulatory compliance constraints that vary by the classification or risk sensitivity of the data being provisioned, or even by the geographical location of the environment the data is being provisioned in? To compound the challenge further, non-prod environments typically have lower levels of security hardening than production environments and, even when hardened against intrusion, provide the proverbial ‘keys to the kingdom’ to the dev-test practitioners who need these environments and access to the data. When the exposed surface area is, say, 60x larger, it presents a massive challenge.
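One way to make these questions concrete is to express them as rules that are evaluated before any data is provisioned. The Python sketch below is only an illustration of that idea, not any product's policy model: the classifications, roles, environments, regions and the `violations` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical rule attached to one data classification."""
    allowed_roles: frozenset
    allowed_environments: frozenset
    allowed_regions: frozenset
    requires_masking: bool


@dataclass(frozen=True)
class ProvisionRequest:
    """Hypothetical request to provision a data set into an environment."""
    role: str
    classification: str
    environment: str
    region: str
    masked: bool


# Illustrative policy table; a real one would come from the governance team.
POLICIES = {
    "internal": PolicyRule(
        frozenset({"developer", "tester"}),
        frozenset({"dev", "test", "staging"}),
        frozenset({"eu", "us"}),
        requires_masking=False,
    ),
    "pii": PolicyRule(
        frozenset({"tester"}),
        frozenset({"test"}),
        frozenset({"eu"}),
        requires_masking=True,
    ),
}


def violations(request, policies=POLICIES):
    """Return the list of policy violations; an empty list means allowed."""
    rule = policies.get(request.classification)
    if rule is None:
        # Deny by default when a classification has no policy.
        return [f"no policy defined for classification '{request.classification}'"]
    found = []
    if request.role not in rule.allowed_roles:
        found.append(f"role '{request.role}' may not access this data")
    if request.environment not in rule.allowed_environments:
        found.append(f"environment '{request.environment}' is not permitted")
    if request.region not in rule.allowed_regions:
        found.append(f"region '{request.region}' is not permitted")
    if rule.requires_masking and not request.masked:
        found.append("data must be masked before provisioning")
    return found


allowed = ProvisionRequest("tester", "pii", "test", "eu", masked=True)
blocked = ProvisionRequest("developer", "pii", "dev", "us", masked=False)
print(violations(allowed))
print(violations(blocked))
```

The design choice worth noting is deny-by-default: a data set whose classification has no rule is rejected rather than silently allowed.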

Addressing ‘Data Friction’

The solution is to bring all the practitioners who manage, govern and secure data (Data Analysts, DBAs, Security, etc.) into the DevOps fold. Make them a part of the DevOps adoption initiatives, no different from the Dev, Test, Sec and Ops practitioners already included, and help them adopt DevOps practices. From the process and culture perspective, this will get them on the path to becoming more agile and reducing the ‘impedance mismatch’ between their processes and those of other teams already adopting DevOps practices. They will start thinking in terms of addressing ‘Data Friction’. They will become a core part of the team of practitioners all jointly achieving ‘flow’ in the delivery pipeline.

To do this they will also need modern data management toolsets in their arsenal: toolsets designed with DevOps practices in mind. Such a toolset needs to have a few core capabilities:

  1. Take data from multiple data sources and database types and provision ‘virtual’ instances of that data to any non-production environment in the delivery pipeline, via self-service by the practitioners who need the data, as and when they need it
  2. Secure the data and keep it compliant with regulatory and corporate controls, via policy- and rule-based governance and data masking
  3. Have an API-driven self-service interface that allows full integration into the CI/CD pipeline automation framework of choice, and gives practitioners the capabilities to manage, control, and collaborate around the data no differently than they do for code
  4. Have a single pane of control for management and governance of all data instances under management, and all the virtual instances provisioned
  5. Be fast. No more waiting for data provisioning that takes days or hours. Be able to operate at the speed of the CI/CD tools in the pipeline
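To illustrate how masking, ‘virtual’ instances and rewinding might fit together, here is a minimal in-memory Python sketch. It assumes copy-on-write as the mechanism that makes virtual instances cheap, which the list above does not specify, and every class and function name in it (`MaskedSource`, `VirtualInstance`, `mask_email`) is hypothetical rather than taken from any real tool. The hash-based masking is a stand-in only; a real masking tool would do considerably more.

```python
import hashlib


def mask_email(value: str) -> str:
    """Replace an email with a deterministic pseudonym (illustrative only)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return f"user_{digest}@example.invalid"


class VirtualInstance:
    """Copy-on-write view: reads fall through to a shared base,
    writes and deletes are kept private to this instance."""

    def __init__(self, base: dict):
        self._base = base        # shared, never modified by an instance
        self._overlay = {}       # rows written by this instance
        self._deleted = set()    # row ids deleted by this instance

    def read(self, row_id):
        if row_id in self._deleted:
            return None
        row = self._overlay.get(row_id, self._base.get(row_id))
        return dict(row) if row is not None else None  # copy protects the base

    def write(self, row_id, row: dict) -> None:
        self._deleted.discard(row_id)
        self._overlay[row_id] = dict(row)

    def delete(self, row_id) -> None:
        self._overlay.pop(row_id, None)
        self._deleted.add(row_id)

    def rewind(self) -> None:
        """Discard private changes, returning to the initial state."""
        self._overlay.clear()
        self._deleted.clear()


class MaskedSource:
    """Masks the sensitive columns once, then hands out virtual instances."""

    def __init__(self, rows: dict, maskers: dict):
        self._rows = {
            row_id: {
                col: maskers[col](val) if col in maskers else val
                for col, val in row.items()
            }
            for row_id, row in rows.items()
        }

    def provision(self) -> VirtualInstance:
        return VirtualInstance(self._rows)


source = MaskedSource(
    {1: {"name": "Ada", "email": "ada@corp.test"}},
    {"email": mask_email},
)
dev, qa = source.provision(), source.provision()
dev.delete(1)              # a destructive test in one instance...
print(dev.read(1))         # ...removes the row for dev only
print(qa.read(1))          # qa still sees the masked row
dev.rewind()               # back to the initial state
print(dev.read(1))
```

In this sketch provisioning and rewinding cost almost nothing because no rows are copied up front, which is the property that lets data keep pace with the rest of the pipeline.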

Addressing the friction that data causes in delivery is the next step up the DevOps maturity curve, and organizations are taking it now. Data access can no longer be the bottleneck slowing down the delivery pipeline. Addressing Data Friction is table stakes for achieving CI/CD ‘flow’ in the application delivery pipeline.

