13 Apr 2022 10:00am, by Nati Shalom
Nati Shalom is the CTO and founder of Cloudify.io. He is a serial entrepreneur and widely published thought leader and speaker in open source, multicloud orchestration, network virtualization, DevOps, and edge computing. Nati has received multiple recognitions including YCombinator and is one of the leaders of the Cloud Native and DevOps Israel Meetup groups.
Many modern software applications are highly distributed, meaning they run on multiple and often diverse infrastructure environments at the same time.
Typically, these highly distributed applications share data that is spread out geographically. Well-known, high-end use cases include email and the internet, telephone and cellular networks, aircraft control systems, ride-share dispatch systems, and systems that track inventory in your local big-box store.
Today many modern SaaS-based offerings are highly distributed as well, because they need to ensure low-latency access for users distributed around the globe (as in the case of Zoom or Netflix) or meet regulatory requirements such as GDPR that prohibit customer data from crossing specific geographical boundaries.
All of this makes distributed architecture more common and even mainstream. In fact, most SaaS companies today manage their data across multiple sites and regions and use hybrid SaaS architectures where some services run on-premises and some in shared cloud offerings.
Unfortunately, most of our current DevOps automation tools were not designed to support such distributed and hybrid architectures, and that leads many software companies and enterprises to build custom frameworks and processes to deal with the specific challenges imposed by those architectures.
When all of your IT infrastructure runs in one place, you have fewer moving parts to coordinate than in a distributed system, where you may have dozens, hundreds, thousands, or even millions of endpoints where your software needs to run correctly. In distributed systems, the complexity of IT operations is not only multiplied by the sheer number of endpoints but also compounded because everything needs to work together.
Each physical and virtual environment upon which an application runs in a distributed environment can incur its own operational challenges, ranging from common to extreme. Commonly, for example, slight variations in infrastructure can cause the system to drift beyond original configurations. Even something as simple as a change in a security group can throw things off, causing a previously accessible port to be unresponsive.
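The security-group example can be sketched as a comparison between the settings a team recorded as desired and the settings a site currently reports. This is a minimal illustration in plain Python, not the method of any particular tool; the `detect_drift` helper and the setting names are hypothetical.

```python
from typing import Any


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (desired_value, actual_value)} for every setting that differs."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical security-group settings for one site.
desired_sg = {"open_ports": [443, 8080], "protocol": "tcp"}
# Someone closed port 8080 outside the deployment process.
actual_sg = {"open_ports": [443], "protocol": "tcp"}

drift = detect_drift(desired_sg, actual_sg)
for setting, (want, have) in drift.items():
    print(f"drift in {setting}: expected {want}, found {have}")
```

A check like this only helps if the desired settings are stored somewhere other than the deployment script itself, which is the point the rest of this article builds toward.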
Infrastructure drift is much more likely in a distributed system than in a centralized system, primarily because there are more moving parts and the parts are evolving independently of one another. Moreover, each endpoint in a distributed system can be subject to extreme challenges, such as weather events that cause massive power failures or damage to physical equipment or facilities.
How, then, can you manage continuous deployment to multiple endpoint environments when those environments are constantly changing? Unfortunately, this is the reality of distributed systems, and it’s why the job of deploying and updating highly distributed applications is fraught with potential failures.
The Day 2 Challenges of Managing Distributed Environments at Scale
Let’s look at a simple example. Suppose your company has two offices, one in the east and one in the west, and each runs your customer relationship management (CRM) software on its own servers to reduce lag and comply with data sovereignty regulations.
Now suppose your IT team wants to push a software update to the CRM application. An update needs to go to the east and to the west. What happens if it fails in the east but succeeds in the west? You’ll need to push the update again, but only in the east. You’ll need to write a completely different deployment process for this partial failure scenario.
Of course, the example above is pretty simple. So, let’s scale it to something more realistic for today’s enterprises — multisite deployments.
In this case, the DevOps team is running a CI/CD pipeline and an update is ready to be deployed to 10-50 Kubernetes clusters across multiple regions. The deployment process within the CI/CD pipeline is a task-based system, where code is written and deployed under the assumption that everything downstream is running as it should.
Ideally, you write tasks, and the system executes the tasks in the order you prescribe. But what if something is off-kilter? Then the process doesn’t work. So if you deploy to 10 Kubernetes clusters, how do you know if the update is successful at each location? Is it running? Is it not running? Which part is running? Which part failed?
Let’s take it even further: What if the update fails in three of those places? How do you determine why it failed? (The cause could be different in each location.) Is there drift? If you don’t know, how can you successfully update into an environment of an unknown state? How do you continue the update process from the point of failure? You don’t want to have to update all 10 sites over again; you only want to update the three that failed. How do you roll back in case of a major defect?
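One way to picture the "only update the three that failed" requirement is to record an outcome per site and feed only the failures into the next attempt. The sketch below assumes a hypothetical `deploy` callable and made-up cluster names; it is a simplified illustration, not how any specific CI/CD product behaves.

```python
from typing import Callable


def deploy_to_sites(sites: list[str], deploy: Callable[[str], None]) -> dict[str, str]:
    """Run deploy(site) for each site and record 'ok' or the failure reason."""
    results = {}
    for site in sites:
        try:
            deploy(site)
            results[site] = "ok"
        except Exception as exc:
            results[site] = f"failed: {exc}"
    return results


def failed_sites(results: dict[str, str]) -> list[str]:
    """Return the sites whose last recorded outcome was not 'ok'."""
    return [site for site, status in results.items() if status != "ok"]


# Hypothetical deploy function: three clusters fail on their first attempt only.
attempts: dict[str, int] = {}


def flaky_deploy(site: str) -> None:
    attempts[site] = attempts.get(site, 0) + 1
    if site in {"cluster-3", "cluster-7", "cluster-9"} and attempts[site] == 1:
        raise RuntimeError("node pool unreachable")


sites = [f"cluster-{i}" for i in range(1, 11)]
first = deploy_to_sites(sites, flaky_deploy)
# Retry only the sites that failed; the seven healthy sites are left alone.
retry = deploy_to_sites(failed_sites(first), flaky_deploy)
```

Even this small version needs state that a purely task-based pipeline does not keep: the per-site outcome of the previous run.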
Complexity, Driven by Edge-First Environments
As you can see, as modern DevOps teams strive to rapidly innovate and push software updates multiple times a day in highly distributed systems, the challenges mount.
Most teams have insufficient insight into the current environment at each endpoint; therefore, failures take time to investigate, and often unique tweaks and fixes are needed to handle each change in the state of the distributed system.
That’s why DevOps engineers are doing so much hand-coding. Engineers are finding they must stop the normal CI/CD flow, investigate what part of an endpoint infrastructure is not running, and then make manual tweaks to the software and deployment code to compensate for the change.
Here’s the thing: there will always be changes to the system. Infrastructure environments never stay static, and therefore a lot of “continuous deployment” systems aren’t really continuous at all. Because DevOps engineers don’t always know the state of each endpoint environment in a distributed system, the CI/CD pipeline can’t possibly be adaptive enough.
In the end, the process of ensuring continuous deployment in distributed environments can be extremely burdensome and complicated, slowing the pace of business innovation.
So, how can DevOps teams efficiently manage continuous deployment and software updates across such highly distributed environments without being swamped by Day 2 issues?
As edge-first implementations become more mainstream, forward-thinking organizations should consider using open source solutions that are more suited to distributed environments.
One such open source project is Cloudify.
Removing Deployment Complexity in Distributed Environments
With open source Cloudify, DevOps engineers can offload the complexity of managing the state of distributed environments. The left half of Figure 1 (below) shows two parts of a pipeline: one that deals with application development and testing, and another that is responsible for creating the environments in which the software runs, including development, test/QA, and production environments.
It’s this second part, provisioning and maintaining the environments, that is offloaded to Cloudify. The DevOps team sets up its environments exactly the way it wants them, and the software then manages those environments continuously (as depicted on the right half of Figure 1).
A Closer Look: Environment-as-a-Service in Your CI/CD Pipeline
Commonly, a CI/CD workflow is written under the assumption that the infrastructure maintains its originally configured state. Unfortunately, if the state changes, the workflow breaks. Users then need to keep updating their workflow operations to handle changes in state. If that updating and revising is required too frequently, the entire CI/CD workflow becomes more manual than automated, which defeats the purpose of CI/CD in the first place.
Cloudify instead assumes the entropy of infrastructure: left to its own devices, infrastructure will always drift over time, especially in highly distributed environments.
To solve this problem, it uses a declarative approach that separates the state of an environment from the workflow. Environment-as-a-Service (EaaS) technology keeps track of the state of each environment and how it changes over time. Within each environment, it knows all the components (compute, storage, networking), how they are configured, and how they relate to one another.
As time goes by, the software monitors and detects drift and provides built-in workflows to automatically fix some of the common drift scenarios. The software feeds information about the current state of each environment back to the workflow, thus allowing the workflow to be adaptive to change.
In addition, Cloudify uses a transactional workflow mechanism that continuously tracks the state of execution and can therefore resume failed workflows or trigger a rollback workflow.
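The general idea of a workflow that tracks its own execution state can be sketched as follows. This is a simplified illustration of the concept in plain Python, not Cloudify's implementation or API; the `Step` and `Workflow` classes and the step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    undo: Callable[[], None]


@dataclass
class Workflow:
    steps: list[Step]
    # Execution state: names of the steps that have finished successfully.
    completed: list[str] = field(default_factory=list)

    def execute(self) -> bool:
        """Run every step not yet completed; stop at the first failure."""
        for step in self.steps:
            if step.name in self.completed:
                continue  # already done in an earlier attempt
            try:
                step.run()
            except Exception:
                return False
            self.completed.append(step.name)
        return True

    def rollback(self) -> None:
        """Undo completed steps in reverse order."""
        by_name = {step.name: step for step in self.steps}
        while self.completed:
            by_name[self.completed.pop()].undo()


log: list[str] = []
network_up = {"value": False}


def push_image() -> None:
    if not network_up["value"]:
        raise ConnectionError("registry unreachable")
    log.append("push_image")


wf = Workflow([
    Step("create_namespace", lambda: log.append("create_namespace"),
         lambda: log.append("undo create_namespace")),
    Step("push_image", push_image, lambda: log.append("undo push_image")),
    Step("restart_pods", lambda: log.append("restart_pods"),
         lambda: log.append("undo restart_pods")),
])

first_ok = wf.execute()      # stops at push_image
network_up["value"] = True
second_ok = wf.execute()     # resumes at push_image; create_namespace is not repeated
wf.rollback()                # undoes all three steps, last one first
```

Because the completed steps are recorded outside the step functions, a second call to `execute` picks up at the point of failure instead of starting over.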
In the version 6 release, this mechanism was extended to handle bulk operations in distributed environments. The software can execute workflows in parallel in hundreds or thousands of environments or sites simultaneously and ensure successful execution of those workflows, even in cases where there has been a network outage or failure of some part of the system.
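To make the bulk case concrete, here is a rough sketch of running one workflow against many sites in parallel while collecting a per-site outcome. It uses Python's standard `concurrent.futures` module and is only an illustration of the pattern; `run_on_all_sites`, `update_site`, and the site names are hypothetical and do not reflect how Cloudify schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def run_on_all_sites(
    sites: list[str],
    workflow: Callable[[str], None],
    max_workers: int = 8,
) -> dict[str, str]:
    """Run workflow(site) concurrently and record 'ok' or the failure per site."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(workflow, site): site for site in sites}
        for future in as_completed(futures):
            site = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                results[site] = "ok"
            except Exception as exc:
                results[site] = f"failed: {exc}"
    return results


# Hypothetical workflow: one site is behind a network outage.
def update_site(site: str) -> None:
    if site == "site-042":
        raise ConnectionError("network outage")


sites = [f"site-{i:03d}" for i in range(100)]
results = run_on_all_sites(sites, update_site)
failed = sorted(site for site, status in results.items() if status != "ok")
print(f"{len(results) - len(failed)} succeeded, failed: {failed}")
```

A failure at one site is captured in the results rather than aborting the whole run, so the remaining sites still complete and the failed ones can be retried later.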
All of this is visible in Cloudify’s map view (Figure 2), which allows users to see which services are running per location as well as get the state of the overall cluster at a glance.
The map can handle thousands of deployments in thousands of locations. In addition, a new Deployment view lets users switch between the map and table views and execute operations, such as Day 2 workflows or deployment (provisioning), on the cluster from that view.
These are just a few of the capabilities that open source Cloudify offers to SaaS companies looking for a robust enabler of multisite management and highly distributed computing.
It’s All about Improving the DevOps Experience
The main goal of the Cloudify community is to abstract away a large part of the complexity that burdens DevOps engineers and IT operators; specifically, we’re trying to make the job of managing distributed systems as similar as possible to running any other cloud service.
The idea is that DevOps engineers should be able to push code into Git that describes the desired end state of their system, offloading to the platform both the job of figuring out the delta between the current state and the end state and the work needed to keep the CI/CD pipeline running at a rapid pace.
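As a rough picture of what "figuring out the delta" means, the sketch below compares a desired end state with a current state and produces a list of actions. It is a generic illustration under stated assumptions, not Cloudify's algorithm; the `plan_changes` function and the component names are hypothetical.

```python
def plan_changes(
    desired: dict[str, dict],
    current: dict[str, dict],
) -> list[tuple[str, str]]:
    """Return (action, component) pairs needed to move current toward desired."""
    plan = []
    for name in sorted(desired):
        if name not in current:
            plan.append(("create", name))
        elif desired[name] != current[name]:
            plan.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            plan.append(("delete", name))
    return plan


# Desired end state, as it might be described in a file committed to Git.
desired = {
    "web": {"replicas": 3, "image": "crm:2.1"},
    "db": {"size": "large"},
    "cache": {"size": "small"},
}
# Current state, as reported by one environment.
current = {
    "web": {"replicas": 3, "image": "crm:2.0"},
    "db": {"size": "large"},
    "legacy-proxy": {"port": 8080},
}

plan = plan_changes(desired, current)
for action, component in plan:
    print(action, component)
```

The engineer edits only the desired state; the plan is recomputed against whatever each environment currently looks like, which is why the same commit can produce different actions at different sites.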
As an added bonus, whether you’re using Cloudify to manage a dozen environments for your in-house development team or billions of endpoints in an advanced edge use case, the environment-as-a-service capability synchronizes the efforts of DevOps and IT management and can help your team break down some of the biggest silos that exist in enterprises today.