There are a lot of moving parts to cloud-native, some of which (like state control) I’ve already blogged on, and some of which have received considerable attention in the media overall. If you look at the big picture, you find the strategies for achieving cloud-native behavior fall into two groups. One is the software design side, with the stateless-microservice discussions, and the other is on the operations side with things like orchestration (Kubernetes).
An operator and an enterprise contact both asked me to talk a little about two other issues, ones that I’ve neglected but are still important. It occurred to me that these two other issues really give us an opportunity to consider the totality of cloud-native. They’re concurrency and persistence, and not only are they fundamental to cloud-native concepts, they might even unite operations and development.
“Concurrency” is the concept of running elements of an application in parallel instead of in sequence. This has been an issue for decades because it’s often necessary to suspend or “wait” an application’s execution while something external, like an I/O operation, is handled. More recently, network-centric applications have used concurrency to make better use of processor/memory resources by handling things that aren’t really tightly connected in parallel. The Java programming language illustrates a common approach to concurrency: threads. A thread is a parallel execution path that runs alongside the rest of the program, but threads are programmatic accommodations for concurrency within a single logic set.
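To make the thread idea concrete, here’s a minimal Java sketch of programmatic concurrency; the sleep is just a stand-in for a blocking I/O wait, and everything else is illustrative.

```java
// Minimal sketch: overlapping a slow "I/O-like" task with other work using a thread.
public class ThreadSketch {
    public static void main(String[] args) throws InterruptedException {
        // The "slow" task runs in parallel with the main logic below.
        Thread ioTask = new Thread(() -> {
            try {
                Thread.sleep(1000);            // stand-in for a blocking I/O wait
                System.out.println("I/O task finished");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        ioTask.start();

        System.out.println("Main logic keeps running while the I/O task waits");
        ioTask.join();                          // re-synchronize before exiting
    }
}
```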
In the cloud, concurrency is a natural property of multi-component operation. If an application consists of multiple, independently hosted, components, it’s almost certain that the components run asynchronously and concurrently. In fact, if you have “components” that never run at the same time, I’d argue they shouldn’t be separate components at all.
If a separately hosted, cloud-native component truly runs concurrently, then besides the usual cloud-native state management and related issues, you have to consider what concurrently operating components mean for workflows and design. Grid/parallel computing and some big data tools offer an example of this challenge.
Grid/parallel computing presumes that you can take massive computational tasks and divide them up among multiple systems for execution. The key to making it work is to find computational tasks that are independent until their results are finally correlated for delivery. In short, you have to be able to separate the overall task into a set of autonomous subtasks. You then farm those subtasks out, and make something responsible for a final summarization of results.
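A minimal Java sketch of that pattern might look like the following, with thread-pool workers standing in for separate systems and a trivial sum standing in for the real computation; the names and the division strategy are purely illustrative.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the grid/parallel pattern: split an independent computation
// into autonomous subtasks, farm them out, then summarize the results.
public class GridSketch {
    public static void main(String[] args) throws Exception {
        long[] data = new long[1_000_000];
        Arrays.fill(data, 1L);

        int workers = 4;                         // stand-ins for separate systems
        int chunk = data.length / workers;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> partials = new ArrayList<>();

        for (int w = 0; w < workers; w++) {
            int from = w * chunk;
            int to = (w == workers - 1) ? data.length : from + chunk;
            // Each subtask is independent until the final correlation step.
            partials.add(pool.submit(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> p : partials) total += p.get();  // final summarization
        pool.shutdown();
        System.out.println("total = " + total);
    }
}
```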
Big data, particularly in the map-reduce-Hadoop model, uses a combination of clusters of data, distribution of queries, and “reduction” or combination of results. This means that the queries are divided and executed in parallel, another example of concurrency. In this case, the data-hosting, query-division, and result-combination processes ensure that you actually get concurrent execution and assimilation of results.
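To illustrate the shape of the idea (and only to illustrate it; real Hadoop clusters distribute work across machines, not threads in one JVM), here’s a tiny Java-streams sketch of map/reduce using a word count as the “query”.

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of the map/reduce idea: the "query" (counting words) is run in
// parallel across partitions of the data, then the partial results are combined.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> records = List.of("a b a", "b c", "a c c");

        Map<String, Long> counts = records.parallelStream()        // distribute the query
                .flatMap(line -> Arrays.stream(line.split(" ")))   // "map" step
                .collect(Collectors.groupingByConcurrent(
                        w -> w, Collectors.counting()));           // "reduce"/combine step

        System.out.println(counts);   // e.g. {a=3, b=2, c=3}
    }
}
```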
Concurrency of this type, meaning what’s essentially “concurrency within an application”, is complicated. In the cloud, we might have these issues, but we often have a different kind of concurrency, one where the parallelism of component execution is created by the need to service multiple events or transactions at the same time. Event-based systems that process many unrelated events, or front-end cloud systems that process parallel transactions from many users, are examples of this kind of component concurrency, and it’s the easiest kind of concurrency to manage because the parallel tracks are independent.
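A minimal sketch of that easy case, assuming nothing more than a thread pool and transactions that touch only their own data, could look like this; the transaction IDs and pool size are just placeholders.

```java
import java.util.concurrent.*;
import java.util.stream.*;

// Minimal sketch of the "easy" kind of concurrency: unrelated transactions from
// different users are handled in parallel, with no coordination between them.
public class IndependentEventsSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService frontEnd = Executors.newFixedThreadPool(8);

        IntStream.rangeClosed(1, 20).forEach(txnId ->
                frontEnd.submit(() -> {
                    // Each transaction touches only its own data; order does not matter.
                    System.out.println("processed transaction " + txnId
                            + " on " + Thread.currentThread().getName());
                }));

        frontEnd.shutdown();
        frontEnd.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```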
The hardest kind of concurrency is related to it, though. If we have an event-driven system whose events are created by elements that are cooperating to create a single behavior set, then those events occur asynchronously and can, to a degree, be processed in parallel, but they still have to be handled with the recognition that the system as a whole is trying to operate and all the parts and processes have to be coordinated. This is the special problem created by lifecycle management of multi-element applications and services.
Suppose you spawn an event to signal that you’ve started to deploy an access connection in a given area as a part of spinning up a VPN. That specific piece of the VPN is now transitioning toward operational, but not all of the VPN may be spun up. If there are a dozen different geographies being served, a dozen access connections are needed. We can’t release the VPN for service until they’re all operational.
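Here’s a purely hypothetical sketch of that gating logic; the geography names, the countdown approach, and the “release” step are illustrative, not drawn from any real orchestrator.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch: several access connections spin up asynchronously, and the
// VPN is only released for service once every geography reports "operational".
public class VpnReadinessSketch {
    public static void main(String[] args) throws InterruptedException {
        List<String> geographies = List.of("us-east", "us-west", "eu", "apac");
        CountDownLatch allOperational = new CountDownLatch(geographies.size());
        ExecutorService deployers = Executors.newCachedThreadPool();

        for (String geo : geographies) {
            deployers.submit(() -> {
                // stand-in for deploying the access connection in this geography
                System.out.println("access connection operational in " + geo);
                allOperational.countDown();
            });
        }

        allOperational.await();                 // hold the release until every piece is up
        System.out.println("VPN released for service");
        deployers.shutdown();
    }
}
```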
If that’s not complex enough, suppose that, as we’re spinning up our access connection, we find that a key resource has failed. A VNF hosted in a given data center can’t be hosted there because of a fault or overload. This is an error condition that has to be processed; we’d all agree on that. But we might have provisioned the deeper part of the VPN to create a connection to the data center we expected to use, and that’s no longer available. We have to “back out” that prior step and connect to another data center. While we’re trying to do that, there might be a fault in the original data-center-connection part of the service. We now have two different things trying to impact the same area of service at the same time. Why remedy a problem in something we’re decommissioning? And suppose we got those two events in the opposite order?
This is why you need state/event mediation of the event-to-process correlations for each piece of a service. We could structure our state/event tables to say that 1) if we were in a “decommissioning” state, we’d ignore fault events, and 2) if we were in a fault state and received a decommissioning event, we’d let that take precedence and suspend fault processing.
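A hypothetical sketch of such a state/event table, with those two rules encoded, might look like this; the states, events, and transitions are illustrative only, not taken from any real orchestration toolkit.

```java
// Hypothetical sketch of state/event mediation for one piece of a service.
public class StateEventSketch {
    enum State { ACTIVATING, OPERATIONAL, FAULT_PROCESSING, DECOMMISSIONING }
    enum Event { FAULT, DECOMMISSION, ACTIVATION_COMPLETE }

    static State next(State current, Event event) {
        switch (current) {
            case DECOMMISSIONING:
                // Rule 1: while decommissioning, ignore fault events entirely.
                if (event == Event.FAULT) return current;
                break;
            case FAULT_PROCESSING:
                // Rule 2: a decommission event takes precedence over fault handling.
                if (event == Event.DECOMMISSION) return State.DECOMMISSIONING;
                break;
            default:
                if (event == Event.ACTIVATION_COMPLETE) return State.OPERATIONAL;
                if (event == Event.FAULT) return State.FAULT_PROCESSING;
                if (event == Event.DECOMMISSION) return State.DECOMMISSIONING;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(next(State.DECOMMISSIONING, Event.FAULT));         // stays DECOMMISSIONING
        System.out.println(next(State.FAULT_PROCESSING, Event.DECOMMISSION)); // DECOMMISSIONING wins
    }
}
```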
Note that in this most-complex kind of concurrency, we have a benefit to explore too. Any event can be processed by a new instance of the component, spun up in a convenient location, as long as that instance gets the event passed to it and has access to the service data model that contains all the state and stored data required. Even a fault avalanche can be handled, both in terms of mediating how events are handled and in terms of creating enough processes to handle everything that arrives.
OK, that’s concurrency in cloud-native terms. Now on to “persistence”. I mentioned in a blog last week that you could differentiate between a classic “microservice” and a “function” by saying that microservices were persistent, meaning they stayed where you put them till they broke or you removed them, and functions were transitory, in that they were loaded only when actually needed.
To me the best argument for saying that microservices should always be stateless is that if they are, then the decision on persistence can be made operationally rather than at development time. If a component is expected to be used a lot (or if regular use is detected through monitoring) you could elect to keep it resident. If not, you could save resources (and, in a public cloud, costs) by unloading it.
You could also decide to keep components resident when idle unless you needed to use the capacity for something else. That would make the process persistence decision very dynamic and very much aligned with the actual needs of the applications/services. More efficient resource use equals lower costs.
In our lifecycle management examples, you can see that the ability to make a process selectively persistent would be a big benefit. If things are running normally, then error processes would be invoked rarely, and so unloading them while idle could save significant resources. If things start to go south, operations systems could decide to make more error processes (the ones either in use or likely to be used, an AI determination) persistent, to handle things faster.
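As a purely illustrative sketch, that operations-side persistence policy could be as simple as the rule below; the threshold and the “fault storm likely” flag are assumptions, presumably fed by monitoring or an AI element rather than hard-coded.

```java
// Illustrative sketch only: an operations-side policy that decides, at run time,
// whether a stateless component should stay resident or be unloaded when idle.
public class PersistencePolicySketch {
    static boolean keepResident(double invocationsPerMinute, boolean faultStormLikely) {
        // Frequently used components, or error handlers facing a likely fault storm,
        // are worth the resources to keep loaded; everything else can be unloaded.
        return invocationsPerMinute > 10.0 || faultStormLikely;
    }

    public static void main(String[] args) {
        System.out.println(keepResident(0.2, false));  // idle error handler: unload
        System.out.println(keepResident(0.2, true));   // fault storm expected: keep resident
        System.out.println(keepResident(50.0, false)); // hot path: keep resident
    }
}
```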
Concurrency plus persistence control equals optimum efficiency, scalability, agility, and performance. The keys to achieving this are first, to use stateless processes fed with all the data they need as their inputs, and second to have a data-model-driven state/event steering of events to processes. If you have this, then you can always spin up a process to respond to need, and manage the inherent tension between disconnected event-driven processes and the services or applications that necessarily unite them and demand coordinated handling.
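Putting those two keys together, a hypothetical sketch of a stateless process, one that is handed the event and the service data model as its inputs and returns an updated model, might look like this; the record names and the steering rule are illustrative, and it assumes a recent Java with record support.

```java
import java.util.*;

// Minimal sketch of the two keys named above: a stateless process fed with all the
// data it needs (the event plus the service data model), with the next state driven
// by the model rather than by anything held in the process itself.
public class StatelessSteeringSketch {
    record ServiceModel(String state, Map<String, String> data) {}
    record Event(String type, String payload) {}

    // Stateless: everything it needs arrives as arguments; it returns a new model,
    // so any instance, spun up anywhere, can handle the event.
    static ServiceModel handle(ServiceModel model, Event event) {
        Map<String, String> updated = new HashMap<>(model.data());
        updated.put("lastEvent", event.type());
        String nextState = model.state().equals("DECOMMISSIONING") && event.type().equals("FAULT")
                ? model.state()                      // steering rule: ignore faults here
                : "PROCESSING_" + event.type();
        return new ServiceModel(nextState, updated);
    }

    public static void main(String[] args) {
        ServiceModel before = new ServiceModel("OPERATIONAL", Map.of());
        ServiceModel after = handle(before, new Event("FAULT", "link down"));
        System.out.println(after.state());   // PROCESSING_FAULT
    }
}
```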
An important point here is that “cloud-native” isn’t a buzzword, but neither is it a simple one-dimensional concept. To take advantage of the cloud, you have to support concurrency, scalability, and resiliency in a way that maximizes cloud benefits. You also have to coordinate diverse autonomous but interdependent processes and foster agility in operations in a way that doesn’t mean rewriting everything to support new run-time conditions and issues. State is a big part of it, as is the mediation of the event-to-process relationships via an organized model of the application or service. It’s complicated, which is very likely why we’re having so much trouble coming to terms with it.
Another reason is that (as is often the case) we’ve tended to come at cloud-native from the bottom, starting with the software implementation of processes rather than with the architecture of the application or service within a hybrid/multi-cloud. If we started with issues like concurrency and persistence, we could derive things like stateless components, state control and orchestration, and event/process steering, and end up in the right place.
The “top” of “top-down” thinking is thus a bit fuzzy. Yes, you have to start with requirements, as Oracle pointed out in a recent blog that also linked to a useful e-book. Yes, you have to end up with specific software design patterns and associated tools, as we’re seeing in the Kubernetes ecosystem. What’s in between, though, is the “software architecture”, the application framework into which everything that’s developed fits and to which all operations practices and tools must apply. DevOps, for example, is just a concept if you don’t have some specific framework for applications and services that builds the bridge between the two pieces. We’re still not quite there on that framework, and the biggest challenge we may be facing in getting there is the “atomization” of the problem.
“The bigger it is, the harder it is to sell.” Salespeople know that it’s a lot easier to sell a can of car wax than to sell a car, and easier to sell a car than to sell a fleet. The problem with this tendency to think small and near-term thoughts in marketing and selling technology is that it doesn’t introduce the full scope of potential problems and benefits. That can limit the solution scope these early and fragmented initiatives expose us to, and that can create inefficiencies as time passes and broader needs and opportunities emerge. I think the right model for cloud-native is already out there, but it’s not yet recognized because of think-of-the-next-quarter’s-earnings myopia among vendors and buyers alike. It’s time to raise our eyes above our feet on this, folks. There’s a lot to gain…and lose.