What is the right model for the software that’s used to create network services? Sure, we can say that it’s “virtualized” or “containerized” or even “cloud-native” but are those terms really enough to define the properties of the stuff we hope will play a big role in network services? I don’t think so, but I do think we can evolve from them to some useful conclusions about what is needed.
A good place to start is with the concept of virtualization. Virtualization in a software-and-hardware infrastructure sense means creating a model of hardware which is recognized by software as real and specific, but can be applied instead to a pool of resources. This capability is what really permits the separation of software and hardware, and allows better scaling and availability. We need software that works in the virtual world in order to build our transformed network infrastructure.
The next question is just what a pool of resources really means. Obviously, it means more than a single resource, and it also means that there’s a presumption of multi-use in play. We have a pool because we have multiple consumers, and the consumers need some redundancy. By sharing the pool, that redundancy can be provided more efficiently, meaning with fewer total resources. That, in turn, lowers both capex and opex.
If the consumers of resources, the features that make up the services we’re creating, require different resource types, then a single pool won’t serve; we’d need a pool of resources for each distinct mission. In addition, a resource pool isn’t likely to be as useful, or useful at all, if there is only one mission or user at the point of consumption to share the resources. That means that the value of the pool is greatest at the point of largest mission diversity.
From this, we could say that the value of a resource pool would graph into something like a bell curve. Near the network’s edge at the left of our hypothetical graph, mission and consumer density is low, and so pool value is low. On the right, representing the core, we really have only one mission, which is to push bits. That mission can benefit from a resource pool, but not a pool of servers. Think instead of a pool of white boxes.
The middle of our curve represents the metro piece of the network, the part that begins at the inside edge of the access network, after primary traffic concentration has occurred, and ends when the traffic load effectively prohibits any service-specific manipulation of the network’s flows. It’s here, where we have an opportunity to add specific service features and where we have enough traffic to make economy of scale meaningful, that we need to think about the network as a cloud.
We have to start with a bit more on missions. Network services are made up of components at three levels, the lowest being the data plane, then the control plane, and finally the “service plane” where the actual user experience is created. The service plane is where “higher-layer” services live, and where development efforts by web and social-media players typically focus. This service plane is really, or should be, cloud computing pure and simple, so it’s the data and control planes we need to look at more deeply.
Operators almost universally believe that large-scale data-plane handling is best done by specialized hardware, which in an open-model network means white boxes with specialized switching chips. It’s possible to use traditional servers, or white boxes without specialized chips, in the access network or even at the metro edge or in edge aggregation, but operators who have thought their metro architecture through believe that using servers as data-plane handlers deeper in the metro is likely to be justified primarily where there’s another set of missions that would justify creating a significant resource pool at that location.
Control-plane applications are a bit more complicated, because the term applies both to the data-plane protocol’s control-packet exchanges and to the coordination of data-plane services that lives a level higher, such as 5G’s control plane. To the extent that the control plane is tightly coupled to data-plane behavior, as it is with traditional IP networks, the control-plane processing has to be fairly local to the data plane or the latency could create a performance problem during periods of “convergence” on a new route set. That doesn’t necessarily mean that control and data plane have to be supported on the same hardware, only that they would have to be supported in proximity to one another.
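To make the proximity point concrete, here’s a minimal sketch of how a placement process might screen candidate hosting sites against a control-to-data-plane latency budget. The site names, latency figures, and the 10 ms budget are illustrative assumptions, not measured values or anyone’s actual tooling.

```python
# A minimal sketch of a proximity check for control-plane hosting.
# Sites, latency figures, and the 10 ms convergence budget are assumptions.

CONVERGENCE_BUDGET_MS = 10.0  # assumed tolerable control-to-data-plane latency

candidate_sites = {
    "metro-edge-1": 2.5,    # round-trip ms to the data-plane device
    "metro-core-1": 6.0,
    "regional-dc-1": 18.0,
}

def eligible_control_plane_sites(sites, budget_ms=CONVERGENCE_BUDGET_MS):
    """Return the sites close enough to host tightly coupled control-plane logic."""
    return [name for name, rtt in sites.items() if rtt <= budget_ms]

print(eligible_control_plane_sites(candidate_sites))
# ['metro-edge-1', 'metro-core-1'] -- the regional site is too far away
```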
What we have, then, in a mission sense, is a service-plane requirement that would be hosted in a traditional cloud manner. Where this hosting is available, it’s likely that some local control-plane hosting could be provided, and even limited data-plane hosting. The configuration of the data center where all this happens would have to ensure that the latency between data-plane devices and control-plane hosting points is suitable. Thus, white boxes might be co-located with traditional servers.
Within a given resource type, a pool could be created in the metro areas because of opportunity/mission density. Within that pool there needs to be some specific mechanism to deploy, redeploy, and scale elements, which gets to two topics. First, what’s the foundation of that deploy-and-so-forth process, and second, what can we say about the relationship between the foundation theory and the cloud’s specifics.
Virtualization requires that there be a mapping between the model element (a container, virtual machine, intent model, or whatever you like) and the realization of that element on the resources available. In networking applications in particular, given the fact that resource requirements aren’t uniform, we need to distinguish between what we’ll call resource sets and resource pools.
A “resource set” here means a collection of resources that are managed so as to support virtual-to-resource mapping. The resources in a set have a consistent set of properties, so they’re “equivalent” in resource terms. A “resource pool” is an overlay on one or more resource sets that extends that equivalence through the entire overlay. Thus, three data centers would make up three resource sets, and could host one or more resource pools across some or all of those sets. The number and extent of pools would depend on the properties that could be “equivalent”, meaning that we could extend a pool as long as the extension didn’t compromise the properties that the pool advertised to the mapping process. So if we had adequate latency for “Pool A” across two of our three resource sets but not the third, we could define Pool A across the two suitable sets only.
What’s inside a resource pool would, of course, depend on the missions to be supported. I think it’s fair to propose that servers and white boxes would have to coexist as resource sets, and it’s also likely that there would be a combination of virtual-machine and containerized deployments, perhaps even bare metal. That would allow resource pools to be built considering the hosting capabilities of the resources, including special chip support such as GPUs or networking chips. We can then define “adequate” as including both the inherent features of the resources and their topology as it relates to latency and perhaps the risk of a common failure such as a flood or power loss.
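Here’s a small sketch, under assumed names and thresholds, of the resource-set/resource-pool distinction, including the “Pool A” case where a pool spans only the sets whose latency doesn’t compromise its advertised properties. Nothing here is a real orchestration API; it just illustrates the mapping logic.

```python
# A minimal sketch of resource sets and the pools overlaid on them.
# Set names, properties, and thresholds are illustrative only.

from dataclasses import dataclass, field

@dataclass
class ResourceSet:
    name: str
    hosting: set          # e.g. {"vm", "container", "bare-metal"}
    chips: set            # e.g. {"gpu", "switch-asic"}
    latency_ms: float     # assumed latency from this set to the pool's anchor point

@dataclass
class ResourcePool:
    name: str
    max_latency_ms: float                      # the latency the pool "advertises"
    required_hosting: set = field(default_factory=set)
    required_chips: set = field(default_factory=set)
    members: list = field(default_factory=list)

    def try_extend(self, rset: ResourceSet) -> bool:
        """Extend the pool only if doing so doesn't compromise its advertised properties."""
        ok = (rset.latency_ms <= self.max_latency_ms
              and self.required_hosting <= rset.hosting
              and self.required_chips <= rset.chips)
        if ok:
            self.members.append(rset.name)
        return ok

sets = [
    ResourceSet("dc-1", {"vm", "container"}, {"gpu"}, 3.0),
    ResourceSet("dc-2", {"vm", "container"}, set(), 4.0),
    ResourceSet("dc-3", {"container"}, set(), 12.0),
]

pool_a = ResourcePool("Pool A", max_latency_ms=5.0, required_hosting={"container"})
for s in sets:
    pool_a.try_extend(s)

print(pool_a.members)   # ['dc-1', 'dc-2'] -- dc-3 fails the latency test
```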
For data-plane applications, as I’ve noted, higher traffic levels will require special networking chips, and that requirement would create a resource set of its own. Lower-traffic data-plane applications and most control-plane applications would require servers, either VMs or containers depending on traffic requirements. Containers, being lower-overhead constructs, are better for lighter-duty control-plane activity because you can stack more of them on a server.
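As a rough illustration of that mission-to-resource mapping, the sketch below picks a hosting class from a feature’s plane and traffic load. The 100 Gbps threshold and the category labels are assumptions made purely for the example.

```python
# A rough sketch of mapping a mission to a hosting class.
# The traffic thresholds and category names are assumptions for illustration.

HIGH_TRAFFIC_GBPS = 100  # assumed point at which a switching chip is justified

def hosting_for(mission_plane: str, traffic_gbps: float) -> str:
    """Pick a hosting class for a feature based on its plane and traffic load."""
    if mission_plane == "data" and traffic_gbps >= HIGH_TRAFFIC_GBPS:
        return "white-box with switching chip"
    if mission_plane == "data":
        return "server (VM or container, depending on load)"
    if mission_plane == "control" and traffic_gbps < 1:
        return "container"          # light-duty control plane stacks densely
    return "VM or container"

print(hosting_for("data", 400))     # white-box with switching chip
print(hosting_for("control", 0.2))  # container
```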
Just getting stuff deployed isn’t the end of the issues relating to the software model. Part of the adequacy consideration is the workflow connectivity support available. Latency, obviously, is related to the efficiency of connectivity. A distributed application is an application whose components interact via workflows, and if those interactions are limited by capacity or latency, then at some point the connectivity support becomes inadequate and the resources or resource sets don’t belong in a pool that requires a better level of connectivity. But connectivity is more than just supporting workflows; it’s also about being able to find the next hop in the workflow, to balance work among instances of the same process, and to replace broken resources with new ones.
The point here is that “orchestration” isn’t just about creating a mapping between the model and the resources, it’s also about mapping the workflows effectively. The workflow mapping is likely more dynamic than the component-to-resource mapping, because there are conditions outside a given component/resource pairing that could impact effective workflow exchanges. For example, if you add a second instance of a component, does that new instance, and the load balancing it requires, reduce performance to the point where the distributed application won’t meet its spec?
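A toy example of that question: the sketch below compares end-to-end workflow latency before and after a second instance (and its load-balancer hop) is introduced, against an assumed 20 ms spec. All of the figures are invented for illustration.

```python
# A minimal sketch of the "does scaling break the spec?" question.
# Latency figures and the 20 ms end-to-end target are assumptions.

WORKFLOW_SPEC_MS = 20.0

def workflow_latency(hop_latencies_ms, load_balancer_ms=0.0):
    """End-to-end latency of a workflow: the hops plus any load-balancing overhead."""
    return sum(hop_latencies_ms) + load_balancer_ms

single_instance = workflow_latency([4.0, 6.0, 5.0])
scaled_out = workflow_latency([4.0, 6.0, 5.0], load_balancer_ms=7.0)

print(single_instance, single_instance <= WORKFLOW_SPEC_MS)  # 15.0 True
print(scaled_out, scaled_out <= WORKFLOW_SPEC_MS)            # 22.0 False -- scaling broke the spec
```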
Deployment mapping identifies where components are hosted, but the connectivity created among those mapped resource points has a real-time, ongoing component to it. In the cloud, the former is handled by things like DevOps tools or Kubernetes, and the latter by things like service mesh technology. There’s obviously an interdependence in play, too; where you put something could and should be influenced by the connectivity that could be made available. This might be something that Juniper will be enhancing as it integrates its Apstra approach, because Apstra takes some resource set and resource pool operations activities from pure static deployment into something more adaptive.
You don’t need a service mesh, containers, or Kubernetes for this; a resource set should contain its own orchestration and connection tools. What you do need is some federation tool, something like Google’s Anthos, and you need to have the resource-set tools subordinate to that federation tool so you can apply global policies. Since we don’t today recognize “federation” at the connection layer, we would either have to create a service mesh across all the resource sets, to be used by the resource pools, or devise a connection federation strategy. Virtual-network capabilities like SD-WAN could be used to frame such a connection layer, but you still need discovery and load balancing, and the mechanisms for these could reside either with the connection-layer support (as they do with the Istio service mesh) or within the deployment layer (Kubernetes has some support for them).
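The sketch below illustrates the federation relationship being described: per-resource-set orchestrators kept subordinate to a global layer that pushes common policies down to all of them. The class names and policies are hypothetical; this isn’t Anthos’ or Kubernetes’ actual API.

```python
# A sketch of the federation idea: resource-set tooling subordinate to a
# global layer that applies common policies. Names are hypothetical.

class ResourceSetOrchestrator:
    """Stands in for whatever deployment/connection tooling a resource set owns."""
    def __init__(self, name):
        self.name = name
        self.policies = {}

    def apply_policy(self, key, value):
        self.policies[key] = value
        print(f"{self.name}: applied {key}={value}")

class FederationLayer:
    """Holds the subordinate orchestrators and pushes global policies to all of them."""
    def __init__(self, orchestrators):
        self.orchestrators = orchestrators

    def set_global_policy(self, key, value):
        for orch in self.orchestrators:
            orch.apply_policy(key, value)

federation = FederationLayer([ResourceSetOrchestrator("dc-1"),
                              ResourceSetOrchestrator("dc-2"),
                              ResourceSetOrchestrator("dc-3")])
federation.set_global_policy("workflow-encryption", "mTLS")
federation.set_global_policy("max-hop-latency-ms", 5)
```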
Now for the question of “cloud-native”. In order for software to be scalable and replaceable on demand, the components have to be stateless, meaning microservices. They don’t necessarily have to be functions or lambdas, which are a subset of microservices where stateless behavior is achieved by requiring that the output always be a function of the input and nothing else; that’s the purist view. Other forms of state control (like the most popular one, back-end state control) are also valid ways of keeping microservices stateless.
Stateless data planes aren’t realistic; you need forwarding tables at the least. For the control plane, you could implement a stateless set of components if the interactions were controlled via a state/event table that maintained state externally, and invoked the components when an event was received. Protocol handlers have been written this way for at least 40 years. I don’t believe that serverless or function implementation of most control-plane activities is useful because you wouldn’t want to introduce the latency of loading the function every time an event occurred. Traditional persistent microservices would appear to be the best approach.
Management functions, especially service lifecycle automation, are part of the service plane and they’re an example of a service-plane application that can be treated as a control-plane application, with microservice and state/event implementation. This is what my ExperiaSphere project demonstrated, and also what I’d recommended to the NFV ISG as an implementation model. It’s based on the TMF’s NGOSS Contract approach, dating to the 2000s and since apparently dropped by the TMF, and I believe that it’s the model ONAP should have been based on.
The advantage of state/event implementations for the control and data planes is that they provide for elasticity, parallelism, and all the good stuff we expect from cloud-native. There is a single specific data model that represents the system (the interface, service component, etc.) you’re automating, and within it there’s a state/event table that lists the operating states the system could take and the events that are recognized. For each state/event intersection, there’s an identified process (or process set) and a “next-state” indicator. When an event occurs, it’s looked up in the table, the process set is executed (passing the data model as the variables), and the next state is then set. Again, this is how protocol handlers have been written for decades.
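Here’s a minimal Python sketch of that state/event pattern applied to a service component’s lifecycle. The states, events, and processes are illustrative; a real NGOSS-Contract-style model would be far richer, but the lookup, execute, set-next-state cycle is the same.

```python
# A minimal sketch of the state/event pattern for a service component's lifecycle.
# States, events, and processes are illustrative assumptions.

def start_deployment(model):  model["log"].append("deploying")
def confirm_active(model):    model["log"].append("active")
def start_recovery(model):    model["log"].append("recovering")

# (state, event) -> (process set, next state)
STATE_EVENT_TABLE = {
    ("ordered",    "activate"): ([start_deployment], "deploying"),
    ("deploying",  "deployed"): ([confirm_active],   "active"),
    ("active",     "fault"):    ([start_recovery],   "recovering"),
    ("recovering", "deployed"): ([confirm_active],   "active"),
}

def handle_event(model, event):
    """Look up the event against the current state, run the processes, set the next state."""
    key = (model["state"], event)
    if key not in STATE_EVENT_TABLE:
        return  # unrecognized event in this state; a real system would log it
    processes, next_state = STATE_EVENT_TABLE[key]
    for proc in processes:
        proc(model)            # the data model is passed as the variables
    model["state"] = next_state

service = {"state": "ordered", "log": []}
for ev in ["activate", "deployed", "fault", "deployed"]:
    handle_event(service, ev)

print(service["state"], service["log"])
# active ['deploying', 'active', 'recovering', 'active']
```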
The nice thing about this is that it facilitates a kind of distributed model of event-handling. You could parcel out a service data model to the specific areas where a given feature/function implementation resided. Each of these “sub-models” would have the same properties in terms of state/event management and microservice implementation, and each sub-model could generate an event to a higher-level sub-model, or to the main model, when something required action that the area sub-model couldn’t handle on its own, such as a total failure. There’s not much point to having distributable components in a cloud-native model without having a mechanism to manage them in a distributed way.
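And a companion sketch of the sub-model idea: each sub-model handles the events it can and escalates the ones it can’t (here, a total failure) to its parent. The names and the escalation rule are assumptions for illustration.

```python
# A sketch of distributed sub-models: local handling where possible,
# escalation to the parent model otherwise. Names are illustrative.

class SubModel:
    def __init__(self, name, parent=None, can_handle=()):
        self.name = name
        self.parent = parent
        self.can_handle = set(can_handle)

    def handle_event(self, event):
        if event in self.can_handle:
            print(f"{self.name}: handled '{event}' locally")
        elif self.parent is not None:
            print(f"{self.name}: escalating '{event}'")
            self.parent.handle_event(event)
        else:
            print(f"{self.name}: top-level handling of '{event}'")

service_model = SubModel("service-model", can_handle={"total-failure"})
metro_submodel = SubModel("metro-submodel", parent=service_model,
                          can_handle={"instance-failure", "congestion"})

metro_submodel.handle_event("instance-failure")  # handled locally
metro_submodel.handle_event("total-failure")     # escalated to the service model
```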
So that’s my view. We are not there yet with respect to this vision, but I believe we could get there pretty quickly. If we do, then there’s a good chance that we’ll have the formula needed for a transformation to a software-centric network future.