We don’t seem to be able to shake the topic of cloud-native technology in the telco world. It’s also getting merged with what I think is a more general topic, which is “telco cloud”. As I noted in an earlier blog, I’ve been fiddling with the design of a digital-twinning “metaverse-of-things” software model, and it’s teaching me some interesting things about both these topics. Will the lessons advance the overall understanding of the issues? Who knows, and they’re certainly not going to end the seemingly endless discussions, but I think they may at least be helpful to those trying to gain some perspective, and the telco planners I’ve talked with seem to agree.
I’ve gotten comments on “cloud-native” and “telco cloud” from every operator contact I have (all 88). Neither term is consistently understood, and in some cases I disagree with the definition that has the largest number of adherents. For example, I define “cloud-native” as “developed explicitly for the specialized architecture of the cloud, and not suitable for standard data center deployments,” whereas 52 of the 88 telco planners think it simply means “designed for the cloud” rather than having been migrated to it.
It doesn’t help that operator views of both terms are impacted by the way operators view hosted features and functions, which in turn depend on what they do with them, and even on their primary service geography. NFV, for example, is seen as a major example of both terms for EU operators, less so for operators in the US and Asia, and not much at all for those in developing countries. 5G is seen as an example of both by all 88 operators.
Nor does media/analyst coverage of the two topics help solidify either seller or buyer views. Every telco planner tells me that there is no consistency among either group, and all agree that lack of consistency in coverage means there’s less force being applied to accept standard definitions. Whatever you think either concept means, you can find somebody’s story that agrees. Finally, both these groups have (as I noted) tended to conflate the terms. Omdia did a report that talks about the essential role of GitOps (a repository-based strategy for development harmony) and cloud native, when many of the justifications offered are really more about rapid development in general, relevant to some aspects of telco cloud, than about cloud-native technology in particular.
With regard to “telco” views of either topic, I’m convinced that the big question is revealed in that “designed-for-the-cloud” definition. NFV, for example, is considered to be “cloud-native” by most of the 88 operators, even those who don’t think it represents an example of their own use of the technology. But NFV by definition is the translation of physical network functions to virtual network functions, which would make it a migration strategy more than a development strategy. I agree that to be cloud-native something has to be designed for the cloud, but that has to mean “designed to be cloud-specific or at least cloud-preferencing” rather than just being written to run there when it could also be run elsewhere.
What really qualifies, and what does not? I think 5G offers some indirect guidance there. 5G defines a control plane and a user plane. The user plane in 5G is really made up of the 5G network and some elements designed to gate traffic on and off. The control plane controls service behavior. It’s hard for me to see user plane implementations being truly cloud-native, but the control plane is another story. The reason is the relationship between “cloud-native”, “event-driven”, and “functional/microservices”.
The primary property of the cloud is “agility”. It can replace things that are broken, scale processes to match load requirements, and so forth. The problem in the data plane is that these benefits are constrained by two things: facility interdependence and context. You can surely pop up an instance of a router on demand, but routers need facilities to connect them, and fiber can’t be instantiated by a console command. There’s an interdependence there. But even if you pop up another instance of something in the data plane, and do so at a place where it can be connected, you have to concern yourself with the context of the information that’s in flight, the relationship between consecutively sourced packets and their meaning. In the Internet and other IP networks, the TCP layer is there to ensure that the adaptive behavior inherent in IP route determination doesn’t result in out-of-order arrivals that could cause an application fault. But even TCP can’t address all context issues; buffering can create problems for real-time applications, which often have to deal with the problem of missing events.
A generalized data flow can’t assume there are no contextual challenges in any of the conversations that make it up, which is why I don’t see data-plane/user-plane features as being inherently suitable for cloud-native implementation. Control reactions are another story, because it’s almost always possible to view them as being events in a state/event system.
A state/event system is a real-time system that has a finite number of operating states and a finite number of signaling events. An event is handled based on its identity, its parametric information, and the state of the system at the time. From the very early days of networking, one thing all such systems had to contend with was the “timeout”. When you send something that’s supposed to generate a response, you can’t just assume you’ll get it. If you do assume that, and the partner system fails or connectivity is lost, the whole real-time system locks up waiting for something that will never happen. So you set a timer, and when it expires, the timeout event triggers a reaction appropriate to handling the lost signal.
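To make that concrete, here’s a minimal sketch of a state/event table with an explicit timeout event. The states, events, and handler names here are purely illustrative, not drawn from any particular standard or product:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Session:
    """The context of the state/event system; handlers keep nothing themselves."""
    state: str = "IDLE"
    retries: int = 0

def on_send(session: Session, event: dict) -> str:
    # A real implementation would arm a timer here so a TIMEOUT event
    # arrives if no reply ever shows up.
    return "AWAITING_REPLY"

def on_reply(session: Session, event: dict) -> str:
    session.retries = 0
    return "IDLE"

def on_timeout(session: Session, event: dict) -> str:
    # The lost-signal reaction: retry a bounded number of times, then fault.
    session.retries += 1
    return "AWAITING_REPLY" if session.retries < 3 else "FAULT"

# The table that maps (current state, event) to a handler. Anything not
# listed here isn't meaningful in that state and is simply ignored.
TABLE: Dict[Tuple[str, str], Callable[[Session, dict], str]] = {
    ("IDLE", "SEND_REQUEST"): on_send,
    ("AWAITING_REPLY", "REPLY_RECEIVED"): on_reply,
    ("AWAITING_REPLY", "TIMEOUT"): on_timeout,
}

def dispatch(session: Session, event_name: str, event: dict) -> None:
    handler = TABLE.get((session.state, event_name))
    if handler:
        session.state = handler(session, event)
```

The TIMEOUT row is the point of the paragraph above: the system never sits waiting forever for a reply that isn’t coming.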
Any well-designed control-plane setup is going to be modeled as a state/event system, and the nice thing about that is that the model maps well to cloud optimality. You need a state/event process when a given event is received and mapped to the current state (via a table or graph). Until then, you really don’t need the process, which means you could spin it up as long as the process didn’t store information within itself. Good state/event design practices have always assumed that you could not store data in the processes; you either got the data from the event you’re processing or it was stored as a set of variables associated with the state/event system itself. Thus, you could spin up your indicated process not only when needed, but where it was best deployed, and in whatever quantity your need indicated. That, to me, is what cloud-native should really mean at the implementation level. This is also what IoT demands, what a metaverse-of-things would demand.
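To illustrate the “no data stored in the process” point, here’s an equally minimal sketch that builds on the table above, with a plain dictionary standing in for whatever external datastore a real deployment would use. Because every piece of context the handler needs comes in from outside, any instance of it, spun up on demand and wherever deployment is best, can process the event:

```python
# A stand-in for an external datastore; in a real system this would be a
# database or distributed cache, not the memory of any one process instance.
SESSION_STORE: Dict[str, Session] = {}

def handle_event(session_id: str, event_name: str, event: dict) -> None:
    # Load the context that belongs to the state/event system itself...
    session = SESSION_STORE.setdefault(session_id, Session())
    # ...run a handler that holds no state of its own...
    dispatch(session, event_name, event)
    # ...and persist the updated context before this instance goes away.
    SESSION_STORE[session_id] = session
```

Since the store, not the handler, holds the state, spinning handler instances up or down is purely a capacity decision, which is exactly what cloud agility is supposed to buy you.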
This is one reason why Nokia’s Technology 2030 stuff is interesting to me. Digital twin technology is characterized by its dependence on synchronizing a software model with a real-world system or device. That is, to my mind, inevitably state/event-based, and so Nokia might be committing to a broad use of this kind of software, which would then help them render their software elements in true cloud-native form.
You can implement an interface (the stuff that standards like 5G tend to define, and what’s found on most network-connected devices) as a state/event process. You can also do one in other ways, most of which are going to leave you with potential network deadlocks, because some sequence of conditions results in being stalled waiting for something that will never occur, or locked in an illogical state. The graph or table that defines what events are meaningful and what states the system can be in is absolutely fundamental to event-driven processes, and in my own view it’s also fundamental to cloud-native development. I think that often gets overlooked because we focus on the processes, the microservices, which themselves are necessarily stateless. We need to focus instead on the things that invoke them, the things that understand state and context.