Progressions are always interesting things to look at. That’s particularly true when they represent a change in thinking about a topic that’s important to both buyers and sellers of technology. It’s perhaps even more true when the progression we’re considering isn’t widely recognized, and so neither are its value and future importance. That’s the case with the progression related to IT and network operations.
Back in the good old days, we talked about “network management” in terms of the acronym FCAPS, for “Fault, Configuration, Accounting, Performance, and Security,” because those five elements were accepted as the things that had to be known about networks. It doesn’t take a rocket (perhaps we should say these days, an “AI”) scientist to recognize that you could assign those five parameters to data centers, the cloud, and probably a lot of other stuff as well. Today, though, of the 244 enterprise tech types I drew some operations comments from, only 137 professed to have heard of the acronym and only 48 defined it correctly. So much for the good old days.
What do people talk about now? That group of 244 offered a bunch of terms, with “observability” (209), “closed-loop” (178), “actionable intelligence” (77), “digital twin” (41), and “contextual” (32) among them. In a sense, even the terms themselves can teach us something, because FCAPS was a categorization, and now the focus is on properties and realizations. Digging deeper, though, might uncover something that even meets the “actionable” goal.
Technical systems, like almost all systems of multiple cooperating elements, can be viewed in two distinct ways. The first is as a collection of elements that present a set of properties related to their operation. What we might call “monitoring” or “logging” collects those properties, and we could relate our FCAPS categories to what the monitoring/logging process collected. The second is as a state, one that represents not the elements themselves but the result of their cooperation.
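To make the distinction concrete, here’s a minimal Python sketch (all names and thresholds are hypothetical, not drawn from any real toolset) contrasting the element view that monitoring/logging collects with the collective state that the cooperation produces:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ElementMetrics:
    """Per-element view: the kind of data monitoring/logging collects (FCAPS-style)."""
    name: str
    faults: int          # Fault
    config_drift: bool   # Configuration
    utilization: float   # Performance, 0.0 - 1.0

def derive_service_state(elements: List[ElementMetrics], latency_ms: float) -> str:
    """Collective view: the state of the cooperating system, not any one element.

    A service can be 'degraded' even when no single element reports a fault,
    e.g. when end-to-end latency drifts because several links are merely busy.
    """
    if any(e.faults > 0 for e in elements):
        return "impaired"
    if latency_ms > 200 or any(e.utilization > 0.9 for e in elements):
        return "degraded"
    return "normal"

# Example: no element is faulty, yet the cooperative result is degraded.
edge = ElementMetrics("edge-router", faults=0, config_drift=False, utilization=0.95)
core = ElementMetrics("core-switch", faults=0, config_drift=False, utilization=0.40)
print(derive_service_state([edge, core], latency_ms=250))  # -> "degraded"
```

The point of the sketch is simply that the second view is computed from the cooperation, not read off any single element.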
To me (at least) the “observability” movement is justified first by recognition of the concept of state, of collective and cooperative functionality. However, that connection isn’t always made. Instead, discussions of observability have tended to focus on ways of adding information about the elements, primarily by introducing explicit software probes that can report on information flows from inside, where flow meanings are known. I think that’s important as a technique, but it can’t be allowed to overshadow the question of what you’re “observing”. To me, observation is an exploration of the visible properties of something, not an examination of the parameters of the things that make it up. In other words, of its “state.”
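As an illustration of the probe technique, here’s a toy in-process probe written in Python. It’s a hedged sketch, not any particular observability library’s API; the decorator, function, and flow names are all invented for the example:

```python
import functools
import json
import time

def flow_probe(flow_name):
    """A toy in-process probe: reports on an information flow from inside the
    software, where the meaning of the flow is known to the developer."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            ok = True
            try:
                return fn(*args, **kwargs)
            except Exception:
                ok = False
                raise
            finally:
                # In practice this record would go to a collector; printing
                # keeps the sketch self-contained.
                print(json.dumps({
                    "flow": flow_name,
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                    "success": ok,
                }))
        return wrapper
    return decorator

@flow_probe("order-lookup")
def lookup_order(order_id):
    time.sleep(0.01)           # stand-in for a database or API call
    return {"order_id": order_id, "status": "shipped"}

lookup_order("A-1001")
```

Notice that the probe reports on the flow the software is carrying, which is exactly why it tells you more about what the user experiences than a device counter does.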
One big challenge, then, is how to determine state, and this is where I think there’s congruence between the goal and the mechanism. Introducing probes into software gives you a look at the flow of information, which is what the user is actually interacting with, or seeking. You could assume that observability means reading the state of information flows through the network, the “sessions” that represent user-to-resource relationships. But, and this is obviously a big question, is that enough? Remember that “closed loop” and “actionable” got top user scores, and both imply being able to turn information into remediation.
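A rough sketch of what “reading the state of sessions” could look like, assuming hypothetical flow records and an arbitrary latency SLO of my own choosing:

```python
from collections import defaultdict

# Hypothetical flow records, e.g. exported by probes or flow telemetry:
# (user, resource, latency_ms, error)
flows = [
    ("alice", "crm-app", 40, False),
    ("alice", "crm-app", 300, False),
    ("bob",   "crm-app", 35, False),
    ("bob",   "erp-app", 50, True),
]

def session_states(records, latency_slo_ms=200):
    """Roll flow records up into per-session (user-to-resource) state."""
    sessions = defaultdict(list)
    for user, resource, latency, error in records:
        sessions[(user, resource)].append((latency, error))
    states = {}
    for key, samples in sessions.items():
        errors = sum(1 for _, err in samples if err)
        worst = max(lat for lat, _ in samples)
        if errors:
            states[key] = "failing"
        elif worst > latency_slo_ms:
            states[key] = "degraded"
        else:
            states[key] = "healthy"
    return states

print(session_states(flows))
# {('alice', 'crm-app'): 'degraded', ('bob', 'crm-app'): 'healthy', ('bob', 'erp-app'): 'failing'}
```

This tells you which user-to-resource relationships are in trouble, but it says nothing yet about which box to fix, which is exactly the gap the next point raises.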
You can’t fix a virtual problem. You have to relate it to real-world conditions, which means abnormalities in individual elements of the network. If observability starts with a black-box view of infrastructure, we have to be able to get into the box if we want to turn observing into fixing. Making network operations actionable is, as I noted, a key goal, and to many users that means at the minimum having systems suggest remedies and not just identify causes. There are a number of ways that could be done, ranging from traditional and easy to transformational and complicated.
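One simple way to move from identified causes to suggested remedies is a playbook lookup. The sketch below is purely illustrative; the cause names, remedy strings, and the idea that unknowns escalate to a human are all my own assumptions, not any product’s behavior:

```python
# Hypothetical cause-to-remedy playbook: the system suggests actions for a
# human (or a policy gate) to approve rather than applying them blindly.
PLAYBOOK = {
    "link_congestion": ["shift traffic to an alternate path",
                        "raise QoS priority for the affected class"],
    "interface_down":  ["bounce the interface",
                        "dispatch a field check if flapping persists"],
    "config_drift":    ["re-apply the golden configuration"],
}

def suggest_remedies(diagnosed_causes):
    """Turn identified causes into suggested remedies; unknown causes fall
    through to a human so the loop never closes on a guess."""
    suggestions = {}
    for cause in diagnosed_causes:
        suggestions[cause] = PLAYBOOK.get(cause, ["escalate to operations staff"])
    return suggestions

print(suggest_remedies(["link_congestion", "mystery_alarm"]))
```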
The easy way would be to assume that service failures at the high, functional level are caused by abnormal conditions being reported below. Even if you couldn’t dive down through the boundary of the black box and identify causes directly, it’s fair to assume that if you fix all the reported breaks, you’d likely fix any service problems. Fair, but not totally true. A problem with a resource might or might not be recognized as a fault by the management system. Do all congestion and queuing problems show up that way? In many cases, you can examine MIBs to detect those issues, but they may not generate an alert. In any event, very few users would trust this sort of problem determination to drive a remediation process directly, and most (almost three-quarters) say they wouldn’t even want remedies to be suggested under this approach.
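For example, a poller could watch interface discard counters for growth even when no alarm fires. The sketch below uses a stand-in snmp_get helper and made-up thresholds rather than a real SNMP library; the OID shown is the IF-MIB ifOutDiscards column, but treat the details as illustrative:

```python
# Hypothetical helper; in practice you'd use an SNMP library to poll the device.
def snmp_get(device, oid):
    """Stand-in for an SNMP GET; returns a canned counter value for the sketch."""
    fake_counters = {"1.3.6.1.2.1.2.2.1.19.3": 1742}   # ifOutDiscards, ifIndex 3
    return fake_counters.get(oid, 0)

def congestion_check(device, if_index, prev_discards, interval_s=300, threshold_per_min=10):
    """Flag queue-drop growth that a fault-oriented alert stream would miss."""
    oid = f"1.3.6.1.2.1.2.2.1.19.{if_index}"            # ifOutDiscards for this interface
    current = snmp_get(device, oid)
    rate_per_min = (current - prev_discards) / (interval_s / 60)
    return current, rate_per_min > threshold_per_min

current, congested = congestion_check("edge-router", if_index=3, prev_discards=1500)
print(current, congested)   # 1742 True: roughly 48 discards/min over the last 5 minutes
```

The interface is “up” and faultless as far as the alert stream is concerned, yet the service riding it is quietly suffering, which is why polling the MIB matters.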
Another possibility is to use root cause analysis in some form, including AI/ML, to examine all the conditions in the network, from the functional observations down to device status, and draw conclusions. This is what’s seen by most enterprises as the state-of-the-art approach, but users say there’s wide variation in the success rate for root cause analysis of any sort, even within the same network. AI/ML analysis, though trialed by only a fifth of users so far, is said to do better but still not achieve “great success”. As a result, there’s a tendency to close the loop with cause-to-remedy coupling only within a part of a network (a data center, for example) or within the domain of a single vendor.
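A real RCA engine would be far more sophisticated, but the basic idea can be shown with a naive co-occurrence score over a hypothetical event log; every event, scope, and window value below is invented for the illustration:

```python
from collections import Counter

# Hypothetical event log: (timestamp_s, scope, condition)
events = [
    (100, "service", "session-degraded"),
    (101, "device:core-sw1", "high-queue-depth"),
    (160, "service", "session-degraded"),
    (161, "device:core-sw1", "high-queue-depth"),
    (162, "device:fw2", "config-change"),
    (400, "device:fw2", "config-change"),
]

def rank_root_causes(events, window_s=30):
    """Score device conditions by how often they co-occur with service symptoms.
    A stand-in for the statistical/ML correlation a real RCA engine would do."""
    symptoms = [t for t, scope, _ in events if scope == "service"]
    scores = Counter()
    for t, scope, condition in events:
        if scope == "service":
            continue
        if any(abs(t - s) <= window_s for s in symptoms):
            scores[(scope, condition)] += 1
    return scores.most_common()

print(rank_root_causes(events))
# [(('device:core-sw1', 'high-queue-depth'), 2), (('device:fw2', 'config-change'), 1)]
```

Even this toy version shows why success varies: the ranking is only as good as the events you feed it and the window you choose.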
Could digital twins help? The idea of using digital twins for network operations support isn’t new. There are references to the idea that go back as far as five years, and there are scholarly papers on the topic. There was a panel on the topic at MWC, as this article describes. There have been a number of initiatives aimed at digital-twinning networks, but most of them seem to have been aimed at a planning and optimization mission, or perhaps at a slice of the operations puzzle like 5G. Only a few companies have targeted network operations as a digital-twin mission, including Nokia, which is both the biggest player to have done so and the one that’s taken the most aggressive position. So far, I can’t find anyone who has actually taken the idea to what I think are the limits of its potential, or even described their approach in detail.
Digital twin structures have the inherent ability to correlate conditions, determine state, and even predict the results of changes. If you combine them with simulation and AI, they could come close to the results that a superbly qualified ops team could expect to produce. The most important thing about the digital-twin option is that it could, just as that qualified ops team could, take effective action to remediate problems. It could finally close the loop. I don’t think we’re there yet, but the limited initiatives that have been undertaken and publicized so far suggest that we’re at least moving in the right direction. If there is a grand future for network operations, I think this is the path to finally reach it.
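To suggest what closing the loop through a twin might look like, here’s a deliberately tiny sketch: a toy model of two links, a candidate remediation simulated against a copy of the twin, and a gate that only applies the change if the prediction looks safe. Every name, number, and threshold in it is hypothetical; real twins model topology, traffic, and policy in far more depth:

```python
import copy

# A toy twin: just enough state to predict the effect of a change before
# it touches the real network.
twin = {
    "links": {
        "a-b": {"capacity_gbps": 10, "load_gbps": 9.5},
        "a-c": {"capacity_gbps": 10, "load_gbps": 3.0},
    }
}

def utilization(state):
    return {l: v["load_gbps"] / v["capacity_gbps"] for l, v in state["links"].items()}

def simulate(state, action):
    """Apply a candidate remediation to a copy of the twin, never the real network."""
    trial = copy.deepcopy(state)
    if action["type"] == "shift_traffic":
        trial["links"][action["from"]]["load_gbps"] -= action["gbps"]
        trial["links"][action["to"]]["load_gbps"] += action["gbps"]
    return trial

action = {"type": "shift_traffic", "from": "a-b", "to": "a-c", "gbps": 3.0}
predicted = simulate(twin, action)

# Close the loop only if the prediction says no link ends up hot.
if all(u < 0.8 for u in utilization(predicted).values()):
    print("apply remediation:", action)
else:
    print("hold for human review")
```

The design point is the ordering: predict first, act second, which is exactly what a qualified ops team does in its head and what an automated loop has to do explicitly.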