A recent piece in SDxCentral on the data issues associated with network operator automation projects struck me as yet another example of the parallels between enterprises and operators. Enterprises looking at the use of AI in operations have made the very same point to me: the issue is far less one of the “right AI” than the challenges of visibility. You can, it seems, see too little, or see too much. Or, perhaps, both. Which raises another point enterprises make, one that wasn’t mentioned explicitly in the article.
What everyone in ops wants, so they tell me, is to eliminate complaints. In short, management is really about QoE more than about QoS. The latter is most often seen as valuable in the context of isolating things that impact QoE, meaning a byproduct of the process of problem isolation.
Visibility (or observability, in modern-think) is a better way of talking about too little information, because quality of information is more important than quantity, the very point I opened with. Any sort of IT/network infrastructure will generate information on the state of its elements, and of course the more elements there are the more information is generated. In classic network/operations management, the process of problem resolution starts with getting some sort of alert, which might be telemetry or might be a complaint. This triggers an analysis to establish the nature of the actual problem, and that in turn leads to steps to resolve it.
“Automation” is a general way of describing the ability of an operations tool to take action without explicit human instruction. The action can be diagnostic or remedial, and “without human instruction” can mean fully autonomous or with human approval.
AI can be applied to this in a variety of ways, again both diagnostic and remedial. However, all types of AI rely on the information available being sufficient to first identify a problem source, and then to consider the impact of remedies on the ongoing operations state. So, of course, do human-driven processes, but “experienced” operations types often recognize subliminal patterns, and if we expect AI to do that we have to make the source of this intuitive thinking visible, which may well mean gathering stuff we don’t currently get.
But how do we do that efficiently? To know everything is, in a sense, to know nothing, because you can’t cull everything-ness in any reasonable period of time using any reasonable set of resources. This raises, in my mind, the missing point—context.
Context is the set of relationships that exist in any physical system. If Thing A in some way relates to Thing B, then the relationship is a part of the context of each, and of the system they’re a part of. A simple way to understand the importance of context is to consider an application. There’s a user, and there’s application code. The basic relationship is that the user initiates something and the application returns a response. Dive down and this input-process-output model consists of the handling of the elements (input, process, and output) by a variety of resources. The pathway that binds it all is what we usually call a “workflow”.
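The idea of context as a set of relationships can be sketched in code. This is a minimal, hypothetical illustration, not any real tool's model; the element names (input-handler and so on) are invented stand-ins for the input-process-output workflow described above.

```python
# A minimal sketch of "context" as explicit relationships: each edge
# records that one element's behavior bears on another's, and the chain
# of edges is the workflow pathway. Element names are hypothetical.
from collections import defaultdict

class ContextGraph:
    def __init__(self):
        self.relations = defaultdict(set)

    def relate(self, a, b, kind):
        """Record that element `a` relates to `b` in manner `kind`."""
        self.relations[a].add((b, kind))

    def context_of(self, element):
        """Everything directly related to `element`."""
        return self.relations[element]

# The user/application workflow from the text, with made-up labels:
g = ContextGraph()
g.relate("user", "input-handler", "initiates")
g.relate("input-handler", "app-logic", "feeds")
g.relate("app-logic", "output-handler", "feeds")
g.relate("output-handler", "user", "responds-to")

print(g.context_of("app-logic"))  # {('output-handler', 'feeds')}
```

The point of even this toy version is that a fault in "app-logic" immediately identifies which neighboring elements could be affected, which is exactly the awareness automation needs.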
The workflow concept of context is likely a big part of what those “experienced” operations types have, that those with less experience don’t. If automation is going to work, then the automation element has to have contextual awareness, both to diagnose something and to assess the impact of possible responses. You can get this by having contextual relationships defined explicitly (a “digital twin” does this), by introducing trace points, or through observation or the analysis of historical data. The latter two are the knee-jerk artificial-intelligence or machine-learning solutions.
This aspect of observability is less an issue for operators, at least in a direct responsibility sense, since they only run the network infrastructure and not the hosted components. However, they have the problem of scale; there are way more elements under their roof, and many uncoordinated sources of traffic. In the end, this can shift the focus from completeness of information to information saturation, and in the case of operators this also creates a contextual problem. You have to consider the “state” of the network, and you have to worry about the behavior of the network in the current state, the goal state, and any states that it takes on during the transition.
There are problems with having AI/ML create context, according to both enterprises and operators. First, historical data is often unavailable, inconsistent, or not fully reflective of current reality, which makes it difficult for AI/ML to train on. Second, there are differences between things that are correlated and things that are causal. One amusing enterprise example I heard about was a system that recommended, when users reported a fault, turning on the lights. That turned out to be because the lights coming on was a condition the system had learned preceded normal operations. Understandable, but not only unhelpful but actually contributing to a rejection of AI by workers! Third, as the number of elements in the network increases, and as the behavioral variations possible for the continuum of network users increase, the number of states increases, and the chance of “learning” all the possible states decreases.
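The lights-on anecdote is easy to reproduce in miniature. This is a contrived sketch with an invented event log, but it shows how a correlation-only learner arrives at exactly that kind of recommendation.

```python
# Sketch of how naive pattern-mining mistakes correlation for cause,
# echoing the "turn on the lights" anecdote. The event log is invented.
from collections import Counter

# Each incident: the ordered events observed before service recovered.
incidents = [
    ["fault", "lights_on", "restart_switch", "recovered"],
    ["fault", "lights_on", "reseat_card", "recovered"],
    ["fault", "lights_on", "restart_switch", "recovered"],
]

# Naive "remedy miner": count every event that precedes recovery.
precede = Counter()
for events in incidents:
    idx = events.index("recovered")
    precede.update(events[:idx])

# "lights_on" precedes every recovery, so a correlation-only learner
# ranks it as highly as the actual remedies and recommends it.
print(precede.most_common(3))
```

Without a contextual model saying that lights have no relationship to the network elements involved, nothing in the data itself rules the recommendation out.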
The real issue with automating operations these days is how to integrate AI into it. Yes, the challenges to AI cited by enterprises and operators in my chats with them, and in the referenced article, are real. There is an observability problem, but it makes zero sense to even attempt to solve it unless it’s done in an AI context. Otherwise, all we’ve done is bear-goes-over-the-mountain-style advances. AI is the final mountain.
I think that some variant on the digital twin, or a state/event table/graph, is the right solution. Both provide a mechanism for embedding context into a data collection and recognizing the inherent statefulness of real-world processes. Digital twins are best for large systems; state/event concepts fit better in smaller ones, or perhaps even as components in a digital twin. I also think that some smaller systems might be addressable through AI/ML without a helping contextual partner element, but I wonder if that condition isn’t transient; the expanding notion of AI’s missions suggests that today’s smaller systems are simply elements of a future large one.
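To make the state/event idea concrete, here is a minimal sketch. The states, events, and actions are hypothetical, and a real table would be far larger, but the mechanism is the point: the response to an event depends on the state the system is in, which is precisely the context a naive learner lacks.

```python
# Minimal state/event table sketch: (current state, event) maps to an
# action and a next state. All names here are invented for illustration.
STATE_EVENT = {
    ("normal", "link_down"):   ("reroute_traffic", "degraded"),
    ("normal", "qoe_alarm"):   ("run_diagnostics", "diagnosing"),
    ("degraded", "link_up"):   ("restore_routes",  "normal"),
    ("diagnosing", "cleared"): ("log_resolution",  "normal"),
}

def handle(state, event):
    """Look up the contextual response; unknown pairs escalate."""
    action, next_state = STATE_EVENT.get((state, event),
                                         ("escalate_to_human", state))
    return action, next_state

print(handle("normal", "link_down"))   # ('reroute_traffic', 'degraded')
print(handle("degraded", "qoe_alarm")) # ('escalate_to_human', 'degraded')
```

Note that the same event can mean different things in different states, and unrecognized combinations fail safe to a human, which matters during the goal-state transitions described earlier.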
Salesforce has launched a YAML-based initiative to bring context to AI (“Snowflake”), but it’s not far enough along for me to be able to assess its value in offering contextual operations support. There’s no inherent reason why it couldn’t, but the goals of the process so far seem aligned more with the use of AI by enterprises in non-operations missions.
Missions are our big problem, because missions set requirements, and absent a clear vision of missions, it’s difficult to set firm technical specifications. “I can’t define my needs in an explicit, quantitative, sense,” one COO told me without a hint of irony or embarrassment, “but I know when they’re not being met.” Not a surprise given that “experience” is inherently subjective, but surely a challenge for traditional network and IT operations techniques. Intuitively, one would think that artificial intelligence would be able to glimpse our subjective needs, too.
We may be at a point where we have to abandon many of the cherished precepts of operations management, to reflect the reality that what drives all these processes is inevitably complaints, meaning experience issues, and to recognize that automation has to treat AI/ML as an essential element if it’s to be effective, responsive, and efficient in addressing our increasingly complex tech future.
