One early point of application for AI in general, and AI agents in particular, is network/IT operations. In the last six months, 154 enterprises have told me their own interest in this area has increased dramatically, and the number who have adopted or say they will adopt AI in that role has doubled. What’s behind this? There clearly has to be some specific mission or missions driving the new interest in AIops. Those 154 enterprises can shed some light on the matter.
By a slight margin (76 to 66), enterprises said their top mission for AIops was “reducing errors”. Second place was “reducing downtime”, and third (and last, with 12 mentions) was “faster problem response”. Obviously there’s a relationship between these three, and thus the ranking is really an indicator of how they view operations problems. Of the group, 88 have a specific idea of what the problem is, while the other 66 are outlook-oriented rather than focused on what they’re responding to.
Everyone agrees that the enemy overall is complexity. One CIO said “there’s nothing in my infrastructure that hasn’t gotten more complicated in the last decade, and nothing I don’t think is headed for more complications in the next one.” The number one source of the complexity problem is componentization of applications, though about a third of enterprises will specifically say “the cloud”. The point is that if you build up applications from components, then deployment and operationalization are more complicated, and network connectivity is as well. Just figuring out what’s supposed to be happening is a burden, and deciding what should be done is piled on top of whatever you think is the root cause. Then it’s implementing that decision without unintended consequences that hits you.
I think the last paragraph explains a lot. Enterprises who have looked deeply into the cause of network issues realize that what’s going wrong is almost surely a slip in one of those three steps, meaning an error. Only about 8% of enterprises see their root problem as an “original” problem like an equipment or service fault; most realize that the real problem lies in the accuracy of their response. That, I think, makes AIops an ideal solution.
But what do they expect AIops to do? Here we have some issues, because the “literati” of the group (44 of 154) say that the first and critical need is to establish a picture of the operating states and their validity, while the rest really offer no specific pathway to achieving that goal.
Complex tech systems, meaning IT and network infrastructure and all the components that make up applications, have many operating states, meaning many combinations of conditions that could be expected to exist. Some of these states allow “normal” operation, meaning they sustain the collective mission of the applications, while others are fault states that do not. Most, say the 44 literati, fall somewhere between. The literati say that the first of our three steps, figuring out what’s supposed to be happening, is really a matter of determining the current state and classifying the result into one of the three categories I just related. This provides a specific mission for AIops, which is essential if you expect to find a tool to do what you want.
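That first step, determining the current state and assigning it to one of the three categories, can be sketched in code. The following is a minimal illustration of the idea, not any vendor's implementation; the per-application QoE scores and the thresholds are hypothetical placeholders for whatever telemetry and policy a real AIops tool would use.

```python
from enum import Enum


class OpState(Enum):
    NORMAL = "normal"      # sustains the collective mission of the applications
    DEGRADED = "degraded"  # the in-between category most states fall into
    FAULT = "fault"        # does not sustain the mission

def classify_state(qoe_scores: dict[str, float],
                   normal_floor: float = 0.9,
                   fault_floor: float = 0.5) -> OpState:
    """Classify the current operating state from per-application QoE
    scores in [0, 1]. The thresholds here are illustrative only."""
    worst = min(qoe_scores.values())
    if worst >= normal_floor:
        return OpState.NORMAL
    if worst < fault_floor:
        return OpState.FAULT
    return OpState.DEGRADED
```

In practice the condition combinations would be far richer than a single worst-case QoE number, but the shape of the mission is the same: reduce many observed conditions to a classified state a tool can act on.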
The second step, deciding what should be done, is actually potentially more complex. The literati generally agree that given that the state of the system under control is determined by the collective QoE of its users, it’s not difficult to decide where to look to assign state. On the other hand, the elements of our application system, business-wide, are potentially highly interdependent. The literati offer a multi-phase approach to this step. First, you decide the range of things that would return the system to an acceptable operating state, then you pick the one that has the best combination of positive and negative impacts. They see this as a predictive process that might be either an AI analysis of the new state the remedy targets, judged against past history, or a simulation.
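The two phases the literati describe, enumerate viable remedies, then pick the best trade-off, can be sketched as a simple selection over predicted impacts. The `Remedy` type and its scores are hypothetical; in a real system the gain and risk numbers would come from the AI's historical analysis or from simulation, as the paragraph above notes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Remedy:
    name: str
    predicted_qoe_gain: float  # expected improvement, from history or simulation
    predicted_risk: float      # expected negative side effects on other elements

def choose_remedy(candidates: list[Remedy],
                  min_gain: float = 0.0) -> Optional[Remedy]:
    """Phase 1: keep only remedies predicted to restore an acceptable state.
    Phase 2: pick the best balance of positive and negative impacts."""
    viable = [r for r in candidates if r.predicted_qoe_gain > min_gain]
    if not viable:
        return None
    return max(viable, key=lambda r: r.predicted_qoe_gain - r.predicted_risk)
```

The point of the net-score comparison is the interdependence problem: a remedy with the biggest raw gain may still lose to one with fewer predicted side effects elsewhere in the application system.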
The third step, implementing the decisions necessary to bring about the new state the previous step identified, is also complicated, according to the literati. Problem resolutions almost always have to be implemented in a specific way, a specific sequence, or they fail. Enterprises say that the largest number of mistakes made in this particular step of activity is a failure to do things in the right order. How that order is determined is also a matter of prediction, again requiring either an AI recommendation based on past history, or a simulation.
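Once the required order is known, enforcing it is a classic dependency-ordering problem. A minimal sketch using Python's standard `graphlib`: the step names and their prerequisites below are hypothetical examples of a network remediation, not anything drawn from a real tool.

```python
from graphlib import TopologicalSorter


def plan_steps(deps: dict[str, set[str]]) -> list[str]:
    """Order remediation steps so each runs only after its prerequisites.
    graphlib raises CycleError if the stated dependencies are inconsistent."""
    return list(TopologicalSorter(deps).static_order())

# Hypothetical remediation: drain traffic before restarting a gateway,
# and verify QoE only after the restart completes.
steps = plan_steps({
    "drain_traffic": {"notify_ops"},
    "restart_gateway": {"drain_traffic"},
    "verify_qoe": {"restart_gateway"},
})
```

The interesting part, of course, is not the sort itself but where the dependency edges come from; that is the prediction problem the literati point to.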
What we have here, then, is something with multiple complex steps, so it’s no wonder enterprises are looking first and foremost to reducing operations errors. The problem, say the literati, is that it’s very difficult to apply AIops the way it should be used to accomplish what enterprises are looking for. The problem goes back to the issues of agency, and a related area we could call “jurisdiction”.
Find an AIops tool today, and you almost always find a tool with a very specific, very limited, scope. Let’s say you have a netops AI tool. Obviously, it handles network operations, and it’s likely that it does that for a single vendor and maybe even a single product area, like wireless LAN or data center networking. The literati agree that while limited-mission AIops like this are useful, they’re not as useful as AIops overall would be. The reason is simple; in the inter-reactive world of IT these days, almost nothing you do at the network level can be assessed for mission value in isolation. You might fix a network problem only to break something elsewhere, with worse impact. “You need to keep looking at QoE overall,” said one operations specialist, “not just network availability or QoS.”
Enterprises overall, including the literati, say that they are not being offered a truly systemic AIops strategy today. They also agree that this isn’t a barrier to their adopting some of the limited AIops they are offered, especially at the network level, but they do say that the limitation has an impact on what they’re prepared to let AI do.
Agent AI applied to just a part of operations isn’t something enterprises are generally willing to accept in autonomous-response form. None of the literati, and only 38 of the 154 who responded on AIops, said that they’d likely let a limited AI tool actually implement its recommendations. Thus, the value of AIops might be limited by the limits of its scope.
While vendors might love to fix that, it’s far from clear who’d step up. You’d need a strong position with data center hardware and platform software, and in network equipment, to be an ideal candidate. HPE/Juniper, were it to survive the DoJ challenge, might well be the candidate, and if they were they’d probably induce others to look at the space as well. It could add a little interest to the increased competition in the equipment space. Nothing does that as well as a potentially powerful differentiator, which full-scope AIops would surely be.
While you can’t get as much out of a limited-scope AIops as you might like, you can still get enough. Enterprises are generally happy with the tools they have, even those who wish the tools could do more. AIops is a good example of an agent-AI model, and it’s also likely a good example of an application that would benefit from a system of linked AI agents to increase its scope. I think that’s where we’re headed.