The recent Starlink service outage has once again raised the issue of software failures in network outages. While over 80% of enterprises and slightly under two-thirds of operators say the largest source of outage-minutes in their networks is human error, most will admit there's an underlying question of whether software should have prevented, or at least mitigated, those errors. And most of those who don't cite human error cite software/firmware problems as their prime outage source.
Those who have followed my blog will surely recognize my favorite Latin quote: "Quis custodiet ipsos custodes?", which means "Who will guard the guards themselves?" It's pertinent to the outage challenge because the most-claimed solution to outages has, for decades, been software. Recently, it's AI software. A couple dozen enterprises and operators have given the matter a lot of thought, and even some testing, and I think what they say on the topic is highly relevant.
The top point cited by these experts is that software service assurance of any sort has to be multi-layered. First, you have to build tools into network software that can recognize outages by observing operating state, and can then present an accurate picture of the problem to an operations center. Second, you need an independent monitoring and management tool that, while it relies on the underlying software for events and state, offers its own systemic assessment. Why two layers? Because if software errors are a major source of outages, you can't depend on the thing whose malfunction created an outage to then correctly identify and address it. Several of these experts talked about separating observability from analysis, and that seems a good characterization. But exactly how to do that isn't something the experts agree on; in fact, they split almost evenly between two approaches.
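Before getting to that split, it's worth making the layering itself concrete. Here is a minimal Python sketch of the separation of observability from analysis; the class names, the packet-loss metric, and the threshold are illustrative assumptions on my part, not drawn from any specific product:

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class DeviceEvent:
    """A raw state report emitted by the first (in-network) layer."""
    device: str
    metric: str
    value: float

class ObservabilityLayer:
    """First layer: built into the network software; it only observes and reports."""
    def collect(self) -> Iterable[DeviceEvent]:
        # A real implementation would stream telemetry from the devices themselves.
        yield DeviceEvent("core-router-1", "packet_loss_pct", 7.5)
        yield DeviceEvent("edge-switch-4", "packet_loss_pct", 0.2)

class AnalysisLayer:
    """Second layer: consumes only raw events, not the first layer's conclusions."""
    LOSS_THRESHOLD = 5.0  # illustrative threshold, not from any real tool

    def assess(self, events: Iterable[DeviceEvent]) -> List[str]:
        return [
            f"{e.device}: possible outage (loss {e.value}%)"
            for e in events
            if e.metric == "packet_loss_pct" and e.value > self.LOSS_THRESHOLD
        ]

# Because the analysis layer depends only on the event stream, a logic fault
# in the first layer's own diagnosis can't hide the problem from the second.
print(AnalysisLayer().assess(ObservabilityLayer().collect()))
```

The point of the structure is simply that the second layer never trusts the first layer's verdicts, only its observations.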
One group believes that the second layer is an overlay on the first, decoupled from everything except the information sources and working in parallel with the first layer, offering what in medicine would be the classic "second opinion." This group almost universally believes that the top layer should be based on AI. A majority also believes that the AI layer should take as input any recommendations or analysis offered by the first layer, and then report the degree to which its results match those of the traditional management tools.
The second group sees the "second layer" more generally as the "n-th layer," and envisions a hierarchy of tools: local "intent models" or management zones that are linked into higher-level structures, culminating in the network overall. Each zone has control of the things within it, each might be traditionally structured or AI-based, and each might operate under operator oversight or autonomously. The higher layers of the structure could be framed the same way.
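Here is a minimal sketch of that kind of zone hierarchy, assuming a simple escalation rule (a zone acts only on elements within its own scope and escalates anything broader); the zone and element names are hypothetical:

```python
from typing import Optional, Set

class ManagementZone:
    """A local 'intent model': responsible only for the elements in its scope."""
    def __init__(self, name: str, scope: Set[str],
                 parent: Optional["ManagementZone"] = None):
        self.name = name
        self.scope = scope      # element IDs this zone is allowed to act on
        self.parent = parent

    def handle(self, affected: Set[str]) -> str:
        # Act autonomously only if every affected element lies within this zone.
        if affected <= self.scope:
            return f"{self.name}: remediating locally"
        if self.parent is not None:
            return self.parent.handle(affected)   # escalate to the higher layer
        return f"{self.name}: cross-zone problem, operator coordination required"

# Illustrative hierarchy: a data-center zone beneath a network-wide zone.
network = ManagementZone("network", {"dc-sw-1", "dc-sw-2", "wan-rtr-1", "wan-rtr-2"})
dc = ManagementZone("data-center", {"dc-sw-1", "dc-sw-2"}, parent=network)

print(dc.handle({"dc-sw-1"}))                 # resolved within the zone
print(dc.handle({"dc-sw-1", "wan-rtr-1"}))    # escalated to the network layer
```

Whether a given zone is a traditional manager, an AI agent, or a human team is invisible to the layers above it; only its scope matters.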
The split between these two groups seems more accidental than the result of a real divergence in approach: those who build network operations practices around a single network operations center tend to fall into the first group, while those who manage networks in zones (data center, WAN, or whatever), or who have adopted autonomous aids or at least advisory tools from their vendors for parts of their networks, fall into the second.
Neither of these groups believes it has a complete solution to the outage problem, nor does either believe that anyone is offering one. Two things are cited most often as missing.
The first thing, cited by everyone, is simulation. What they'd like to see is a management layer that takes a remedy, whether proposed by an autonomous AI agent or a human operator, and assesses what the state of the network would be if that remedy were applied. The tool would then warn if a proposed solution might itself become a problem, and perhaps even require a management override before it could be applied.
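As a rough illustration of that workflow, here is a sketch of a "what-if" gate; the state model and the risk check are deliberately trivial stand-ins for a real simulator, and the names are my own:

```python
from typing import Callable, Dict, List

def simulate(state: Dict[str, float], remedy: Dict[str, float]) -> Dict[str, float]:
    """What-if evaluation: apply the remedy to a copy of the current state.
    A real simulator would model routing, capacity, and failure propagation."""
    predicted = dict(state)
    predicted.update(remedy)
    return predicted

def apply_with_gate(state: Dict[str, float], remedy: Dict[str, float],
                    approve: Callable[[List[str]], bool]) -> bool:
    predicted = simulate(state, remedy)
    # Flag any link the remedy would push past 90% utilization (illustrative rule).
    risky = [link for link, util in predicted.items() if util > 0.9]
    if risky and not approve(risky):
        return False   # management override withheld; the remedy is blocked
    # ...the remedy would be applied to the live network here...
    return True

state = {"link-a": 0.55, "link-b": 0.60}
remedy = {"link-b": 0.95}   # a reroute that loads link-b heavily
applied = apply_with_gate(state, remedy, approve=lambda risks: False)
print("applied" if applied else "blocked pending operator review")
```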
The second thing, cited by a third of the experts, is a network digital twin. This group says that the biggest problem in management is the fact that things done to one element of a network invariably impact other elements, and so the nature of the relationships among elements has to be understood in order to properly respond to problems. This group almost universally believes in simulation, but also believes that it can only be done through the mechanism of a digital twin. A few in the group suggest that what's really needed isn't a twin but digital triplets: the real network, a current-state model of it, and a model of its behavior with changes applied. In normal operation the two digital models are kept synchronized, and when a remedy is developed for a problem, it is applied to the third model so its impact can be fully assessed. These experts also admit that a whole series of "remedy-specific twins" might be spawned to test various approaches and guide selection of the best one.
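A sketch of that "triplet" idea might look like this, with a current-state model kept in sync with the live network and remedy-specific copies cloned from it for assessment; the element states and names are, again, purely illustrative:

```python
import copy
from typing import Dict

class NetworkModel:
    """Minimal stand-in for a digital twin: element state keyed by element ID."""
    def __init__(self, state: Dict[str, str]):
        self.state = state

    def apply(self, changes: Dict[str, str]) -> None:
        self.state.update(changes)

def spawn_remedy_model(current: NetworkModel, remedy: Dict[str, str]) -> NetworkModel:
    """Clone the current-state model and apply a candidate remedy to the clone,
    so several remedy-specific twins can be compared before anything touches
    the real network."""
    candidate = NetworkModel(copy.deepcopy(current.state))
    candidate.apply(remedy)
    return candidate

# The current-state model, kept synchronized with live telemetry in practice.
current = NetworkModel({"rtr-1": "up", "rtr-2": "degraded", "link-12": "congested"})

# Two candidate remedies, each assessed on its own spawned twin.
reroute = spawn_remedy_model(current, {"link-12": "rerouted"})
restart = spawn_remedy_model(current, {"rtr-2": "restarting"})
print(reroute.state)
print(restart.state)
```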
What can we learn from this? First, most operators and enterprises are still dependent on the old models of network management. Operators are more likely than enterprises to recognize the device-network-service hierarchy in management (enterprises tend to focus on device management alone). Operators are also more likely than enterprises to view digital twinning and simulation as important.
Second, enterprises see AI as a singular solution rather than as a component of a solution. They believe that AI can somehow do what's needed if given an oversight role; they often don't even question whether their AI strategy actually covers their whole network or addresses all outage sources. Operators, by contrast, see AI in agent form, applied to specific vendors, products, or problems. Their challenge lies in controlling how these agents cooperate to address the real-world interdependence of network devices.
Even experts aren’t seeing a unification strategy for management, and that’s a problem. Networks are cooperative systems of devices, and there’s simply no way to effectively manage them without taking that interdependence into account. Are digital twins the right approach? Can AI eventually do that job for us? We don’t have an answer at this point as an industry, and we need one.
