Did we have it wrong when we said some things were “too big to fail”? The flood of network/Internet problems that have cropped up, including the recent Verizon mobile outage, raises, in the minds of some enterprises and consumers, and even some telcos, the question of whether networking is getting too big to succeed.
One enterprise network planner complained last week that “We’re approaching spending more on network security than on the network.” Another said “Our relationship with our customers depends on the Internet. Our workers depend on the Internet. And the Internet isn’t as dependable as it used to be.” You can make an argument that these are extreme views, for sure, but I think the frustration behind them is widely shared, and both real and justified.
Of 607 enterprises that commented on WAN reliability since January of 2025, 488 said that it was “noticeably worse” and 111 said it was “becoming a serious problem.” In my circle of personal acquaintances, probably four out of five (that’s a ratio, not a total!) think that the Internet today is more problematic than it was just a couple of years ago. If you look at the Cisco ThousandEyes Internet report for January 12-18, you see that ISP outages in the US were up 36% and cloud outages up 16%. The prior week showed increases of 50% and 54%, respectively. It’s not hard to see why people and companies are a bit concerned, and easy to see why they’d want answers. But can we give them? Only in part.
The big problem, according to my contacts in the telecom and ISP world, is the growing complexity of the thing we call “the network” or “the Internet”. Neither the public network, the telco network, nor the Internet is the simple set of point-to-point services that networks were decades ago. Today, all modern network services are “smart” to a degree, meaning that they contain elements of computer/software intelligence. They are also largely multi-layered composites that include hosted functionality only indirectly related to “connections”. All this additional stuff is essential if we’re to make networks more capable of supporting every mission and every classification of user, but all of it can break.
It’s a matter of compound probability. Suppose you have something that works 99% of the time. That’s not five-nines, but it’s pretty decent. Now suppose you have ten things, each with 99% reliability, that are essential in making up the service. The service is now only about 90% reliable. Every critical feature we add to a network, everything we can’t live without, is one more thing the service can’t live without, and one more reduction in the chances the service will work.
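To make that arithmetic concrete, here’s a minimal sketch (Python, purely illustrative, and assuming the components fail independently) of how reliability compounds when every component is essential:

```python
# Illustrative only: series reliability of a service that needs
# ALL of its components working, assuming independent failures.

def series_reliability(per_component: float, count: int) -> float:
    """Probability that 'count' components, each working with
    probability 'per_component', are all working at once."""
    return per_component ** count

print(series_reliability(0.99, 1))   # 0.99   -- one 99% component
print(series_reliability(0.99, 10))  # ~0.904 -- ten essential 99% components
```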
It gets worse, too. If a single critical thing fails (like the Cloudflare failure, or like the critical piece of mobile functionality that is supposed to have failed in the Verizon case), its failure can take down a lot of the network, and the network can stay down for a long time. Not seconds, but minutes or hours…and potentially days.
This is true at the high level, with features and layers, but also at the device level. The chance of all the devices in a 500-device network working at once, with five-nines availability per device, is about 99.5%; double the device count and the chance that something is down roughly doubles. That’s why you need things like alternate routing and sophisticated netops. Spread networks out, add devices, and even if all the devices are simple, the network is less likely to be fully functional.
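The same arithmetic, sketched below with an assumed five-nines availability per device and independent failures, shows how device count alone erodes the odds that everything is up at once:

```python
# Illustrative only: a network where every device must be up,
# assuming independent failures and five-nines availability per device.

FIVE_NINES = 0.99999

for devices in (500, 1000):
    all_up = FIVE_NINES ** devices
    print(f"{devices} devices: all up {all_up:.4f}, "
          f"at least one down {1 - all_up:.4f}")
# 500 devices:  all up ~0.9950 (about a 0.5% chance something is down)
# 1000 devices: all up ~0.9900 (the chance of a failure roughly doubles)
```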
If the networks of today were as simple as the networks that existed when I did my first data communications project, I wouldn’t be writing this for publication online and you’d not be reading anything that way either. Utility almost demands complexity.
Complexity demands components to build it from, and if every component is driven by software, then software issues will be a universal source of risk. Back in the days of “plain old telephone service” or POTS, the primary telephone switching software (called “generics”) was updated twice a year. Today, we have essential components of various layers of the network that are updated several times a week, and a few that (for some periods) are updated daily. Many enterprises say that it’s no wonder there are problems when things are changed at that pace, but of course we wouldn’t have rapid development technologies for software if applications didn’t also need to change at a blistering pace. And since “the network” is, to most workers and consumers, “everything not on my desk or in my hand”, all that stuff we connect with is often tarred with the network brush, and its failures become network failures.
The information age, the age we live in, is complicated overall. Technology dependence raises technology risk, as all forms of dependence do. The core of the problem, IMHO, goes back to the need for speed, combined with the need for software.
Back when I got into programming, a huge corporation that (as most did then) trained its own programmers found that roughly one person in fifty could be made into a programmer, even if you pre-selected by advertising for programmer trainees. The problem isn’t so much the “coding”, the writing of code, it’s designing software. Back then, software design and coding were usually a single job, but that’s not possible today. Software architects, people who can do the designing, are much rarer than programmers, but you need architects to plan the software or programmers have no instructions to work from, just as you can’t expect contractors to wing a skyscraper. The only way to avoid the pitfalls of complexity is planning, and the combination of a scarcity of architects and the need for speed has resulted in applications and network software that have outrun good design.
Look at mobile network standards today, and you see an approach that would appall most software architects. Part of that is because few software architects are involved in writing the standards, which makes no sense considering how much network/service functionality lies in software. But even the cloud giants, who today hire what’s surely a large percentage of the total base of software architects, are seeing more and more examples of uncontrolled complexity. What’s the solution?
We need to look not at vibe coding but at vibe architecting. We should be able to teach AI how to do that, but nobody has rushed to tell me about their projects to create an AI entity that can do software architecture. We also need to look at AI to architect networks, and even to suggest the architecture model for mobile standards. Planning ahead, which is what architecture mandates, is what’s needed to make sure that our efforts to make networks optimally useful to the most people don’t make them too sketchy for those people to rely on.
