If the biggest problem in information technology is complexity, could the biggest question be whether artificial intelligence in some form is the solution? We may have to answer that question as soon as this year, partly because the evolution of IT is bringing us to a critical point, and partly because the pandemic has raised some difficult issues.
In March, I asked a bunch of my enterprise operations friends what they believed the biggest sources of new operations challenges were. I asked them to generalize rather than cite something specific, but almost everyone gave me points of both kinds, so I’ll present the responses that way.
The biggest new source of challenges in a general sense was optimization of resources. The overall view was that it’s easy to run applications and support connectivity if you don’t care much about resource costs; you bury your problems in capacity. Ops types say that over the last decade, there’s been increasing pressure to do more for less, to rein in the relentless expansion of server and network capacity in order to contain costs. The result has been greater complexity, and more operations issues.
Take virtualization, the number one specific point ops types cite. We used to have a server, an operating system, and applications. Now we might have a server, a hypervisor to create virtual machines, deployment/orchestration tools, and an application. The result might be that we use servers more efficiently, but we’ve complicated what’s needed for an application to run. Complexity increases the management attention needed, the number of events to be handled, and the number of ways the pieces could be fit together (many of which won’t work).
You might wonder why so many companies would decide to trade higher opex for lower capex, even to the point where the costs might exceed the savings. According to operations professionals, the reason is that operations organizations tend to be staffed to handle peak-problem conditions and to ensure a very high level of availability and QoE. The additional issues created by resource optimization are often handled, at least until a peak-problem condition occurs. Line organizations might tolerate erosion in availability and QoE. There’s residual capacity in most operations organizations, a safety factor against a major issue. Companies are knowingly or unwittingly using it up.
They’re also turning to a strategy that’s creating our second biggest challenge, point-of-problem operations aids. When an operations organization is overloaded, the problems don’t all hit at once. There are almost always specific areas where they manifest first, and when that happens, even operations types think in terms of fixing the problem at hand. If your pipe is leaking, you don’t consider replumbing your home, you fix the leak. The result can be a lot of patched pipes, and the same thing is true in operations organizations.
The problem is that every new and specialized tool introduces more complexity. Point-of-problem technology is what the name suggests, a tool focused on a specific issue. There may be related issues, ones that cause the target issue or ones that the target issue can cause. There may be issues that look like the target issue but are really something else. All these issue relationships are outside the scope of a limited tool, and so they’re handled by human agents. Operations resource needs then increase.
On the compute side, operations personnel cite virtualization again as a specific example of point-of-problem tool explosion, and because containers are such a focus right now, they cite container operations as their biggest concern. Containers are a feature of Linux. Container orchestration is handled by Kubernetes, but you have to monitor things, so you need monitoring tools. Kubernetes may not be able to efficiently orchestrate across disparate data centers or hybrid/multi-cloud deployments without a federation tool, and you might need a special version for edge computing. What tends to happen as container interest grows is that bigger and bigger missions are selected for containers; those missions create new issues, the issues generate a need for new tools, and the next thing you know, you have half-a-dozen new software tools in play.
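To make that tool sprawl concrete, here’s a minimal sketch of just one of the add-on tools a container mission pulls in: watching pod lifecycle events with the official Kubernetes Python client. It assumes the kubernetes package is installed and a kubeconfig is reachable; the namespace is hypothetical, and a real shop would layer several of these monitors, each needing its own care and feeding.

```python
# Minimal sketch: watch pod lifecycle events in one namespace.
# Assumes the "kubernetes" Python client is installed and a kubeconfig exists.
from kubernetes import client, config, watch

def watch_pod_events(namespace="production"):   # namespace is hypothetical
    config.load_kube_config()   # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()
    w = watch.Watch()
    # Every ADDED/MODIFIED/DELETED event is one more thing operations must triage.
    for event in w.stream(core.list_namespaced_pod, namespace=namespace,
                          timeout_seconds=60):
        pod = event["object"]
        print(f'{event["type"]:10} {pod.metadata.name:40} phase={pod.status.phase}')

if __name__ == "__main__":
    watch_pod_events()
```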
A big part of the expansion in tools relates to new missions, and the new missions create the third of our operations issue drivers: new features and requirements create complexity. Most network and IT operations organizations have seen a continual expansion in the features and services they’re required to support, required because new technology missions have created new demands. Ops types here cite security as the biggest example.
One ops type in the network space, talking about using firewalls and access/forwarding control to protect critical applications and information, noted that line organizations were told they had to advise operations of changes in personnel, roles, and in some cases even workspaces, to keep the access control technology up to date. “Every day, somebody forgot. Every day, somebody lost access, or somebody got access to what they weren’t supposed to. They told me it was ‘too difficult’ to do what they were supposed to do.”
Security also generates a new layer of technology, and often several different and interlocking layers in both network and IT/software. All this means more complexity at the operations level, and because security crosses over between IT and network operations, there’s also a need for coordination between two different organizations.
The pandemic is an almost-tie for security in this particular area of operations complexity generation. Application access, information relationships to workers, and workflow coordination through a combination of human and automated processes have all been affected by the shift to work-from-home the virus forced. Not all companies are prepared to assume they’ll have to condition their operations to respond to another pandemic, but they do have to decide whether to improve their response ad hoc or to create a new and more responsive model of IT. Over the rest of this year, most say they’ll make their decision, and implement it in 2021. That should give us a target date for solving the operations challenges we’ve raised here.
The common element here is that more stuff means more management of stuff, and most everything going on in networking and computing is creating more stuff. This stuff explosion is behind the interest in what’s called “operations lifecycle automation”. We see the most refined version of this in some of the modern DevOps tools, particularly those that use the “declarative” goal-state model of design. These tools can accept events from the outside, and use them to trigger specific operations steps.
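To show what the goal-state idea looks like in miniature, here’s a hedged Python sketch (not any particular tool’s API): the desired state is declared, events update the observed state, and a reconciliation step produces whatever operations actions are needed to close the gap. The resource type and action names are hypothetical.

```python
# Toy illustration of declarative, goal-state operations:
# declare the goal, observe reality, reconcile the difference.
from dataclasses import dataclass

@dataclass
class Desired:
    replicas: int    # the declared goal state (hypothetical resource)

def reconcile(desired: Desired, observed_replicas: int) -> list:
    """Return the operations steps needed to move observed state toward the goal."""
    if observed_replicas < desired.replicas:
        return [f"scale_up:{desired.replicas - observed_replicas}"]
    if observed_replicas > desired.replicas:
        return [f"scale_down:{observed_replicas - desired.replicas}"]
    return []   # already at goal state; nothing to do

# An external event (say, "instance failed") just changes the observed state;
# the same reconcile() call produces the corrective steps.
print(reconcile(Desired(replicas=5), observed_replicas=3))   # ['scale_up:2']
```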
In networking, state/event tables can be combined with events (as the TMF did with its NGOSS Contract work) to steer conditions to the processes that handle them. The TMF SID and the OASIS TOSCA models both support at least a form of state/event handling. My ExperiaSphere project presents a fairly detailed architecture for using state/event-centric service modeling to automate a network/service lifecycle.
Given all of this, you might wonder why we don’t see much happening around operations lifecycle automation in the real world. Over 80% of my ops contacts say they don’t use it significantly, if at all. Why?
The problem is that operations lifecycle automation based on goal-state or state/event handling is what you could call anticipatory problem handling. A state/event table is an easy illustration of the problem. You have to define the various states your system/network might be in, all the events that would have to be handled, and how each event would be handled in each state. If we’d started our automation efforts when we first saw glimmerings of our management-of-stuff problem, we’d have been fine; our knowledge and the problem could have grown together. Retrofitting the solution onto the current situation demands a lot of the ops organization, not to mention the tools it relies on.
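Here’s a toy Python sketch of that anticipatory structure, with hypothetical states, events, and handlers. The point is that every (state, event) combination has to be enumerated in advance, and anything you didn’t foresee falls through to a human.

```python
# Anticipatory state/event handling: every (state, event) pair is mapped,
# in advance, to a handler process and a next state. All names are hypothetical.
def restore_service(ctx): print("dispatch fault-recovery process")
def log_only(ctx):        print("record event, no action")
def escalate(ctx):        print("hand off to a human operator")

STATE_EVENT_TABLE = {
    ("active",   "port_down"): (restore_service, "degraded"),
    ("active",   "threshold"): (log_only,        "active"),
    ("degraded", "port_up"):   (log_only,        "active"),
    ("degraded", "port_down"): (escalate,        "failed"),
}

def handle(state, event, ctx=None):
    try:
        process, next_state = STATE_EVENT_TABLE[(state, event)]
    except KeyError:
        # The weakness of anticipatory handling: anything not foreseen
        # falls through to a human.
        escalate(ctx)
        return state
    process(ctx)
    return next_state

state = handle("active", "port_down")   # -> "degraded"
```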
Proponents of AI say that what we need here is some fuzzy logic or neural network that does what all those human ops specialists are expected to do, which is analyze the “stuff” and then automatically take the necessary steps in response to conditions. Well, maybe, but none of the ops types I’ve talked with think that something like this could be created within three to five years. I tend to agree. We may, and probably will, reach a point in the future where machine judgment can equal or surpass human judgment in operations tasks, but we can’t wait for that to happen.
The real hope, the real possible benefit, of AI would be its ability to construct operations lifecycle automation policies by examining the real world. In other words, what we really need is machine learning. State/event processes, and even state/event automation of operations processes, are based on principles long used in areas like network protocols and are well understood. What’s needed is definition of the states and the events.
In order to apply ML to operations, we would need the ability to look at the network from the point where current operations processes are applied—the operations center. We’d have to be able to capture everything that operators see and do, and analyze it to correlate and optimize. It’s not a small task, but it’s a task that’s clearly within the capabilities of current ML software and practices. In other words, we could realistically hope to frame an ML solution to operations automation with what we have today, provided we could monitor from the operations center to obtain the necessary state and event information.
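As a hedged illustration of what that could look like, the sketch below uses scikit-learn to learn the mapping from what the operations center saw to what the operator did, from a handful of hypothetical log records. The log fields, feature encoding, and action labels are all invented for the example; a real capture would be far richer.

```python
# Sketch: mine operations-center history (what was seen, what was done)
# and learn the mapping. Assumes scikit-learn is available; data is hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Each record: the condition the operations center saw, and the step the operator took.
history = [
    ({"alarm": "link_down", "severity": 3, "site": "edge"}, "reroute_traffic"),
    ({"alarm": "cpu_high",  "severity": 2, "site": "core"}, "scale_out"),
    ({"alarm": "link_down", "severity": 3, "site": "core"}, "reroute_traffic"),
    ({"alarm": "disk_full", "severity": 1, "site": "edge"}, "open_ticket"),
]

vec = DictVectorizer()
X = vec.fit_transform([seen for seen, _ in history])
y = [did for _, did in history]

model = DecisionTreeClassifier().fit(X, y)

# A new condition observed at the operations center; the model proposes the
# step a human would most likely have taken.
new_event = {"alarm": "link_down", "severity": 3, "site": "edge"}
print(model.predict(vec.transform([new_event]))[0])   # e.g. 'reroute_traffic'
```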
This is where I think we should be focusing our attention. There is no way of knowing whether we could create a true AI “entity” that could replicate human operations judgment at any point in the near future. We could likely apply the ML solution in 2020 if the industry decided to put its collective efforts behind the initiative. ETSI’s Zero-touch network and Service Management (ZSM) group might be a place to do that, but we already have a term for the general application of ML principles to the real world—Machine Learning Operations, or MLOps—and vendors like Spell, IBM, and Iguazio have at least some capabilities in the space that could be exploited.
The biggest barrier to the ML solution is the multiplicity of computer and network environments that would have to be handled. It’s not going to be possible to devise a state/event structure that works for every data center, every network, as long as we have different events, states, and devices within them. Were we to devise a generalized network modeling framework based on intent-model principles, the framework itself would harmonize the external parameterization and event generation interfaces, define the events and states, and immediately present an opportunity for a universal solution.
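To suggest what such a framework might look like, here’s an illustrative Python sketch of an intent-model element that exposes uniform parameter, state, and event interfaces while hiding implementation detail inside the element. The class and method names are mine, not drawn from any standard.

```python
# Sketch of an intent-model element: one harmonized parameter/state/event
# interface regardless of what sits behind it. Names are illustrative only.
from abc import ABC, abstractmethod

class IntentModelElement(ABC):
    STATES = ("ordered", "active", "degraded", "failed")
    EVENTS = ("activate", "fault", "restore", "decommission")

    def __init__(self, name: str, parameters: dict):
        self.name = name
        self.parameters = parameters   # harmonized external parameterization
        self.state = "ordered"

    def report_event(self, event: str) -> None:
        """Single, uniform event interface exposed to higher-level automation."""
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self.state = self.on_event(event)

    @abstractmethod
    def on_event(self, event: str) -> str:
        """Implementation-specific behavior hidden inside the black box."""

class VpnService(IntentModelElement):   # hypothetical concrete element
    def on_event(self, event: str) -> str:
        return {"activate": "active", "fault": "degraded",
                "restore": "active", "decommission": "ordered"}.get(event, self.state)

svc = VpnService("branch-vpn", {"bandwidth_mbps": 100, "sla": "gold"})
svc.report_event("activate")
print(svc.state)   # 'active'
```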
Would it be better to wait for that? I don’t think so. I think we should try ML and MLOps on networks and IT as quickly as possible, so we don’t hit the wall on managing complexity down the line.