My blog on Cisco/Juniper announcements, and my earlier blog about the fall planning cycle for 2023 and its relationship to operations models, made me think about what kind of network and IT operations strategy might actually serve enterprises best. Only about a third of enterprises even say they have a unified model for managing network operations, and only 5% claim any unified network/IT model. Frankly, I doubt the real numbers are even half that.
It’s hard to say exactly why we’re having so little success with a unified operations view, but my feeling is that a big part of it is the fact that “operations” and “management” tend to focus on the old FCAPS (fault, configuration, accounting, performance, and security) acronym. When applied by technical teams, this tends to mean thinking mostly in terms of parameters and alerts. These are specialized to the infrastructure they’re associated with, meaning they’re separated by hosting versus network, software versus hardware, and so forth.
What I hear from enterprises, by a margin of almost 3:1, is that most of their operations activity is related to user support, meaning that rather than being “event-driven” in a technical sense, it’s complaint-driven. This, I think, is a consequence of the transformation from building infrastructure to combining services. Networking is now a combination of VPNs and SD-WAN (or SASE); computing is the cloud, the data center, and applications both self-developed and third-party. Users and employees aren’t able to parse those relationships well enough to make a specific complaint, so they tend to report in terms of what they can’t do. Think “I can’t get on the network” or “I can’t access my application”.
The same shift in how users/workers are supported impacts the operations staff. Ops professionals tell me that when they get a user complaint, their first step is to check the state of their resources to see whether any problems are currently detected. Many will admit that if there’s a known issue reasonably associated with the reported problem, there’s a strong tendency to say “We’re working on it.” To actually isolate the problem, these professionals will usually ask a few questions or check a few displays to perform first-level fault isolation. Is the problem one app or all apps? One user or everyone at a given location? They’ll use this top-down approach to align the likely problem with a specialist or management toolkit.
“Monitoring” and “management” are terms that in ops-talk tend to be associated with a resource-side approach. Almost 90% of line organizations and almost 40% of operations professionals believe a demand-side approach would map better to the complaint-driven way operations has evolved. This, I think, is the main driver behind the concept of “observability”. Yes, many say the goal of observability is to get actionable data, but most of the products are really focused on gathering application-side information, given that users are in almost all cases trying to use applications. If application performance could be tracked (“observed”), you should be able to link what you find to user quality of experience. If that linkage is done right, you should be able to isolate the cause to at least a specific service or element.
The goal of observability, wherever it’s started, is to get as detailed a picture as possible of the workflow associated with a user/application relationship. If we envision this workflow as a series of message hops that starts with a request, traces through network, application, hosting, database, and other fulfillment elements, and then returns as a response, what we’d like to do is track and time units of work along the flow to establish where something is being delayed, corrupted, or lost.
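The idea of timing units of work along a flow can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the hop names and handlers are hypothetical stand-ins for the network, application, and database stages described above.

```python
import time

def trace_workflow(request, hops):
    """Run a request through each hop in order, timing each unit of work.

    `hops` is an ordered list of (name, handler) stages; the timings list
    tells us where along the flow the delay is accumulating.
    """
    timings = []
    payload = request
    for name, handler in hops:
        start = time.perf_counter()
        payload = handler(payload)
        timings.append((name, time.perf_counter() - start))
    return payload, timings

# Hypothetical stages standing in for real fulfillment elements.
hops = [
    ("network", lambda p: p),            # transit: passes the work along
    ("application", lambda p: p.upper()),  # processing step
    ("database", lambda p: p + "!"),       # fulfillment step
]

response, timings = trace_workflow("request", hops)
slowest = max(timings, key=lambda t: t[1])  # the hop to investigate first
```

With per-hop timings in hand, isolating a delay becomes a matter of comparing each hop's share of the total against its normal baseline.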
The primary challenge to observability is visibility granularity within a given workflow. Imagine a user directly tied to a computer, running a monolithic application. Where can you intercept the work? The issues associated with probing for work detail have spawned a whole industry of software companies offering probe technology to introduce into software, creating work visibility (well, shall we say “observability”) where things would otherwise be opaque.
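The probe idea can be illustrated with a simple wrapper: existing functions are left alone, but each call is recorded with its duration. This is a hedged sketch of the general technique, assuming a hypothetical probe registry; it is not the API of any real probe product.

```python
import functools
import time

# Hypothetical registry where probe observations accumulate.
PROBE_LOG = []

def probe(fn):
    """Wrap a function so every call is timed and logged.

    This is the essence of introduced observability: the wrapped code
    is unchanged, but its work becomes visible from the outside.
    """
    @functools.wraps(fn)
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            PROBE_LOG.append((fn.__name__, time.perf_counter() - start))
    return wrapped

@probe
def fetch_record(key):
    # Stand-in for an otherwise opaque unit of work.
    return {"key": key}

result = fetch_record("user-42")
```

The same pattern, applied at scale by instrumentation agents, is what turns a monolith's internals from opaque to observable.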
This illustrates the challenge of observability, and also calls out two basic approaches to it, which link back to the first blog I referenced. You can go top-down, intercepting where a point is available, and by doing that you can at least get a handle on a lot of applications. You can also start at the bottom, working to create points where work can be monitored, and build upward to a system that incorporates all that data and draws insights from it. I’ve suggested that Splunk/Cisco took the first path, and that Juniper/Apstra took the second.
There’s another potential element in the Juniper arsenal if they choose to exploit it. Their acquisition of 128 Technology gave them what they now call “session-smart routing,” or SSR. Session awareness, meaning the ability to recognize a specific user/application (or application-to-application) relationship and apply policies to it, would create a linkage between what the network knows and what applications and users are doing. That linkage could then contribute analytics to the overall picture, enhancing observability.
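The notion of session awareness can be reduced to a small sketch: treat each user/application relationship as a session, and attach a policy to it. Everything here is an illustrative assumption (the session key, the policy names, the classification rule), not Juniper's SSR implementation.

```python
# Hypothetical session table keyed by (user, application).
sessions = {}

def classify_session(user, application):
    """Record a user/application relationship and attach a policy to it.

    Illustrative rule only: real-time applications get priority handling,
    everything else is best-effort.
    """
    key = (user, application)
    policy = "low-latency" if application in ("voice", "video") else "best-effort"
    sessions[key] = policy
    return policy

classify_session("alice", "voice")
classify_session("bob", "crm")
```

Once the network tracks sessions this way, the session table itself becomes an analytics source: it says which users are talking to which applications, and under what policy.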
I’m making that point not to pick a winner in the Cisco/Juniper face-off, or even to decide the issue of top-down versus bottom-up. My goal is to point out that the greatest challenge to observability is the very thing that’s been driving it at the positioning level: fragmentation and virtualization. We have broken up the fulfillment of user experiences into many different and disconnected pieces, so it’s hard to see the whole picture. We’ve created abstractions on which fulfillment of application and connection requirements depends, and by their very nature these abstractions hide things. QoE is the collective result of cooperative elements, and it depends on their actually cooperating. Since the range of “reasonable” behavior of any single element includes operating modes that may not represent cooperation, we need observability first and foremost to observe it all and report on the state of the whole. The winner in that space will not be the player who spins the most buzzwords and follows the current media hype; it will be the player who sees all, and knows all.