One of the most beneficial tech advances is virtualization. As the term is commonly used in both computing and networking, virtualization is a means of partitioning a resource in such a way that each user of the resource sees themselves as its only tenant. The result is lower costs with minimal (hopefully) QoE impact. Every enterprise I chat with says virtualization is a benefit, but three-quarters also acknowledge it creates operational challenges.
One obvious problem is that user experiences are essentially disconnected from reality. Years ago, an operations manager gave me a wry comment: “You can’t send a real tech to fix a virtual problem.” True, perhaps, but you do have to send someone, so what’s a “virtual tech” and what tools does the tech use? Good questions, but perhaps too abstract for operations types to deal with. Enterprises had their own, more practical, ones.
At or near the top was the simple challenge of complexity. Say you had a thousand networked servers running old-style monolithic applications. That in itself is complicated, but now suppose each was running four or five VMs, or hosting that number of containers. Then suppose the containerized applications were componentized, so each application was itself spread across multiple containers. Using this example, ops professionals said that if they’d gone from a thousand to four or five thousand physical servers they’d have expected a major operations center upgrade, but over two-thirds said the virtual explosion didn’t generate one.
Another top-level issue was ensuring that basic requirement I opened with: that virtualization be invisible to users. This means that not only must the initial resource partitioning isolate resource users sufficiently, but there must also be a way of ensuring that while individual users manage their apparently dedicated resources, they don’t interfere with others sharing the real resource. Interference here includes risks to both QoE and security.
Both these issues are compounded by a third: the interplay between flexibility/agility and interdependence. While virtualization facilitates resilience and scalability, realizing those benefits while retaining the efficiency benefit means managing what’s likely to become a highly variable workflow topology, and that means the demands on real resources vary with the way virtual ones are placed and used. The wider the scope of virtualization (network and compute, cloud and data center, and geographic), the more complex this interplay is to deal with.
OK, a lot of issues to consider. What do enterprises say about dealing with them? The first point that successful enterprises make is simple: you must separate the operations of real and virtual resources. Mingling responsibilities seems to create too much interdependence to deal with the three issues properly. The virtual team’s users are the real users, through the applications, so the QoE they’re managing is user-related. The real-resource team should not, enterprises say, deal with these users or applications in any way. Their “users” are the virtual team, and they don’t manage against QoE at all, but against the SLA, the guarantees they make to virtual operations.
This SLA-to-QoE disconnect, intentional and even essential though it may be, has to be bridged, and that’s the second point. Just as an enterprise and a service provider would monitor a service SLA at the access points, the virtual ops and real ops teams should monitor their mutual SLA, which means that the real ops team needs to present resource state in an SLA-centric way. But what, exactly, does that mean?
You could argue that an SLA should represent the state of the pool of resources, and that meeting the SLA would include responsibility for shifting resource commitments when a local resource failed. The problem with this is best explained in the domain of Kubernetes and containers. Kubernetes controls deployment, virtual-world operation, and also redeployment and scaling. Kubernetes is the primary tool in container operations, and it arguably both creates the virtual domain of hosting and manages virtual-network connectivity. It’s inescapable that virtual hosting operations would require using Kubernetes, so if we allow it to be used by a real-resource team we’ve broken operations separation. But many, perhaps even most, real-resource faults can be detected in the virtual domain too, and of course they could also be communicated between the two teams. Most enterprises who advocated split operations said they preferred having server or data center network faults detected in the virtual domain, with real-resource operations focusing on maintaining the gear.
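To make that split concrete, here’s a minimal sketch (Python, using the official kubernetes client, and assuming kubeconfig access to the cluster) of how a virtual-ops team might flag node conditions that point to an underlying real-resource fault and hand them off, rather than touching the gear themselves. The condition names are standard Kubernetes node conditions; the “hand-off” is just a print here.

```python
# A minimal sketch: detect likely real-resource faults from the virtual domain
# via Kubernetes node conditions, leaving hardware repair to real-resource ops.
from kubernetes import client, config

def failing_nodes():
    """Return nodes whose conditions suggest an underlying real-resource fault."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    suspects = []
    for node in v1.list_node().items:
        for cond in (node.status.conditions or []):
            # "Ready" should be True; pressure/unavailability conditions should be False.
            bad = (cond.type == "Ready" and cond.status != "True") or \
                  (cond.type in ("MemoryPressure", "DiskPressure", "NetworkUnavailable")
                   and cond.status == "True")
            if bad:
                suspects.append((node.metadata.name, cond.type, cond.status))
    return suspects

if __name__ == "__main__":
    for name, ctype, status in failing_nodes():
        print(f"node {name}: {ctype}={status} -> hand off to real-resource ops")
```

The point isn’t the code, it’s the boundary it draws: the virtual team reads Kubernetes state, the real-resource team gets a ticket, and neither side crosses into the other’s tools.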
In networking, things are similarly complicated, particularly when you consider things like Kubernetes’ use of virtual networks, SASE, SD-WAN, and VPN services, and the fact that remote sites and workgroups often use private IP addresses drawn from the same ranges as virtual networks. Address management in general is becoming its own issue set, to the point where many enterprises say they prefer to use a commercial virtual-network tool for both servers and workgroups, and to assign even virtual machines permanent addresses.
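The overlap problem is easy to illustrate. Here’s a minimal sketch using only Python’s standard ipaddress module of the kind of collision check this forces; the CIDR ranges below are hypothetical, though 10.244.0.0/16 happens to be a common default pod CIDR for some CNIs.

```python
# A minimal sketch: detect private address ranges that collide between
# real sites/workgroups and virtual networks. Ranges here are invented.
import ipaddress
from itertools import combinations

ranges = {
    "branch-office-lan": ipaddress.ip_network("10.244.0.0/22"),
    "k8s-pod-network":   ipaddress.ip_network("10.244.0.0/16"),  # common CNI default
    "remote-workgroup":  ipaddress.ip_network("192.168.1.0/24"),
}

for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2):
    if net_a.overlaps(net_b):
        print(f"collision: {name_a} ({net_a}) overlaps {name_b} ({net_b})")
```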
For enterprises, at least, the biggest networking challenge reported is understanding the address-to-entity relationship, which is critical in decoding reported QoE issues and also for many security tools. Where virtualization impacts this issue lies in how addresses are assigned and how the relationship to where-and-what (always a topic in network discussions) is maintained.
Almost half of enterprises say that their primary resource in identifying network issues is user complaints, but all say they also use telemetry from equipment. Linking those complaints with resource issues can be muddy when the user’s network address is hard to determine, or hard to match to a path to resources.
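What bridges that gap is an address-to-entity map, longest-prefix matched so a complaint’s source address resolves to the most specific known entity. The sketch below is hypothetical; in practice the table would be fed from IPAM, DHCP, the VPN, or the CNI rather than hard-coded.

```python
# A minimal sketch: resolve a reported source address to the user, site,
# or virtual entity it currently represents. Entries are invented examples.
import ipaddress

entity_map = [
    (ipaddress.ip_network("10.1.0.0/24"),   "HQ workgroup (real)"),
    (ipaddress.ip_network("10.244.3.0/24"), "order-entry pods (virtual)"),
    (ipaddress.ip_network("172.16.9.0/24"), "SD-WAN branch 9 (real)"),
]

def who_is(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    # Longest-prefix match so nested ranges resolve to the most specific entity.
    matches = [(net, entity) for net, entity in entity_map if ip in net]
    if not matches:
        return "unknown: cannot tie this complaint to an entity"
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(who_is("10.244.3.17"))  # -> order-entry pods (virtual)
```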
What I have to wonder is whether the problems here, and the entire issue of split operations, are the result of combining a revolution in virtualization with a simple evolution of operations practices. A small number of enterprises say that in their view the whole issue of problem determination and isolation belongs in the virtual domain “because that’s where the users and applications are.” I think, from the user perspective, that this is where things are and should be heading.
Interestingly, this is substantially the same group of enterprises who say that selecting a single VPN that allows fixed user address assignment (and, of course, fixed addresses for application hosts) is important. This suggests that this cadre of enterprises may have been exposed to the operational impacts of virtualization longer, or at a more intense level, though user accounts that would verify this can’t be identified.
This approach focuses one team on physical hardware and platform software, and it’s reported to work well for container users, but perhaps less so for enterprises who use VMs. For users of both types there’s still the question of operations tools, which tend to be specific to the VPN technology in use, even for SASE/SD-WAN. We seem to be converging on an approach, but with a way to go before we reach it.