One of the joke statements of the virtual age is “You can’t send a real tech to fix a virtual problem!” Underneath the joke is a serious question: what happens to test and measurement in a virtual world? Virtualization opens two issues: how do you test the virtual processes and flows, and how does virtualization impact T&M’s usual missions? We’ll look at both today.
Test and measurement (T&M) differs from “management” in that the latter focuses on the ongoing status of things and the reporting of changes in status. Management, in short, is status monitoring, not active examination of the network. T&M, in contrast, aims at supporting the “craft processes,” the human activity associated with taking a refined look at something (is it working, and how well?) with the presumptive goal of direct remediation.
Many people, including me, remember the days when resolving a network problem involved looking at a protocol trace, and that practice is a good place to start our exploration. Whether you have real or virtual devices, the data flows are still there and so are the issues of protocol exchanges. However, a virtual device is fundamentally different from a real one, and the differences have to be accommodated in any realistic model of T&M for the virtual age.
There’s an easy-to-see issue that we can start with. A real device has a location. A virtual device has one too, in the sense that it’s hosted somewhere, but the hosting location isn’t the same thing as the location of a box. A box is where it is; a virtual router instance is where it was convenient to put it. At the least, you’d have to determine where an instance was being hosted before you could run out and look at it. But that initial check of location isn’t enough in a virtual world. Imagine a tech en route to stick a monitor in a virtual router path, only to find that while they were traveling, the virtual router “moved.” It’s common to have a soft collision between management-driven changes in a network and remediation, but in the traditional world the boxes at least stay put. T&M in a virtual world has to deal with the risk that the instance moves while the tech is setting up or during the test itself.
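Here’s a minimal sketch, in Python, of what a pre-dispatch location check might look like. The OrchestratorClient class and its get_placement method are hypothetical stand-ins for whatever inventory or orchestration API an operator actually exposes (the backend is left unimplemented); the pattern of verifying stable placement before anyone rolls a truck is the point.

```python
# Minimal sketch of a pre-dispatch location check for a virtual device.
# OrchestratorClient and get_placement are hypothetical placeholders for
# a real inventory/orchestration API; only the logic pattern matters here.

import time

class OrchestratorClient:
    """Hypothetical inventory/orchestration API wrapper."""
    def get_placement(self, instance_id: str) -> str:
        """Return the host currently running the given instance."""
        raise NotImplementedError  # backend-specific

def wait_for_stable_placement(orch: OrchestratorClient, instance_id: str,
                              checks: int = 3, interval_s: float = 30.0) -> str:
    """Poll placement several times; dispatch only if the instance stays put."""
    host = orch.get_placement(instance_id)
    for _ in range(checks - 1):
        time.sleep(interval_s)
        current = orch.get_placement(instance_id)
        if current != host:
            raise RuntimeError(
                f"{instance_id} migrated from {host} to {current}; "
                "re-plan the dispatch or pin the instance first")
        host = current
    return host
```

A real system would go one step further and pin the instance (disable migration) for the duration of the test window, so the collision can’t happen mid-test.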
Simplicity is comforting even when things aren’t quite as simple as they look, but this simple point of “where is it?” isn’t the real problem. If software automation to improve opex is the goal of virtualization (which operators say it is), then we’d have to assume the goal is to move away from “T&M” toward “management,” since the former is, almost by definition, a human activity. That means that in the future, not only would it be more likely that a virtual router got moved, it would be likely that if there were a problem with it, the first response would be to simply replace it: an option that’s fine if you’re talking about a hosted software element but problematic if you’re dealing with a real box. So we’re really saying that virtualization first and foremost alters the balance between management and T&M.
When do you send a tech, or at least involve a tech? The only satisfactory answer in a time when opex reduction is key is “When you’ve exhausted all your other options.” One operator told me that their approach was something like this (it’s sketched in code after the list):
- If there’s a hard fault or an indication of improper operation, you re-instantiate and reroute the service as needed. It’s like saying that if your word processor is giving you a problem, save and reload it.
- If the re-instantiation doesn’t resolve things, you check whether there was any change to software versions in the virtual device or its platform whose timing seems possibly related to the issue. If so, you roll back to the last configuration that worked.
- If neither of these steps resolves things, or neither applies, then you have to try remediation. The operator says they’d first try to reroute or redeploy the service around the whole faulty function area and then try to recreate the problem in a lab under controlled conditions. If that wasn’t possible, they’d assume T&M was needed.
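Expressed as code, the ladder above might look like the sketch below. Every helper function is a hypothetical stub standing in for real orchestration calls; only the ordering of the steps comes from the operator’s description.

```python
# Sketch of the operator's escalation ladder. All helpers are hypothetical
# stubs for real orchestration calls; the ordering is what the operator
# described: automated remediation first, human T&M only when exhausted.

def reinstantiate(component): print(f"re-instantiating {component}")
def reroute(service, around): print(f"rerouting {service} around {around}")
def service_healthy(service): return False            # stub: still faulty
def recent_version_change(component, window_hours): return None  # stub
def rollback(component, to): print(f"rolling back {component}")
def recreate_in_lab(fault_zone): return False         # stub

def handle_fault(service, component, fault_zone):
    # Step 1: re-instantiate the faulty element and reroute the service.
    reinstantiate(component)
    reroute(service, around=component)
    if service_healthy(service):
        return "resolved by re-instantiation"

    # Step 2: if a recent software/platform change lines up in time with
    # the fault, roll back to the last configuration that worked.
    change = recent_version_change(component, window_hours=24)
    if change is not None:
        rollback(component, to=change)
        if service_healthy(service):
            return "resolved by rollback"

    # Step 3: route around the whole faulty function area and try to
    # recreate the problem in a lab under controlled conditions.
    reroute(service, around=fault_zone)
    if recreate_in_lab(fault_zone):
        return "diagnosed in lab"

    # Step 4: only now is manual T&M (and perhaps a dispatch) warranted.
    return "escalate to T&M"

print(handle_fault("vpn-42", "vrouter-7", "zone-east"))
```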
The same operator says that if we assumed a true virtual network, the goal would be to avoid dispatching a tech in favor of some kind of testing and monitoring from the network operations center (NOC). The RMON specification from the IETF can be implemented in most real or virtual devices, and there are still a few companies that use hardware or software probes of another kind. This raises the question of whether you could do T&M in a virtual world using virtual monitoring and test injection, which would eliminate the need to dispatch someone to hook up an analyzer. A “real” dispatch would be needed only if there were a hardware failure of some sort on site, or a situation where a manual rewiring of the network connections of a device or server was needed.
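Because RMON rides on SNMP, a NOC can poll RMON counters on a real or virtual device with ordinary SNMP tooling. Here’s a small sketch using the pysnmp library; the target address and community string are placeholders, and it assumes the RMON-MIB module (RFC 2819) is available to pysnmp’s MIB resolver.

```python
# Sketch: poll RMON Ethernet statistics from the NOC over SNMPv2c.
# 192.0.2.10 and 'public' are placeholders; the device (real or virtual)
# must implement the RMON-MIB etherStatsTable for row 1 to exist.

from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

errorIndication, errorStatus, errorIndex, varBinds = next(getCmd(
    SnmpEngine(),
    CommunityData('public', mpModel=1),          # SNMPv2c
    UdpTransportTarget(('192.0.2.10', 161)),
    ContextData(),
    # Packet and octet counters for etherStats table row 1
    ObjectType(ObjectIdentity('RMON-MIB', 'etherStatsPkts', 1)),
    ObjectType(ObjectIdentity('RMON-MIB', 'etherStatsOctets', 1))))

if errorIndication:
    print(errorIndication)
elif errorStatus:
    print(f"{errorStatus.prettyPrint()} at index {errorIndex}")
else:
    for varBind in varBinds:
        print(' = '.join(x.prettyPrint() for x in varBind))
```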
One advantage of the virtual world is that you could instantiate a monitoring point as software somewhere convenient, and either connect it to a “T” you kept in place at specific locations, or cross-connect by rerouting. The one issue with this approach is the same one you can run into with remote monitoring today: the delay between the point where the flow is tapped and the point where the monitoring is viewed. However, if you aren’t doing test injection at the monitoring point, the impact should be minimal, and if you are, then you’d need a more sophisticated remote probe to install, one that can execute pre-loaded responses to triggers locally.
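A sketch of that “more sophisticated remote probe” idea follows: the NOC loads trigger/response rules in advance, and the probe matches frames and injects responses locally, so the tap-to-viewer delay never sits inside the test loop. Frame matching and injection are deliberately reduced to byte-prefix checks and a callback here; a real probe would hook a capture interface instead.

```python
# Sketch of a remote probe that answers triggers locally. The NOC pushes
# TriggerRule entries ahead of time; captured frames are logged for
# delay-tolerant viewing, while matching frames get an immediate local
# response with no NOC round trip. Matching is simplified to byte prefixes.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TriggerRule:
    match_prefix: bytes      # fire when a captured frame starts with this
    response: bytes          # frame to inject locally

class LocalProbe:
    def __init__(self, rules: List[TriggerRule], inject: Callable[[bytes], None]):
        self.rules = rules
        self.inject = inject
        self.log: List[bytes] = []   # shipped to the NOC asynchronously

    def on_frame(self, frame: bytes) -> None:
        self.log.append(frame)       # monitoring path: delay-tolerant
        for rule in self.rules:      # test-injection path: delay-sensitive
            if frame.startswith(rule.match_prefix):
                self.inject(rule.response)

# Usage: respond to a (hypothetical) echo-request pattern locally.
sent = []
probe = LocalProbe([TriggerRule(b'\x08\x00', b'\x00\x00')], sent.append)
probe.on_frame(b'\x08\x00payload')
print(sent)   # [b'\x00\x00'] -- injected without consulting the NOC
```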
Another aspect of “virtual T&M” is applying T&M to the control APIs and exchanges associated with SDN or NFV. This has been a topic of interest for many of the T&M vendors, and certainly the failure of a control or management path in SDN or NFV could present a major problem. Operators, in fact, are somewhat more likely to think they need specialized T&M support for control/management exchanges in SDN and NFV than in the service data path. That’s because of expected issues with integration among the elements at the control/management protocol level.
Most of the technology and strategy behind virtual T&M is the same whether we’re talking about the data path or the control/management plane. However, there are profound issues of security and stability associated with any monitoring or (in particular) active intervention in control/management activity. We would have to assume that T&M would live inside the same security sandbox as things like an SDN controller or NFV MANO, to ensure nothing was done to compromise the mass of users and services that could be represented.
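One way to picture that sandbox is as a gate every T&M call to the controller must pass through: passive reads flow freely and are audited, while active intervention is refused unless it carries an explicit grant. The controller client and authorization scheme below are hypothetical; only the observe-freely, intervene-under-policy pattern is the point.

```python
# Sketch of T&M living inside the controller's security sandbox: every
# control/management call is recorded, and writes (active intervention)
# are blocked unless explicitly authorized. Client and grant scheme are
# hypothetical stand-ins, not any real controller's API.

from datetime import datetime, timezone

WRITE_METHODS = {"install_flow", "delete_flow"}   # active intervention

class SandboxedTap:
    def __init__(self, controller, authorized_writes: set):
        self.controller = controller
        self.authorized_writes = authorized_writes
        self.audit = []

    def call(self, method: str, *args):
        self.audit.append((datetime.now(timezone.utc).isoformat(), method, args))
        if method in WRITE_METHODS and method not in self.authorized_writes:
            raise PermissionError(f"{method} blocked: outside T&M sandbox grant")
        return getattr(self.controller, method)(*args)

class FakeController:  # stand-in for a real SDN controller client
    def get_flows(self, switch): return [f"flow-on-{switch}"]
    def install_flow(self, switch, match): return "ok"

tap = SandboxedTap(FakeController(), authorized_writes=set())
print(tap.call("get_flows", "sw1"))        # passive monitoring allowed
try:
    tap.call("install_flow", "sw1", "*")   # active intervention refused
except PermissionError as e:
    print(e)
print(tap.audit)                           # full record for later review
```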
Overall, the biggest impact of virtualization trends on T&M is that a central goal of virtualization is service lifecycle automation. If that’s taken seriously, then more of what T&M does today would migrate into a management function that generated events to drive software processes, not technicians. In addition, the T&M processes related to device testing are probably far less relevant in an age where the device is virtual and can be re-instantiated on demand. But virtualization also lets T&M create what is in effect a virtual technician, because it lets you push a probe and test generator anywhere they’re needed. Will the net be positive or negative? I think that will depend on how vendors respond to the challenge.