Back in the days of the public switched telephone network, everyone understood what “signaling” was. We had an explicit signaling network, SS7, that mediated how resources were applied to calls and managed the progression of connections through the hierarchy of switches. The notion of signaling changed with IP networks, and I’m now hearing from operators that it changes even more when you add in things like SDN and NFV. I’m also hearing that we’ve perhaps failed to recognize just what those changes could mean.
You could argue that the major difference between IP networks and the old circuit-switched networks is adaptive routing. In the traditional PSTN we had routing tables that expressed where connections were supposed to be routed. In IP networks, we replaced that with routing tables that were built and maintained by adaptive topology discovery. Simply stated, nodes told each other who they could reach and how good the path was.
The big advantage of adaptive routing is that it adapts, meaning that issues with a node or a trunk connection can be accommodated because the nodes will discover a better path. This takes time, to be sure, for what we call “convergence”, meaning a collective understanding of the new topology. The convergence time is a period of disorder, and the more complicated convergence is, the longer that disorder lasts.
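To make the idea of convergence concrete, here is a minimal sketch of nodes telling each other who they can reach and at what cost, in the style of a distance-vector exchange. The function name `converge`, the four-node topology, and the link costs are all made up for illustration, and the sketch rebuilds the tables from scratch after a failure rather than modeling how real routing protocols deal with stale routes.

```python
def converge(links, max_rounds=50):
    """Run synchronous advertisement rounds until no table changes.

    links: dict mapping (node_a, node_b) -> cost of that bidirectional trunk.
    Returns (tables, rounds), where tables[n][t] is n's best known cost to t
    and rounds is the number of rounds in which some table still changed.
    """
    nodes = sorted({n for link in links for n in link})
    # Each node starts out knowing only how to reach itself.
    tables = {n: {n: 0} for n in nodes}
    for round_no in range(1, max_rounds + 1):
        changed = False
        new_tables = {n: dict(t) for n, t in tables.items()}
        for (a, b), cost in links.items():
            for src, dst in ((a, b), (b, a)):
                # src advertises last round's table to its neighbor dst.
                for target, dist in tables[src].items():
                    candidate = dist + cost
                    if candidate < new_tables[dst].get(target, float("inf")):
                        new_tables[dst][target] = candidate
                        changed = True
        tables = new_tables
        if not changed:
            return tables, round_no - 1
    return tables, max_rounds


# Hypothetical topology: a cheap path A-B-C-D and an expensive direct trunk A-D.
healthy = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 1, ("A", "D"): 5}
tables, rounds = converge(healthy)
print("A reaches D at cost", tables["A"]["D"], "after", rounds, "rounds")

# Trunk C-D fails; the remaining nodes have to settle on a new set of paths.
degraded = {link: cost for link, cost in healthy.items() if link != ("C", "D")}
tables_after, rounds_after = converge(degraded)
print("A reaches D at cost", tables_after["A"]["D"], "after", rounds_after, "rounds")
```

Every round in which a table is still changing is a round in which some node is forwarding on an out-of-date view, which is the period of disorder described above.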
SDN sought to replace adaptive routing with predetermined, centrally managed routes. The process whereby this determination and management happens is likely not the same as it was for the PSTN, but at least one of the goals was to do a better job of quickly settling on a new topology map that efficiently managed the remaining capacity. The same SDN central processes could also be used to put the network into one of several operating modes that were designed to accommodate special traffic conditions. A great idea, right?
Yes, but. What some operators have found is that SDN has implicitly reinvented the notion of signaling, and because the invention was implicit and not explicit they’re finding that the signaling model for SDN isn’t fully baked.
Some operators and even some vendors say NFV has many of the same issues. A traffic path and a set of feature hosts are assembled in NFV to replace discrete devices. The process of assembling these parts and stitching the connections, and the process of scaling and sustaining the operation of all the pieces, happens “inside” what appears at the service level to be a discrete device. That interior stuff is really signaling too, and like SDN signaling it’s not been a focus of attention.
It’s now becoming a focus, because when you try to build extensive SDN topologies that span more than a single data center, or when you build an NFV service over a practical customer topology, you encounter several key issues. Most can be attributed to the fact that both SDN and NFV depend on a kind of “out of band” (meaning outside the service data plane) connectivity. Where does that come from?
SDN’s issue is easily recognized. Say we have a hundred white-box nodes. Each of these nodes has to have a connection to the SDN controller to make requests for a route (in the “stimulus” model of SDN) and to receive forwarding table updates (in either model). What creates that connection? If the connection is forwarded through other white-boxes, creating what would be called a “signaling network”, then the SDN controller also has to maintain its signaling paths. But if something breaks along such a path and the path is lost, how does the controller reach the nodes to tell them how the new topology is to be handled? You can, in theory, define failure modes for nodes, but you then have to ensure that all the impacted nodes know that they’re supposed to transition to such a mode.
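A rough sketch of the in-band case may help here. It assumes the controller reaches every white box over the same links that carry service traffic, and asks which nodes lose their control connection when one link fails. The node names, the topology, and the helper names `reachable_from` and `stranded_nodes` are hypothetical; real controllers and control channels are considerably more involved.

```python
from collections import deque


def reachable_from(controller, links):
    """Return the set of nodes the controller can still reach over links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {controller}
    queue = deque([controller])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def stranded_nodes(controller, nodes, links, failed_link):
    """Nodes that cannot receive forwarding updates once failed_link is down."""
    surviving = [link for link in links if set(link) != set(failed_link)]
    return set(nodes) - reachable_from(controller, surviving)


# Hypothetical in-band topology: the controller hangs off n1.
nodes = ["n1", "n2", "n3", "n4"]
links = [("ctl", "n1"), ("n1", "n2"), ("n2", "n3"), ("n1", "n4")]

# Losing n1-n2 cuts off n2 and n3 from the controller entirely.
cut_off = stranded_nodes("ctl", nodes, links, ("n1", "n2"))
print(sorted(cut_off))

# With one extra link the same failure strands nobody, but only if the
# nodes along the alternate path already know how to forward control traffic.
redundant = links + [("n3", "n4")]
still_cut_off = stranded_nodes("ctl", nodes, redundant, ("n1", "n2"))
print(sorted(still_cut_off))
```

The second case is the catch: the alternate path exists physically, but someone has to have programmed it before the failure, because the controller cannot reach n2 or n3 to program it afterward.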
In NFV, the problem is harder to explain in simple terms and it’s also multifaceted. Suppose you have to scale out, meaning instantiate a new copy of a VNF to absorb additional load. You have to spin up a new VNF somewhere, which means you need to signal a deployment of a VNF in a data center. You also have to connect it into the data path, which might mean spinning up another VNF that acts as a load-balancer. In NFV, if we’re to maintain security of the operations processes, we can’t expose the deployment and connection facilities to the tenant service data paths or they could be hacked. Where are they then? Like SS7, they are presumably an independent network. Who builds it, with what, and what happens if that separate network breaks? Now you can’t manage what you’ve deployed.
I opened this blog with a comment on SS7 because one EU operator expert explained the problem by saying “We’re finding that we need an SS7 network for virtualization.” The fact is that we do, and a collateral fact is that we’ve not been thinking about it. That means we are expecting a control process that manages resources and connectivity to operate using the same resources it’s managing, and never lose touch with all the pieces. If that were practical we’d never have faults to manage.
The signaling issue has a direct bearing on a lot of the SDN and NFV reliability/availability approaches. You don’t need five-nines devices with virtualization, so it’s said, because you can replace a broken part dynamically. Yes, but only if you can contact the control processes that do the replacing and then reconnect everything. To me, that means that if you want to accept the dynamic replacement approach to availability management, you need to have a high-reliability signaling network to replace the static-high-availability-device approach.
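A back-of-the-envelope calculation shows how much the dynamic-replacement argument leans on signaling. The sketch below uses the standard steady-state relation availability = MTBF / (MTBF + MTTR) and assumes repair is fast when the signaling path is up and slow (manual) when it is not. All the numbers are invented for illustration, and the model assumes signaling failures are independent of the faults being repaired, which is optimistic given that the signaling may ride on the same resources.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_mttr(signaling_availability, auto_repair_hours, manual_repair_hours):
    """Average repair time when automation only works if signaling is reachable."""
    return (signaling_availability * auto_repair_hours
            + (1 - signaling_availability) * manual_repair_hours)


# Illustrative assumptions only: a commodity host that fails every 1,000 hours,
# a 36-second automated replacement, and a 4-hour manual intervention.
MTBF = 1000.0
AUTO = 0.01
MANUAL = 4.0

results = {}
for signaling in (1.0, 0.999, 0.9):
    mttr = expected_mttr(signaling, AUTO, MANUAL)
    results[signaling] = availability(MTBF, mttr)
    print(f"signaling {signaling:.3f} -> service availability {results[signaling]:.6f}")
```

Under these assumptions the service availability is capped by how often the control processes can actually be reached, not by how quickly they act once reached.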
Even the operators who say they’ve seen the early signs of the signaling issue say that they see only hints today because of the limited scope of SDN and NFV deployments. We are doing trials in very contained service geographies, with very small resource pools, and with limited service objectives. Even there, we’re running into situations where a network condition cuts the signaling connections and prevents a managed response. What’s going to happen when we spread out our service deployments?
I think SDN and NFV signaling is a major issue, and I know that some operators are seeing signs of that too. I also think that there are remedies, but in order to apply them we have to take a much higher-level view of virtualization and think more about how we provide the mediation and coordination needed to manage any distributed system. Before we get too far into software architectures, testing, and deployment we should be looking into how to solve the signaling problem.