IBM just had a major cloud outage, and it certainly won’t help the company’s efforts to become the “cloud provider to the corporate giants” or maximize the value of its Red Hat acquisition. It also raises questions about whether there are complexity issues in cloud-building, and in cloud networking, that we may have glossed over.
The problem, according to a number of sources (like THIS one), was caused by an external service provider (an ISP) that flooded IBM’s routers with an incorrect BGP update. The source isn’t identified, nor has it been revealed whether misbehavior is suspected or just sloppy management. The result was not only a widespread failure of IBM cloud services, but also the failure of many other sites and services that depended on something hosted in IBM’s cloud.
BGP’s vulnerability to hijacking is legendary, with its roots in the foundation of the Internet. From the first, Internet protocol designers presumed a “trust” relationship among the various operators involved. There’s minimal security built in, reflecting the fact that the early Internet was a network of universities and research labs that could all be trusted. We don’t have that situation today, and that makes the Internet vulnerable, not only at the level we often talk about—where users connect—but in the internals of the Internet.
The specific problem with routing protocols is that when a router advertises a route, there’s a presumption that it really has that route and is presenting it correctly. There’s also a presumption that it will advertise routing changes only when routes are actually changing. If a router either advertises a “false” route or issues a burst of updates that don’t reflect real changes in the network, the result can be congestion in route processing, false paths, and lost packets. In the extreme case, you can get a complete collapse that could take hours (or even days) to sort out.
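To make the two failure modes concrete, here’s a toy sketch of my own (not how any production router is coded): a BGP-style speaker that, instead of trusting everything, checks advertisements against a table of authorized origins and rate-limits update floods. The authorized-origin table plays roughly the role RPKI route-origin validation plays in real deployments, and the rate limit is a crude stand-in for flap damping; all AS numbers and prefixes are invented.

```python
import ipaddress
import time
from collections import defaultdict

# Hypothetical table of authorized origins, analogous to RPKI ROAs:
# prefix -> set of AS numbers allowed to originate it.
AUTHORIZED_ORIGINS = {
    ipaddress.ip_network("198.51.100.0/24"): {64500},
}

# Crude per-peer rate limit, a stand-in for flap damping.
MAX_UPDATES_PER_MINUTE = 100
update_times = defaultdict(list)

def accept_update(peer_as, prefix, origin_as):
    """Return True if this BGP-style update should be accepted."""
    prefix = ipaddress.ip_network(prefix)

    # 1. Reject peers that are flooding updates. A purely trusting
    #    router would process every one of them.
    now = time.time()
    recent = [t for t in update_times[peer_as] if now - t < 60]
    if len(recent) >= MAX_UPDATES_PER_MINUTE:
        return False
    recent.append(now)
    update_times[peer_as] = recent

    # 2. Reject routes whose origin AS isn't authorized for the prefix.
    #    The classic trust model would have accepted the "false" route.
    allowed = AUTHORIZED_ORIGINS.get(prefix)
    if allowed is not None and origin_as not in allowed:
        return False

    return True

# A hijack attempt: AS 64666 claims a prefix that belongs to AS 64500.
print(accept_update(peer_as=64999, prefix="198.51.100.0/24", origin_as=64666))  # False
```

Real-world origin validation (RPKI) works along these lines, though its deployment is still far from universal.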
That we should be trying to eliminate a problem with literally decades of history behind it goes without saying, but since it has been a “legendary” problem, we’ve clearly let it go too long. These days, there are plenty of bad actors out there, and BGP vulnerabilities are demonstrably a good way for one of them to attack Internet and cloud infrastructure. There’s little point in my adding my voice to the chorus of the unheard on the topic, so let’s move on instead to the cloud itself.
Cloud security is typically taken to mean the security of applications and data in the cloud, not the security of the cloud itself. There’s a profound difference, and as it happens the difference is most critical in the area of “carrier cloud”, and in particular in carrier-cloud hosting of virtual network functions.
A “network” is a tri-plane structure: data, control, and management. In traditional IP, all three of these planes exist within the same address space, meaning there’s a common IP network carrying all these traffic types. When we move from a device-centric model of networking to a virtual-function model, we add a fourth layer, which I’ll call the “mapping” layer. Functions have to be mapped to hosting resources, and the three planes of traffic that normally pass between devices must now be mapped onto real connectivity. There’s a tendency to think of this mapping process as an extension of the normal management layer, and that can be the start of something ugly.
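As a toy illustration (all names and addresses invented) of what the mapping layer adds: in a device network, a router simply is a box; in a virtual-function network, something has to decide which host the function runs on and record that binding before any of the three traditional planes can be wired up between functions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Host:
    name: str
    mgmt_addr: str      # management-plane address
    service_addr: str   # data/control-plane address

@dataclass
class VirtualFunction:
    name: str
    host: Optional[Host] = None  # unmapped until the mapping layer acts

def map_function(vnf: VirtualFunction, host: Host) -> None:
    """The mapping layer's job: bind a virtual function to a hosting
    resource; only then can data/control/management paths be stitched
    between functions the way they'd run between physical devices."""
    vnf.host = host

router = VirtualFunction("virtual-router-1")
map_function(router, Host("server-42", "10.255.0.42", "203.0.113.42"))
print(router)
```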
If the mapping-layer elements of a virtual network (in NFV terms, things like the Virtual Infrastructure Manager, Management and Orchestration, and the VNF Manager) are extensions of the management domain of the services, then all these elements are addressable from the common IP address space. I have a device management port (say, SNMP) on a “virtual router”, and I likewise have a VIM port that lets me commit infrastructure. I can send an SNMP packet to the former, and presumably (if I have the address of the VIM) I can commit resources through the latter. Except that unless I’m the network operator, I shouldn’t be able to address the VIM ports at all. If I can, I can attack them from the services themselves.
Securing mapping-layer resources can’t be treated as a job for access control or encryption (like HTTPS), either. Encryption can prevent me from actually controlling mapping-layer features, but I can still pound them with connection requests, the classic denial-of-service attack. What I need to do is completely isolate the mapping layer from the service layer. They shouldn’t share an address space at all.
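Here’s a minimal sketch of what “not sharing an address space” means in practice; the port number is hypothetical, and loopback stands in for the management network so the sketch runs anywhere. The point is the contrast: a control API bound to every interface on its host versus one bound only to a management-network address. In the second case the port simply doesn’t exist in the service address space, so there’s nothing there to flood.

```python
import socket

# Hypothetical management-network address; 127.0.0.1 stands in here so
# the sketch runs anywhere. In practice this would be an address on a
# separate, operator-only network (or network namespace).
MGMT_NET_ADDR = "127.0.0.1"
CONTROL_PORT = 8443

def open_control_port(bind_addr: str, port: int = CONTROL_PORT) -> socket.socket:
    """Open a listening socket for a mapping-layer control API."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((bind_addr, port))
    s.listen()
    return s

# The dangerous default: binding to 0.0.0.0 exposes the control port in
# every address space the host touches, including the service network.
# TLS wouldn't help; anyone who can reach it can flood it with connects.
# control = open_control_port("0.0.0.0")

# The isolated version: reachable only via the management network.
control = open_control_port(MGMT_NET_ADDR)
print(f"control API listening on {control.getsockname()}")
control.close()
```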
The biggest unanswered question in cloud computing today is whether we have adequate separation between the service layer of customers, and the “mapping layer” where cloud administration lives. Without it, we have the risk of an attack on cloud administration, which obviously puts everything in the cloud at risk.
Even private cloud users, including container/Kubernetes users, should be thinking about the security of their mapping processes, meaning in this case Kubernetes’ own control ports. Kubernetes generally assumes that all its pods, nodes, and so forth live within a single address space, and that this address space is shared by everything, so every pod can network with every other one. Control ports are thus visible in that address space. This might be a good time to think about having separate address spaces for the control ports.
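Short of separate address spaces, a NetworkPolicy can at least remove control ports from the reach of ordinary workloads. A sketch using the Kubernetes Python client, assuming a (hypothetical) cluster where mapping-layer components run as pods labeled role: control, operators work from a namespace labeled team: platform-admin, and those components live in an “infrastructure” namespace; all three names are my invention.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

# Deny all ingress to control-component pods except from the admin namespace.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="isolate-control-ports"),
    spec=client.V1NetworkPolicySpec(
        # Applies to pods carrying the (hypothetical) control-component label.
        pod_selector=client.V1LabelSelector(match_labels={"role": "control"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        namespace_selector=client.V1LabelSelector(
                            match_labels={"team": "platform-admin"}
                        )
                    )
                ]
            )
        ],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="infrastructure", body=policy
)
```

Note that a NetworkPolicy is only enforced if the cluster’s CNI plugin supports it, and it does nothing for the API server endpoint itself, which sits outside the pod network in most clusters.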
Another thing we need to think about is the risk associated with cloud-hosted components of applications or services. There’s a general view among cloud users that dividing work among multiple cloud providers improves reliability, but unless you can redeploy across them, the opposite is true. At the fundamental level, an application that’s spread across, say, three different hosting points (two public clouds and a data center) needs all three to be working, which means its reliability is lower than it would be if everything were hosted in one place. For our three-way system to be more reliable, I need to be able to redeploy between the hosting points to respond to the failure of one. If you hosted part of your application in the IBM cloud and another part in your data center, then absent this redeployment, you would have a total application failure if the cloud failed (and also if your data center failed).
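The arithmetic is worth making explicit. A back-of-the-envelope sketch, where the 99.9% availability figure is just an assumption for illustration:

```python
# Availability arithmetic for a three-way split (two clouds + a data center).
a = 0.999  # assumed availability of each hosting point

# Without redeployment, the application needs ALL THREE points up, so
# availability is the product of the three: worse than any single point.
series = a ** 3
print(f"all-three-required: {series:.4%}")   # ~99.70%

# With redeployment, the application survives as long as ANY point is up.
parallel = 1 - (1 - a) ** 3
print(f"any-one-suffices:  {parallel:.7%}")  # ~99.9999999%
```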
You also need to think about how you control your multi-cloud or hybrid cloud. If you’re not careful about how your orchestration/management tools are addressed, you may end up putting your control ports on your overall VPN, which makes them even more exposed. Container/Kubernetes systems use Linux namespaces, so the applications have some addressing limitations built in. Mix in non-container, non-Linux pieces and you may have no control over that critical mapping layer at all.
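A minimal audit along these lines takes only a few lines; the addresses below are made up, and the point is simply to know whether any mapping-layer endpoint has ended up inside the address range your VPN exposes to user services.

```python
import ipaddress

# Hypothetical address plan: what user services can reach over the VPN.
VPN_RANGES = [ipaddress.ip_network("10.0.0.0/16")]

# Control endpoints for the mapping layer (addresses are made up).
CONTROL_ENDPOINTS = {
    "kubernetes-api": "10.0.12.5",    # oops: inside the user VPN
    "orchestrator":   "172.20.0.8",   # on a separate management network
}

for name, addr in CONTROL_ENDPOINTS.items():
    ip = ipaddress.ip_address(addr)
    exposed = any(ip in net for net in VPN_RANGES)
    status = "EXPOSED on user VPN" if exposed else "isolated"
    print(f"{name:15s} {addr:12s} {status}")
```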
Virtualization of any kind increases the “threat surface” of infrastructure; the more layers you have, the more there is to attack. These threats are exacerbated if care isn’t taken to separate the control network components from user services, or from anything that’s shared among multiple parties. Finally, if you can’t certify mapping-layer components and even virtual functions for security, you’ll never secure anything. If we’re not careful, we could be creating a major problem down the line.