There are a lot of interesting competitive battles fought outside public view, and one of them relates to the network technology to be used in connecting AI systems. NVIDIA, whose GPUs form the foundation of most major AI services, is an advocate of InfiniBand, and you could argue that this is either why they bought Mellanox (who makes adapters for the technology) or that they’re now advocating InfiniBand to justify this M&A. Recently, Ethernet advocates and the Linux Foundation launched an “Ultra Ethernet” initiative to “adapt and refine” Ethernet for AI and other high-performance computing applications. The new initiative is launching into swirling marketing and technical claims.
InfiniBand isn’t a new technology; it’s been around for a couple of decades, and it was originally designed for local I/O connections (it’s a fusion of Future I/O and Next Generation I/O, or NGIO). It was much faster than Ethernet when it came along, but Ethernet has largely caught up these days. It’s also a low-latency protocol, and that’s why it got a head start in AI applications, where latency could add up and increase solution times excessively. However, Ethernet is typically less expensive and it’s being continually developed. For many (myself included), Ethernet is a much smarter choice, and its popularity would seem to make it a slam dunk.
What’s at stake here isn’t so much the sort of Ethernet applications we’ve seen evolving over three or so decades. Ultra Ethernet Transport (UET) aims at cluster networking, which for AI and HPC is essentially an evolution of data center networking. If we assume that generative AI is more than an enormous hype wave doomed to break on the rocks of business-case scarcity, then a big share of “hyperscaler” data centers is likely to start building local networks to support AI. Since InfiniBand has NVIDIA’s support, that means it could well win, and vendors like AMD, Broadcom, Cisco, Intel, Juniper, and others are very likely to lose out because their data center strategies are built around Ethernet. Some of these are founding members of the new group.
There’s nothing wrong with a healthy and cynical driver in our current tech market, but there’s also a pragmatic and even altruistic motivation in play here. If a proper UET can be defined, one that preserves as much of Ethernet as possible, it would facilitate the networking of AI/HPC clusters with the rest of the Internet and VPNs. It might even revolutionize data center networks, offering better latency control and higher speeds when loads build to the point where the current Ethernet connections are taxed. Competition in Ultra Ethernet would likely control costs because the new specification would be fully open and its Ethernet roots would make market entry easier.
Even today, it’s arguable whether InfiniBand is justified. One question already being asked by HPC users and even by members of the consortium is how much AI/HPC dependence on InfiniBand’s latency and performance is more convenience than necessity. Could we have biased how we build the hosting platforms for generative AI based on the recommendations of NVIDIA, who inarguably has a vested interest in InfiniBand? Could a different structure for the hosting framework, a different distribution of work, create an implementation that could be supported not only by UET but also by faster versions of today’s Ethernet? As we’ll see, even some NVIDIA people say that’s likely.
One way to figure out whether a legitimate technical interest exists in a market space is to look at the competitive dynamic there. Perhaps the most interesting thing about the Ultra Ethernet group is member Broadcom and their jousting with NVIDIA. Broadcom is the heart and soul, literally, of white-box networking products with their line of switching chips. They’re not an InfiniBand giant, and their Jericho3-AI switch is an InfiniBand competitor. Competitor NVIDIA’s Spectrum-X is claiming to be a “lossless Ethernet” low-latency technology for the masses even without UET, and NVIDIA’s network division spokesperson has said that InfiniBand is really for a small number of massive-workload applications like GPT hosting, but Ethernet is better for most commercial cloud/cluster applications. It’s almost like the network side of NVIDIA is at odds with the AI side.
And that may be the case. What seems to be happening here is a combination of a high-visibility marketing food fight and an underlying technology shift. How many enterprises say they’re using InfiniBand? None told me that spontaneously, and I suspect that even a survey aimed at getting a real number wouldn’t produce statistically significant results. How many find Ethernet is running out of gas? Only 7% of the users who told me about data center constraints listed Ethernet capacity as one, and fewer than 2% of those seemed to think that they were already at the top of the current Ethernet capability set. Most simply needed faster adapters/switches. But AI hosting is hot, and if the premier chipmaker for AI hosting is pushing InfiniBand, then there’s a media battle to be fought, reality notwithstanding.
I do not believe that the number of applications and clusters that justify InfiniBand is likely to generate any broad change in the networking market and its associated opportunities. I don’t think even NVIDIA believes it will (because one of their network people said as much), and that’s why they’ve pushed on Spectrum-X. Broadcom also believes in Ethernet, but they believe that an open, high-performance Ethernet specification is needed because of the growing complexity of compute clusters. But if everyone is really singing Ethernet’s praises, why is Broadcom part of the new UET group while NVIDIA is conspicuous by its absence? It may lead back to that marketing war. Right now, an AI service means NVIDIA GPUs and often Mellanox InfiniBand adapters. Right-now revenue makes every company happy, happier than potential future revenue. NVIDIA likely fears that joining the Ultra Ethernet group would validate an InfiniBand competitor and overhang current opportunity.
Broadcom sees Spectrum-X, not InfiniBand, as the competing technology. Broadcom’s response is to say that Spectrum-X is just a rehash of what Broadcom has already done with its own Jericho3-AI and Tomahawk5 switch chips. Broadcom may realize that NVIDIA will hang back on UET because of its current InfiniBand opportunity. By joining the Ultra Ethernet group, then, Broadcom puts pressure on NVIDIA. Join, and everyone will say that InfiniBand is old technology that NVIDIA is moving away from. Don’t join, and everyone will say NVIDIA is trying to create a proprietary advantage rather than an open solution.
To get to the reality of all of this, we have to go back to basics. Spectrum-X was announced this spring by NVIDIA for AI missions, and it’s a collection of products, an ecosystem of chips that are pretty obviously intended to pull each other through. If they’re doing all that Spectrum-X says, and promoting InfiniBand only for massive workloads, aren’t they admitting that Ethernet is the right answer? If that’s the case, then they should join the Ultra Ethernet group and make the best of it. Whatever pressure that move might put on their sales of Mellanox adapters in the near term won’t hurt them more than their own comments in support of the Spectrum-X launch.
All this is interesting in terms of market dynamic, but the biggest question is how Broadcom might play this if their VMware acquisition is approved (and it appears to have been closing in on approval recently). If there’s any real innovation in Spectrum-X, it’s the notion that a hardware ecosystem of processing and networking creates a lossless network. Such an ecosystem cries out for a collateral software ecosystem, a platform for the applications with APIs and tools. VMware would be ideal for that. Broadcom might have platform stars in their eyes.
NVIDIA may have to face that potential competitive threat, beyond eventually giving in to UET. Broadcom with a platform would mean NVIDIA needs one too. They have a developer program, of course, and they’ve done some very innovative things with their own software development teams. Could they then decide to create a platform for a tight integration of compute and network? Sun used to say that “The network is the computer.” Could NVIDIA and Broadcom think the computer is becoming the network? NVIDIA might even back into it through the APIs they are already supporting, with new ones added to create or encourage that compute/network integration. All of this could be very interesting, first in creating a strong integration between data center or cluster connectivity and WAN connectivity, and then perhaps in making the network look more and more like a chip processing bus. I think watching the UET developments might be very smart right now.