One of the problems with hyped technologies is excessive breadth. Hype is a form of mass hysteria, and to achieve it you need masses, which means you tend to generalize to increase the chances a given theme will intercept interest profiles and get clicks. The problem is that there’s often a specialized piece of a hyped technology evolution that could really be important, but is missed because of its very specificity. So it is with NVIDIA’s CES position and its Cosmos world foundation models.
In AI, a foundation model is a deep learning model (deep learning being in turn a subset of machine learning) that’s pre-trained on a broad set of data and is usually then applied by being further specialized by the user. Popular public generative AI is an example of a foundation model trained on a vast array of knowledge, such as the Internet. So is AI applied to programming, tax accounting, or business analytics. Cosmos is what NVIDIA calls a “world foundation model,” designed to generate representations (models, images, videos, sounds) of the real world. In a very real sense, the goal of Cosmos is to build a digital twin.
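The pre-train-then-specialize pattern can be sketched with a toy model. Everything here (the class name, the unigram scoring, the fine-tuning weight) is invented for illustration; real foundation models are deep neural networks, not word counters, but the division of labor is the same: a broad pre-training pass, then a narrower user-driven specialization pass.

```python
from collections import Counter

class ToyFoundationModel:
    """Toy stand-in for the foundation-model workflow: pre-train broadly,
    then fine-tune on a user's narrower domain data."""

    def __init__(self):
        self.counts = Counter()

    def pretrain(self, corpus):
        # Broad pass over general-purpose text.
        for doc in corpus:
            self.counts.update(doc.lower().split())

    def fine_tune(self, domain_corpus, weight=5):
        # Specialization: domain text is weighted more heavily,
        # shifting the model toward the user's task.
        for doc in domain_corpus:
            for word in doc.lower().split():
                self.counts[word] += weight

    def top_term(self):
        # The model's "prediction": its highest-scoring term.
        return self.counts.most_common(1)[0][0]

general = ["the cat sat", "the dog ran", "the sun rose"]
tax_domain = ["deduction rules", "deduction limits"]

model = ToyFoundationModel()
model.pretrain(general)
base_top = model.top_term()    # dominated by general text: "the"
model.fine_tune(tax_domain)
tuned_top = model.top_term()   # shifts toward the domain: "deduction"
```

The point of the sketch is only that specialization moves the model’s behavior toward the user’s data without retraining from scratch.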
The greatest challenge in integrating technology with real-world activity is modeling the real world in a way that enables an application to analyze conditions and generate appropriate responses. Such a model is usually called a “digital twin,” a term that’s perhaps a bit misleading because it’s not normally a complete real-world replica but rather a replica of a subset of real-world processes. NVIDIA, for example, says that Cosmos was trained on twenty million hours of video of people doing things. Think of it in NVIDIA’s term: “physical AI.”
Cosmos is a model set or family with three members. Nano is a low-latency, resource-optimized model designed for at-the-edge deployment. Super is the baseline model for centralized operation, and Ultra is the high-fidelity version that would, for example, be used in the creation of custom models. All are trained on real-world activities involving people doing things, including moving, driving, and working. The models can be used to analyze video information (real-time or stored), build real-world models, predict future states based on training data, and run simulations for optimization.
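One way to picture how the three tiers divide the work is a simple selection rule keyed to deployment constraints. The function name, thresholds, and return strings below are assumptions for illustration, not an NVIDIA API; they just encode the roles described above.

```python
def pick_tier(latency_budget_ms, needs_high_fidelity):
    """Illustrative tier selection for a three-member model family
    like Cosmos (Nano / Super / Ultra). Thresholds are invented."""
    if needs_high_fidelity:
        return "Ultra"   # high-fidelity work, e.g. creating custom models
    if latency_budget_ms < 50:
        return "Nano"    # resource-optimized, at-the-edge deployment
    return "Super"       # baseline model for centralized operation

edge_tier = pick_tier(latency_budget_ms=20, needs_high_fidelity=False)
central_tier = pick_tier(latency_budget_ms=200, needs_high_fidelity=False)
```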
It’s hard to overstate the importance of foundation models in general, and of Cosmos in particular. I’ve blogged many times on the importance of creating new tech benefits by bringing tech into our real world, on the role digital twins necessarily play in that, and on the fact that we lack a singular digital-twin framework. Cosmos can address all of that. IBM, the strategy leader in enterprise AI according to enterprises, has this take.
The goal of Cosmos isn’t to convey something as much as to represent the physics of an activity. That representation can, of course, guide the creation of a video so that it shows a generated example of the activity, but it can also be used to analyze, model, and influence the activity. Cosmos can provide as close to a complete solution to real-world tech integration as I’m aware of today.
In the workplace, for example, Cosmos could analyze the process of loading trucks with boxes, and from this could recommend the optimum approach, direct workers to apply the approach, or control robots to apply it. It could, in theory, do that for any physical process it had been trained on, or could be trained on. It could also assess the outcome of changes made to the activity, which means that it could potentially forecast whether an in-progress action would result in an unsafe or undesirable outcome.
In the real world, Cosmos’ potential is even more revolutionary. Obviously, autonomous operation of vehicles would be facilitated, perhaps to the point where true full-time self-drive would be safer than human operation. Even pedestrian movement, from shopping to hiking and mountaineering, could be changed forever, provided that the necessary inputs were available.
Cosmos illustrates that video is the ultimate means of machine-learning the real world, which means that full-on next-gen applications will rely on cameras and AI analysis of the feeds. I think it’s clear from the three-level Cosmos model structure that NVIDIA presumes we’d have some sort of AI hierarchy, with simple and (relatively) inexpensive Nano models doing local analysis and looking for patterns, and deeper Super and Ultra models mediating that information to derive insights and forecast outcomes. The structure overall could be reasonably easy to deploy in work-related, facility-limited applications, but the cost of society-wide applications would be considerable, and would raise the dual questions of funding and social/political acceptance.
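That hierarchy can be sketched in a few lines, assuming invented event fields and thresholds: an edge-resident screening pass forwards only the frames worth escalating, and a central model summarizes what it receives into a forecast.

```python
def nano_screen(frames, motion_threshold=0.5):
    """Edge pass (Nano-style): keep only frames whose motion score
    crosses a threshold, so most of the feed never leaves the site."""
    return [f for f in frames if f["motion"] >= motion_threshold]

def central_forecast(events):
    """Central pass (Super/Ultra-style): derive a simple risk forecast
    from the forwarded events. Labels and cutoffs are invented."""
    if not events:
        return "nominal"
    avg = sum(e["motion"] for e in events) / len(events)
    return "alert" if avg > 0.8 else "watch"

feed = [{"motion": 0.1}, {"motion": 0.6}, {"motion": 0.9}]
flagged = nano_screen(feed)          # only the 0.6 and 0.9 frames forwarded
status = central_forecast(flagged)   # average 0.75 -> "watch"
```

The design point the sketch captures is bandwidth and cost: the cheap edge tier filters, so the expensive central tier only reasons over what matters.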
Enterprise comments on Cosmos so far suggest that manufacturing, transportation, warehousing, refining, and similar sectors could deploy it profitably without a major effort, provided that some vendor or integrator did the specific modeling work and that the in-house expertise needed (which, so far, they can’t assess) could be acquired. Public safety and military applications are currently seen as the most fruitful paths toward a society-wide system, because of the need to pre-position the assets over a wide area for effective exploitation.
There are examples offered by enterprises for Cosmos applications that integrate limited real-world video analysis with other data to generate something useful. For example, body camera video might be analyzed in combination with mapping data to guide first responders to a specific site, or plot a route for safe withdrawal. Jobsite video, vehicular video, and other “point video” sources could also be enough for some applications, and it may be that these applications would become the proving grounds for broader use. However, enterprises aren’t familiar enough with Cosmos for their views here to be authoritative in the near term; experience will be required, and those who have expressed early interest didn’t think they’d be doing more than kicking Cosmos tires in 2025.
I expect that they’re right, too. The challenges here are significant, even with Cosmos. As I’ve already noted, enterprises think they’ll need time to develop AI experience, and Cosmos alone demonstrates that enterprise AI is still a moving target. What does seem clear is that Cosmos and world foundation models, combined with the notion of AI agents (I dislike how “agentic” is getting spread out to things it’s not really applicable to), can create a way of building applications that are somewhat insulated from the technical evolution of AI. There’s already a good article available on how agent-oriented AI will impact APIs, for example, showing software architects are thinking about the challenge.
It’s been my long-held view that public generative AI aimed at personal productivity in document or email generation is an ROI dead end. Cosmos-like agent AI is not. Did the big cloud types know this was in the works, and so justified their capex on AI with the popular hype Wall Street loves, rather than trying to explain Cosmos and its timing and risks? Or did they believe the hype? I expect that we’ll find out pretty quickly now that Cosmos is out.