We’re hearing a bit more about a new AI concept, no surprise given the nature of tech hype, which always tries to trump itself to avoid going stale and losing people’s interest. This one might actually be important, though, and it’s a flavor of a concept that’s been important all along, one I’ve blogged about often. The new concept is the world model, and I’d argue it’s a special case of the digital twin.
The root of all this is the likelihood that any major advances in tech dependence, business or personal, and in tech spending will come about through a tighter coupling between tech and human behavior. Today we use tech; tomorrow it’s tightly coupled to our work and lives. The whole “screenless” concept depends, in the end, on having autonomous technologies doing stuff for us, not showing us things. For this close coupling to work, it’s essential that technology understand not just what we say or ask, but how we work and live.
A digital twin is a model of a real-world system that’s synchronized with the real world via sensor data (IoT). In practice, these real-world systems are minimal subsets of the real world, things like an assembly line, a shop floor, a warehouse. If we want tight coupling with people, we need to expand the twinning to embrace at least a major subset of the world those people live in. Hence, a “world model”.
There is clearly no way today to model the entire world as a digital twin, and it’s very likely there never will be. However, we could surely model enough of the world to support close integration of technology with lives and work. To do this, we’d draw on the fact that the world as we see it is a kind of hierarchy. We have rooms/spaces, buildings, cities, countries, continents, and so forth. If we were to try to build a world model for a piece of the world, the logical way to go about it would be to model lower-level elements and then couple them to create the next level. I’ve fiddled with this for over a decade, and it turns out to be a pretty complicated process.
In my most recent attempts, I found it essential to divide “things” in the world into two categories, what I thought of as “entities” and “spaces”. An entity is something that behaves according to its own internal rules and moves around with what would appear, from the outside, to be autonomy. People are entities, but so are things like vehicles of any sort, animals, the weather, etc. A space is something entities can inhabit. It has its own rules/properties, and it might be a physical facility or simply a random collection of entities (think “flash mob”).
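To make the entity/space split concrete, here’s a minimal sketch in Java (the language I ended up using). All the names here, Entity, Space, act, enter, are my own illustration, not any real framework:

```java
// Hypothetical sketch of the entity/space split described above.
// An entity behaves by its own internal rules; a space is something
// entities inhabit, with its own rules and a roster of occupants.
import java.util.ArrayList;
import java.util.List;

class Space {
    final String name;
    final List<Entity> occupants = new ArrayList<>();
    Space(String name) { this.name = name; }
}

class Entity {
    final String name;
    Space currentSpace; // an entity inhabits at most one space at a time
    Entity(String name) { this.name = name; }

    // Internal rules: the entity decides its own behavior autonomously.
    String act() { return name + " acts on its own internal rules"; }

    // Moving between spaces is what makes an entity an entity.
    void enter(Space s) {
        if (currentSpace != null) currentSpace.occupants.remove(this);
        currentSpace = s;
        s.occupants.add(this);
    }
}
```

Even a “flash mob” fits this shape: a Space with no physical facility, just whatever Entity objects happen to be in its occupant list.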
The autonomous agents or technology assistants that we’re trying to use are entities, and if they’re personal they combine with their users to create another entity layer, so there’s “us” and “augmented us”. This is also how a vehicle might be viewed: we have “car” and “car-with-riders”. When an augmented entity is in a space, it is subject to the behavioral rules there, and those rules influence the entity’s behavior. If the space is inhabited by multiple entities, how those entities influence each other is set by the rules of the space and of each of the entities. Thus, you can see that both entities and spaces are “containers” in a sense, with the distinction being that one can move around and change its space while the other is environmental. But even this is simplistic, because a space might be a ship, which can move around, and a ship is also an entity in that it might contribute itself to the space “port”.
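The “augmented us” layering and the ship-that-is-both-entity-and-space ambiguity can both be sketched as composition. Again, every name here (Actor, Person, Assistant, Augmented, Ship) is a hypothetical illustration of mine:

```java
// Hypothetical: an "augmented entity" composes a base entity with its
// technology assistant; a ship shows something that is both an entity
// (it moves, it can enter the space "port") and a space (it contains
// entities of its own).
import java.util.List;

interface Actor { String describe(); }

record Person(String name) implements Actor {
    public String describe() { return name; }
}

record Assistant(String model) implements Actor {
    public String describe() { return "assistant:" + model; }
}

// "Augmented us": the person plus their assistant acting as one entity.
record Augmented(Person base, Assistant aid) implements Actor {
    public String describe() { return base.describe() + "+" + aid.describe(); }
}

// Entity and container at once, like "car-with-riders".
record Ship(String name, List<Actor> aboard) implements Actor {
    public String describe() { return name + "(" + aboard.size() + " aboard)"; }
}
```

The point of the composition is that the outer layer presents itself to a space as a single entity, whatever is nested inside.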
I eventually decided that “space” and “entity” were really just the top layer of a hierarchy. Spaces, then, were logical collections of entities. Entities were functionally at least somewhat autonomous. Both had properties that would be “visible”, meaning that an agent process in a world model would have to be able to access them. What really separated them was that when two or more entities interacted, they did so within a “space”, which was then a common mission set that might have a physical manifestation (they were in a meeting room) or a logical one (they were on a video call, or perhaps cooperatively editing something).
All of this is complicated, but you can perhaps see how much more complicated it has to get. Entities have to be able to communicate with their spaces, meaning they have to share events, which might be status changes or instructions. That means interfaces with some standards to define data elements and formats. That, in turn, means some body to set the standards so that spaces and entities can work with each other.
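The event-sharing requirement might look something like the sketch below. The Event fields and the EventSink interface are my own invention; a real definition of these data elements and formats would come from whatever standards body took this on:

```java
// Hypothetical event interface between entities and spaces. Entities
// share events, which might be status changes or instructions, with
// the space they inhabit.
import java.util.ArrayList;
import java.util.List;

record Event(String source, String type, String payload) {}

interface EventSink {
    void onEvent(Event e);
}

// A space's side of the interface: it receives and records the events
// its occupant entities emit.
class SpaceBus implements EventSink {
    final List<Event> log = new ArrayList<>();
    public void onEvent(Event e) { log.add(e); }
}
```

The interesting standards question isn’t this plumbing, it’s agreeing on the vocabulary of `type` and `payload` so that any entity can talk to any space.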
That’s still complex. Any two container types that have to interact in any way need a fully defined API mechanism to support it. The best model for the structure would resemble that of the Simple Network Management Protocol (SNMP), whose Management Information Base (MIB) defined a hierarchy of data models, from the most general management elements you could associate with any network element, through device class and subclass, down to proprietary enterprise extensions. One high-level must-support element here would be a list of the standards the API can support.
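Here’s what an MIB-style container description might look like, with the must-support standards list at the top level. The class, the OID string, and the property names are all illustrative assumptions on my part:

```java
// Hypothetical SNMP/MIB-style profile for a container (entity or space):
// a hierarchical identifier, a must-support list of API standards, and
// the container's visible data elements. Proprietary extensions would
// hang off deeper levels of the oid hierarchy, as in SNMP.
import java.util.List;
import java.util.Map;

class ContainerProfile {
    final String oid;                      // hierarchical id, SNMP-style
    final List<String> supportedStandards; // must-support capability list
    final Map<String, String> properties;  // visible data elements

    ContainerProfile(String oid, List<String> standards, Map<String, String> props) {
        this.oid = oid;
        this.supportedStandards = standards;
        this.properties = props;
    }

    // Before two containers interact, each checks what the other supports.
    boolean supports(String standard) {
        return supportedStandards.contains(standard);
    }
}
```

The value of the hierarchy is the same as in SNMP: any agent can read the general layer of any container, and only containers that share a deeper subtree need to understand each other’s extensions.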
Suppose two people want to co-develop a document or application. They create a “space” that includes all the APIs needed to collaborate, and move their virtual entities into it, hooking into its facilities and adhering to its rules. Any tools, like vibe-code agents, are also entities in this space, and through the space APIs the people can work with these tools as they would with another human entity. You can extend this to a model of a worker in a factory where the other entities are mechanical elements, to a road with other vehicles, to a virtual gathering of friends.
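The co-editing scenario might be sketched like this, with humans and a vibe-code agent joining the same space and using the same API. CollabSpace and its methods are, again, purely my own illustration:

```java
// Hypothetical sketch of the co-editing scenario: two human entities
// and a coding agent join a shared space, and all of them interact
// with the shared artifact through the same space API.
import java.util.ArrayList;
import java.util.List;

class CollabSpace {
    final List<String> members = new ArrayList<>();
    final List<String> document = new ArrayList<>();

    // Moving a virtual entity into the space.
    void join(String entity) { members.add(entity); }

    // Space rule: only members may contribute, and human or agent,
    // the API is identical.
    void propose(String entity, String line) {
        if (members.contains(entity)) document.add(entity + ": " + line);
    }
}
```

Note that nothing in the space distinguishes the agent from the humans; that symmetry is what lets the same model stretch to the factory floor or the road.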
OK, but how does this relate to whether screen-less devices are the future, if indeed it does? Well, if we assume a future where cooperation, work, and life are managed even in part through a world model, it seems to me inescapable that we’d need to merge the digital world-model view with the real world. That means AR glasses to me. However, I don’t see how we could achieve complete world-model integration for well over a decade, even at best, and until that’s done, AR glasses can’t pop up things that might interfere with our real-world life. Yes, it’s possible that AR might signal us and we could elect to see the signal, but I can only imagine the lawsuits emerging because somebody made that election while driving, boating, or flying.
I think I learned a lot fiddling with this approach, which I tried in C++, Rust, and finally Java (because I’d used that language recently in my ExperiaSphere work), but I also learned that it’s simply not a one-person job to implement. I estimate that just building the basic framework is likely ten person-years’ worth of work, more than I can put in myself and something I’m not interested in funding. But is that a lot for a tech company, say an Nvidia? I don’t think so, which is why I’m hopeful something real will come of world models.
