In a couple of my recent blogs, I talked about Nokia initiatives that might (directly or indirectly) jump-start the fusion of AI and the metaverse. That fusion could launch a tech boom and open a pathway to enhancing the productivity of a whole class of workers we’ve been ignoring, the ones who aren’t sitting at desks. But Nokia’s not a marketing master, and the task might require that skill.
In another blog, I posited that Apple’s AR/VR initiative could, in some direct or indirect way, be the first step in a new IT cycle, the first step toward another tech boom. Apple is a marketing master, but there’s a long and winding path from AR/VR to a whole new productivity paradigm. The question is how that path might be walked at the detail level, and what other technologies might facilitate the shift. In an earlier blog I talked about the chance that AR/VR, AI, and the metaverse might all be involved. How would that work and who might win as a result? That’s the topic of this blog.
What AR/VR does is to provide an immersive visual experience. The key word here is “immersive” because any monitor will provide a visual experience, but the experience is small relative to a user’s view of the real world, and two-dimensional when our view of the world is three-dimensional. It follows that the applications that would benefit from AR/VR are the ones that benefit from the immersive property. The things that facilitate those applications are the things that build value for an AR/VR ecosystem. We know that 3D video content is striking with the new technology, but creating that content is difficult and expensive. AI might help with that, but how far entertainment alone could drive AR/VR is hard to say, and surely there are other valuable missions out there.
I think how those other missions develop will determine how influential Apple’s initiative will be, and how profitable it will be for Apple. Since Apple alone isn’t likely to be able to move the mountain of AR/VR ecosystem development, we have to try to work out just what might evolve. From that, we may be able to at least determine the opportunities most likely to drive things forward.
Broadly speaking, AR/VR divides just as the slash divides the two acronyms. Augmented and virtual reality, in other words, represent different things, but what things? There are multiple views on this. The classic conception is that augmented reality starts with a real-world view, what a camera could see and convey to the goggles, and adds in stuff that helps interpret (augment) the real world. VR, in contrast, doesn’t relate to the real world but to a construct that we visualize in real-world terms. But this presents some obvious challenges, such as those of a virtual meeting. If I build the meeting from the images of real people, as we might today in a Teams or Zoom meeting, we aren’t really being immersive, because each of the attendees is in a separate real-world visual context. If we build the meeting from avatars in a virtual room, and if the avatars are faithful representations of the real attendees, is this AR or VR? If we go further and let the people pick the avatars that represent them, is that AR or VR?
Also broadly speaking, both AR and VR, to be immersive, have to be realistic. That means that the visual experience has to be fairly high in resolution, and the visual field has to change as we move our eyes, head, and body, and at a rate and in a direction that’s appropriate. Since our movement, in the real world, would be in three dimensions, that means that the image we see in the goggles has to be created in three dimensions and all three have to be visualized realistically in light of our movement.
How this is done isn’t easy to map cleanly into the two popular acronyms we have in place. To resolve the confusion here, we’ll have to invent or adapt some terms. Let’s start at the high level with the notion that we can divide goggle experiences into “representational”, “controlled” or “computational”, meaning that they can directly correlate with real-world activity, be manipulated by the user to appear real-world, or be generated and projected in some way into our visual field, respectively. Each of these has challenges.
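To make that three-way split a little more concrete, here’s a minimal sketch (in Python, with names invented purely for illustration, not drawn from any real product or API) of how the three experience classes might be expressed, and what each implies about where the imagery originates.

```python
from enum import Enum, auto

class ExperienceClass(Enum):
    """The three experience classes from the text; the code around them is hypothetical."""
    REPRESENTATIONAL = auto()  # correlates directly with real-world activity
    CONTROLLED = auto()        # manipulated by the user to appear real-world
    COMPUTATIONAL = auto()     # generated and projected into the visual field

def rendering_source(experience: ExperienceClass) -> str:
    """Rough mapping from experience class to where its imagery comes from."""
    if experience is ExperienceClass.REPRESENTATIONAL:
        return "camera capture of the user's real surroundings, plus overlays"
    if experience is ExperienceClass.CONTROLLED:
        return "a shared virtual context synchronized with user actions"
    return "a software-defined world model, rendered from the user's viewpoint"

print(rendering_source(ExperienceClass.CONTROLLED))
```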
Representational applications are clearly a form of AR. We appear to have discarded the notion (at least for the moment) that AR allows the mixing of a direct optical/eye view of the real world and a view of additional virtual elements. Instead, we assume that the real world is captured via camera(s) and then mixed with those additional elements. This category is then subdivided into “real context” and “virtual context”, where the former means that the user’s real-world surroundings provide the visual context, and the latter means that the context of the experience is constructed. The difference is in how things like meetings would be handled.
When we use Teams or Zoom for virtual meetings, we are not obtaining an immersive experience. We are simply creating a class picture, a mosaic of images of the attendees. If we wanted to make the experience immersive, we’d need to either make it appear that all the attendees were in the same “real room” as each person, or create a single “virtual room” into which we could project the attendees.
We could also consider driving as an application. In theory, we have the same options as we had for meetings, but in this case it’s pretty clear that the driver’s immersive experience has to be tightly coupled to the driver’s real context.
This means that the context issue is really decided by the role of the user’s real context in relation to other autonomous elements, like other people. If the user is simply in a room, as they would be in a virtual meeting, and the other autonomous elements don’t share the user’s real context, then any interactions are virtual and wouldn’t actually impact the real world. If the user shares a space with other autonomous elements, then that space has to form the context because real-world interaction is possible.
I think this is really the core of the “what’s an immersive AR/VR application” question, because it sets the framework for the inherent complexity of implementation, and how that complexity divides among the elements. It also lets us relate back to the three experience classes, “representational”, “controlled” or “computational”.
Real context demands representational implementation. Whether we’re talking about a warehouse goods movement application or an autonomous vehicle on a road, we need to have the stuff that shares the world with our user represented faithfully, because real-world interaction is not only possible, but likely intended to a degree. We can probably assume that the application’s immersive experience is essentially what the user would see if they took off the goggles, meaning that the augmentation is intended to interpret and guide the user. The logic of the application then interprets conditions based on real-world state. That would surely include, and likely be dominated by, the need to interpret something captured by camera, meaning real-time video analysis. The object in front of you is real, and if your motion takes you forward you are going to hit it, so augmentation means, in this case, warning of risk. A key point here: the user of the application and every other contextual element represents themselves.
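As a purely illustrative sketch of “augmentation as warning of risk,” assume object detection has already produced labeled objects with estimated distances, and that the user’s forward speed is known; the core check is then little more than a time-to-contact calculation. Every name and threshold here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    distance_m: float  # estimated range, assumed to come from depth or stereo analysis

def collision_warnings(objects: list[DetectedObject],
                       forward_speed_mps: float,
                       horizon_s: float = 2.0) -> list[str]:
    """Flag objects the user would reach within horizon_s seconds at current speed.

    A toy version of the assessment step; a real system would fuse sensors,
    track full 3D trajectories, and tune thresholds far more carefully.
    """
    warnings = []
    if forward_speed_mps <= 0:
        return warnings  # user isn't moving forward, nothing to warn about
    for obj in objects:
        time_to_contact = obj.distance_m / forward_speed_mps
        if time_to_contact <= horizon_s:
            warnings.append(f"WARNING: {obj.label} ahead, ~{time_to_contact:.1f}s to contact")
    return warnings

print(collision_warnings([DetectedObject("pallet", 1.5)], forward_speed_mps=1.2))
```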
Virtual context means that the user and other autonomous elements inhabit a virtual world but don’t share real-world context. Our hypothetical Teams or Zoom meeting is an example. To make it immersive, we have to make it appear that we’re actually meeting, which means constructing a context that’s shared virtually. In that context, each attendee is represented by an avatar, and so each user may see “through the eyes” of that avatar, but each user is themselves an avatar too. We can’t rely on cameras to create the virtual context, so we have to make each autonomous element, each attendee, realistic enough that the overall virtual context is immersive and credible. This means that we have to synchronize users with avatars, either by making the avatar do what the user is really doing, or what the user wants the avatar to be doing, and we have to model the three-dimensional context so it can be represented consistently to all. The big technical challenge is creating, maintaining, and representing the shared virtual context.
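Here’s a minimal sketch of that synchronization problem, assuming one authoritative copy of the room state and pose telemetry arriving per attendee; the structures and names are my own illustration, not any vendor’s API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class AvatarState:
    """One attendee's representation in the shared virtual room."""
    position: tuple[float, float, float] = (0.0, 0.0, 0.0)
    orientation_deg: float = 0.0
    last_update: float = 0.0

@dataclass
class VirtualRoom:
    """A single authoritative model of the shared context, updated from user telemetry."""
    avatars: dict[str, AvatarState] = field(default_factory=dict)

    def apply_pose_update(self, user_id: str,
                          position: tuple[float, float, float],
                          orientation_deg: float) -> None:
        # Either real tracked motion or a user-chosen pose ("what the user wants
        # the avatar to be doing") lands here; the room doesn't care which.
        avatar = self.avatars.setdefault(user_id, AvatarState())
        avatar.position = position
        avatar.orientation_deg = orientation_deg
        avatar.last_update = time.time()

    def snapshot(self) -> dict[str, AvatarState]:
        # Every attendee's client renders its own viewpoint from this one shared state.
        return dict(self.avatars)

room = VirtualRoom()
room.apply_pose_update("alice", (1.0, 0.0, 2.0), orientation_deg=90.0)
print(room.snapshot())
```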
There’s another dimension to this virtual context: the case where there are no other autonomous elements sharing context with the user. The user is interacting with purely virtual elements. A single-player game and a movie are examples. In the movie, the user is an observer with no real autonomy, other than perhaps “turning” or “looking” in a different direction; in the game, the user can “move” and thus influence what the immersive visualization shows. In these applications, the world-model is static or controlled by software, and the visualization of the model is all that changes. This is the “computational” class of experience.
Sending pixels to an AR/VR goggle isn’t a big technical challenge. The challenge lies in what to send. In representational applications where real-world camera views make up most of the visual context, you have requirements that are more related to assessing conditions than to creating anything, and the problem isn’t much different from the driver aids in modern vehicles. My car will pace other vehicles, steer as long as I keep a hand on the wheel and the lanes are visible, and brake if something obstructs the path. That sort of assessment doesn’t require major tech innovations (it might eventually benefit from some, though).
Virtual context is another matter, because what we send to the goggles has to be based on a model that has to be built and maintained to track the movements of those elements that are autonomous or controlled, including those that are software-generated to appear autonomous. The visual context will depend on where everything is relative to each user, and what each autonomous or apparently-autonomous thing is doing, including the users themselves.
It’s virtual context that introduces the opportunity (or even requirement) for AI and metaverse or digital twin technology. Any virtual-context application will require a model of the virtual context that can then be used to render user perspective and visualization. Think of this as being like a CAD system, where a two-dimensional rendering of a part on a monitor requires three-dimensional modeling and the concept of a viewpoint. If the virtual context includes autonomous or apparently-autonomous elements capable of motion within the realm of the virtual context, then there has to be a way of determining the position of each element. If that position or behavior is tied to the owner of the autonomous element (another user, for example) then the motion/position might have to be captured in real time and synchronized. That would also be true if the element were simply a part moving on an assembly line or on a conveyor, a vehicle, etc. The more coupled the behavior of elements in the virtual context must be with the real world, the more the model of the virtual context looks like a digital twin.
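To ground the CAD analogy, here’s a deliberately simplified sketch of rendering a viewpoint from a shared world model: the same modeled element projects differently for two viewers, which is why the model, and not the pixels, is what has to be maintained. The math (a yaw-only perspective projection) is a toy, not a proposal.

```python
import math

def project_to_view(point, viewpoint, yaw_deg, focal_length=1.0):
    """Project a 3D world point onto a 2D image plane for one viewer.

    Assumes the viewer looks along their local +z axis with no pitch or roll.
    Returns None if the point is behind the viewer.
    """
    # Translate the point into the viewer's frame
    dx = point[0] - viewpoint[0]
    dy = point[1] - viewpoint[1]
    dz = point[2] - viewpoint[2]
    # Rotate about the vertical axis by the viewer's yaw
    yaw = math.radians(yaw_deg)
    x_cam = dx * math.cos(yaw) - dz * math.sin(yaw)
    z_cam = dx * math.sin(yaw) + dz * math.cos(yaw)
    if z_cam <= 0:
        return None  # behind the viewer
    # Perspective divide: nearer things spread wider in the image
    return (focal_length * x_cam / z_cam, focal_length * dy / z_cam)

# Two users looking at the same modeled element from different positions
element = (2.0, 1.0, 5.0)
print(project_to_view(element, viewpoint=(0.0, 0.0, 0.0), yaw_deg=0.0))
print(project_to_view(element, viewpoint=(4.0, 0.0, 0.0), yaw_deg=30.0))
```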
The metaverse concept ties into this because a metaverse model is designed to create a virtual context, which means it can visualize a digital twin. If we’re talking about AR/VR driving change, then whatever else is going on will have to contribute to immersive visualization, so it may be smarter to think about digital twins as being an application of a metaverse model than the other way around. Certainly if Apple is creating a major drive for AR/VR applications, we could expect that to be the case. That, in turn, argues for thinking about the metaverse as a kind of middleware with applications built on top, and whose features include the reception of telemetry from the real world (joystick inputs, mouse inputs, sensors, etc.) and the integration of that telemetry with the elements of the virtual context that represent the sources.
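As a hedged sketch of what “metaverse as middleware” might look like under my own assumptions (all class and method names are invented for illustration): the middleware owns the virtual-context model, binds each telemetry source to the element that represents it, and lets applications, industrial or social, subscribe to the resulting model changes.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TwinElement:
    """One element of the virtual context that mirrors a real-world source."""
    element_id: str
    state: dict = field(default_factory=dict)

class MetaverseMiddleware:
    """Hypothetical middleware layer: model below, applications above."""

    def __init__(self) -> None:
        self.elements: dict[str, TwinElement] = {}
        self.bindings: dict[str, str] = {}  # telemetry source id -> element id
        self.subscribers: list[Callable[[TwinElement], None]] = []

    def bind_source(self, source_id: str, element_id: str) -> None:
        """Associate a real-world telemetry source with its virtual-context element."""
        self.elements.setdefault(element_id, TwinElement(element_id))
        self.bindings[source_id] = element_id

    def subscribe(self, callback: Callable[[TwinElement], None]) -> None:
        """Applications (rendering, analytics, social logic) register for model changes."""
        self.subscribers.append(callback)

    def on_telemetry(self, source_id: str, reading: dict) -> None:
        """Integrate a telemetry reading (joystick, mouse, sensor) into the model."""
        element = self.elements[self.bindings[source_id]]
        element.state.update(reading)
        for notify in self.subscribers:
            notify(element)

# A digital-twin style use: a conveyor sensor drives one element of the model.
mw = MetaverseMiddleware()
mw.bind_source("conveyor-sensor-7", "pallet-42")
mw.subscribe(lambda el: print(f"{el.element_id} -> {el.state}"))
mw.on_telemetry("conveyor-sensor-7", {"position_m": 12.5})
```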
I doubt that Apple and Nokia are going to collaborate on this. They have two very different perspectives on what a winning approach would look like. The question, the real question, is whether the metaverse middleware is more social, more industrial, or more general. If Apple drives it, the first of the three is the most likely, in which case the realization of a new wave of IT investment might be delayed. If someone else drives it, that would mean that Apple’s own benefits may be delayed.
The industrial metaverse is a concrete application set, which means its supporting technology is pretty easy to visualize. Nokia is already advancing it. AR/VR is a supply-side innovation, meaning that in order for it to achieve its goals for Apple (being the “next iPhone”) there will have to be a lot of supporting development done by others. Apple will drive things toward the social, consumeristic, and visual. Nokia will drive things toward business and technology. We may see which approach wins by the fall.