How artificial intelligence is moving into the physical world - and why the path is rockier than hoped.
The American tech companies Google, Meta and Nvidia are driving "Physical AI" forward - so-called world models that predict not words but states, and that are thus pushing out of the laboratory into factories, logistics and cities.
What already works? Where does it still fall short? And why does the technology remain in laboratory operation for now?
First of all: priorities are obviously shifting. Google is reorganizing its AI research more around robotics and embodied systems. Meta, Facebook's parent company, brings in an external manager and places him above long-time chief AI scientist Yann LeCun. Nvidia boss Jensen Huang no longer talks only about computing power, but about "Physical AI" - artificial intelligence with a body. The tech industry in Silicon Valley is looking for its next growth driver.
The core of this movement is simple: chatbots like ChatGPT predict the next word.
World models, in turn, predict the next state of an environment.
That sounds like a nuance, but economically it means a new order of magnitude: the digital economy moves enormous amounts of money - the physical economy, i.e. factories, logistics, agriculture and construction, moves many times that amount. Anyone who reliably brings AI into real processes taps a market that touches almost the entire value chain.
"LLMs do not understand the physical world, have no permanent memory, cannot reason reliably and cannot plan," says Yann LeCun. For Meta's chief AI scientist, these deficits mark the limits of today's language models. World models are meant to overcome them: they are supposed to learn how objects move, collide and break - and from that construct an internal picture of cause and effect that stays consistent over time.
The conditions have only recently been met: enough data, enough capital and, above all, enough computing power. The timing is no coincidence. Text, image and code models have completed their first product cycle; the industry needs a new growth story, and world models supply it. Whether the technology lives up to that expectation is now being decided.
A world model works fundamentally differently than a language model. ChatGPT learns statistical patterns from billions of words.
A world model, on the other hand, learns from videos, images, sensor data and 3D scans how the world behaves - not "which word comes next?", but "what happens if I push the cup?".
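The contrast can be made concrete in a few lines of code. The sketch below is purely illustrative - the tiny vocabulary, the toy physics and all function names are invented for this article and come from none of the systems mentioned:

```python
import numpy as np

# Language model: given a sequence of tokens, predict a probability
# distribution over the next token. (Weights are random stand-ins here.)
VOCAB = ["the", "cup", "falls", "slides", "stops"]

def next_word_distribution(token_ids, rng):
    logits = rng.normal(size=len(VOCAB))            # stand-in for a trained network
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax
    return dict(zip(VOCAB, probs.round(3)))

# World model: given the current physical state and an action,
# predict the next state ("what happens if I push the cup?").
def next_state(state, action, dt=0.1):
    vel = state["vel"] + action["push_force"] * dt  # acceleration from the push
    vel *= 0.95                                     # crude friction
    pos = state["pos"] + vel * dt
    return {"pos": round(pos, 4), "vel": round(vel, 4)}

rng = np.random.default_rng(0)
print(next_word_distribution([0, 1], rng))                        # text -> next token
print(next_state({"pos": 0.0, "vel": 0.0}, {"push_force": 2.0}))  # state -> next state
```

The first function only ranks words; the second has to respect time, force and friction - which is exactly where the difficulty begins.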
Four dimensions mark the distance to chatbots. Firstly, the so-called modality: world models are multimodal. They primarily process visual input, sometimes also audio or proprioception (the body's own sense of position and movement). Language alone is not enough to understand spaces. Secondly, temporality: language models generate token by token, while world models must represent dynamics - a ball keeps falling, friction slows things down, liquid flows - consistently over many time steps. Thirdly, embodiment plays an essential role. World models are often coupled to agents, robots, drones or vehicles; outputs are then not just text, but motor commands, paths and gripping movements. Finally, there is a difference in data scale and diversity. Internet text is plentiful but comparatively homogeneous, whereas spatial intelligence requires images and scenes from all angles and in all light and weather conditions. The company Niantic - known for Pokémon Go - has scanned ten million locations worldwide over the years and adds around a million new scans every week. Only such data sets allow robust spatial representations.
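Put together, the four dimensions imply an interface that looks quite different from a chat window: multimodal input, a state rolled forward over time, and motion as output. The following sketch illustrates that shape only in the abstract - the class and method names are hypothetical and the "dynamics" are a trivial placeholder:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:              # multimodal input, not just text
    image: np.ndarray           # camera frame
    proprioception: np.ndarray  # joint angles, body pose

@dataclass
class MotorCommand:             # embodied output: motion, not text
    joint_velocities: np.ndarray

class WorldModel:
    """Hypothetical interface: encode observations, roll the latent
    state forward over many time steps, emit motor commands."""

    def encode(self, obs: Observation) -> np.ndarray:
        # stand-in: project the multimodal inputs into one latent state
        return np.concatenate([obs.image.ravel()[:8], obs.proprioception])

    def predict(self, latent: np.ndarray, steps: int) -> list[np.ndarray]:
        # temporality: the model must stay consistent over many steps
        trajectory = [latent]
        for _ in range(steps):
            trajectory.append(trajectory[-1] * 0.99)  # toy dynamics
        return trajectory

    def act(self, latent: np.ndarray) -> MotorCommand:
        return MotorCommand(joint_velocities=np.tanh(latent[:6]))

model = WorldModel()
obs = Observation(image=np.zeros((4, 4)), proprioception=np.zeros(6))
latent = model.encode(obs)
print(len(model.predict(latent, steps=50)), model.act(latent))
```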
The idea is not new. Back in 2018, David Ha - then a researcher at Google Brain - and the German AI pioneer Jürgen Schmidhuber showed how an agent learns a compact internal simulation of a game world and first runs through possible actions "in its head" before acting.
Today, much larger data sets and computing resources are available. The experts at the AI company Deepmind, which is part of Google, summarize the goal as follows: world models are meant to simulate aspects of the environment and thus predict how an environment will develop - and how one's own actions will change it.
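The planning loop behind that idea can be sketched compactly: the agent samples candidate action sequences, rolls each one forward through its learned internal model, and only then executes the most promising first step. The code below is a toy stand-in for the principle under simplified assumptions, not the original 2018 architecture:

```python
import numpy as np

rng = np.random.default_rng(42)

def learned_dynamics(state, action):
    """Stand-in for the agent's learned internal simulation."""
    return state + action * 0.1 - 0.01 * state

def imagined_return(state, action_sequence, goal=1.0):
    """Roll a candidate plan forward 'in the head' and score it."""
    total = 0.0
    for a in action_sequence:
        state = learned_dynamics(state, a)
        total -= abs(goal - state)           # closer to the goal is better
    return total

def plan(state, horizon=10, candidates=64):
    plans = rng.uniform(-1, 1, size=(candidates, horizon))
    scores = [imagined_return(state, p) for p in plans]
    best = plans[int(np.argmax(scores))]
    return best[0]                           # execute only the first action

print(plan(state=0.0))
```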
However, the reality remains stubbornly complex. Niantic reports more than 50 million separately trained networks with a total of 150 trillion parameters for its "Large Geospatial Model". This illustrates the scale - and shows why it is harder to model the world than to model text.
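A back-of-the-envelope calculation shows what those figures imply: spread over more than 50 million networks, 150 trillion parameters average out to roughly three million parameters per location - many small local models rather than one monolith.

```python
total_parameters = 150e12   # 150 trillion, as reported by Niantic
networks = 50e6             # more than 50 million separately trained networks
print(f"{total_parameters / networks:,.0f} parameters per network on average")
# -> 3,000,000
```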
Simulations are intended to close the data gap. "We invested a decade in Omniverse and physical simulators because we knew Physical AI was coming," says Nvidia's Rev Lebaredian. Simulators generate virtually unlimited data - safely, diversely and faster than reality. But the simulation-to-reality gap remains: what runs stably in the virtual world sometimes fails on real floors, in real light, amid real disturbances.
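One common technique for narrowing that gap is domain randomization: lighting, friction, sensor noise and object properties are varied during simulated training so a policy cannot overfit to one idealized world. A minimal sketch - the parameter ranges are arbitrary illustrations, not settings from Omniverse or any other real simulator:

```python
import random

def randomized_scene():
    """Sample one simulated training scene with perturbed physics and optics."""
    return {
        "light_intensity": random.uniform(0.3, 1.5),    # dim dusk to bright noon
        "friction": random.uniform(0.2, 1.0),           # slick to rough floor
        "camera_noise_std": random.uniform(0.0, 0.05),  # sensor imperfection
        "object_mass_kg": random.uniform(0.1, 2.0),
    }

# Train across many perturbed worlds instead of one idealized one.
training_scenes = [randomized_scene() for _ in range(10_000)]
print(training_scenes[0])
```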
After years in the laboratory, there is tangible progress - modest but measurable. Three fields stand out:
1. Robotics in the laboratory: In September 2025, Deepmind introduced Gemini Robotics 1.5 - two AI models that let robots plan and act. In demos, they sort garbage according to local recycling rules that they have previously researched online themselves, they pack suitcases based on a weather forecast, and they sort laundry by color without being specifically programmed for each piece of clothing. The technical structure is divided into two parts: one model (Gemini Robotics-ER 1.5) plans in natural language, breaks down tasks and retrieves information; the second (Gemini Robotics 1.5) executes - as a vision-language-action model that controls movements. The division of labor between "brain" and "body" facilitates specialized training (a schematic sketch follows after the third field below). The hard number: success rates in tests lie between 20 and 40 percent. In one experiment, the robot found the recycling rules of the city of San Francisco and sorted accordingly - it was completely correct in only about a third of the cases. This is not enough for unsupervised use. Google is therefore initially opening Gemini Robotics only to selected partners. It is a pilot project, not yet a product.
2. Virtual worlds as a test field: Deepmind's AI model Genie 3 generates interactive 3D worlds from text input - navigable in real time. Scenes maintain visual and physical consistency over minutes: light, object positions and water flow remain coherent. The aim is a generic world model that will serve as a training environment for agents in the future. The limits: after a few minutes the consistency breaks down, the focus is more on simulation than on goal-directed behavior, and computational effort and hardware requirements are high.
Start-ups are also driving development: The young company Decart raised around $100 million in 2024 and is valued at $3.1 billion. The Oasis and Mirage models create and manipulate video in real time - including live streams that can be redesigned using text input. The applications range from games and live entertainment to synthetic training data for autonomous systems. In practice, however, there are artifacts, flickering and high inference costs - technically impressive, but still fragile.
Runway, an American start-up for generative video software that is already used in Hollywood productions, is experimenting with computer game worlds: text input creates environments, characters and branching dialogues, the system maintains a consistent world state, events change scenes, and characters remember past events. To put it into perspective: today Runway primarily supplies production tools (previs, marketing assets, short clips); Game Worlds is an experimental field between content creation and agent training. The potential is great, the practice still rudimentary. The advantage: extensive, interactive behavioral data is created that is valuable for later agents.
3. Geospatial intelligence: Niantic has used its own games to build up an enormous world memory consisting of millions of locations, with millions of scans added every week, including 3D point clouds. This feeds so-called large geospatial models that locate devices with centimeter precision. For AI agents, this promises a robust sense of place - recognizing where you are, what lies around the corner, what the hidden sides of buildings probably look like. The limitation: it is primarily perception and localization, not action intelligence; planning and control are yet to come. In addition, world knowledge ages: cities change, and data maintenance remains laborious. Nevertheless, Niantic is committed to providing this foundation for spatial AI - and is being rewarded by investors for it.
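The "brain"/"body" division of labor mentioned under the first field can be pictured as a two-stage pipeline: a reasoning model decomposes a task into natural-language steps, and a vision-language-action model turns each step into motor commands. The sketch below is a schematic of that idea only; all names are invented and it is not Google's API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str     # natural-language sub-task produced by the planner

def planner(task: str) -> list[Step]:
    """Stand-in for the reasoning model ('brain'): decompose and research."""
    if "sort the garbage" in task:
        return [Step("look up local recycling rules"),
                Step("classify each item against the rules"),
                Step("place each item in the matching bin")]
    return [Step(task)]

def executor(step: Step) -> str:
    """Stand-in for the vision-language-action model ('body'):
    turn one sub-task into a motor command."""
    return f"motor_command(grasp_and_move, target='{step.instruction}')"

for step in planner("sort the garbage according to San Francisco rules"):
    print(executor(step))
```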
But why is it so much harder to develop world models than the now established large language models (LLMs)? One problem already described is low reliability: 20 to 40 percent success on laboratory tasks is far from industrial requirements.
In factories, in traffic or in the home, “five nines” are needed - reliability close to 99.999 percent.
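The size of that gap is easy to quantify: at a 30 percent success rate, roughly 70,000 out of 100,000 attempts fail; at five-nines reliability, about one does.

```python
attempts = 100_000
lab_success_rate = 0.30   # midpoint of the 20 to 40 percent reported above
five_nines = 0.99999      # "five nines" industrial reliability

print(f"Lab-grade failures per {attempts:,} attempts:  {attempts * (1 - lab_success_rate):,.0f}")
print(f"Five-nines failures per {attempts:,} attempts: {attempts * (1 - five_nines):,.0f}")
# -> 70,000 versus 1
```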
Current models do not cover the "long tail" of unforeseen situations. Data and simulation are a further problem: the physical world is more varied than text, real data collection is expensive and risky, and simulation delivers data at scale but remains incomplete. Standardization (such as OpenUSD) helps, but the gap between simulation and reality does not disappear overnight.
And then there are the costs. Training and inference - that is, running an already trained model on new inputs - are extremely computationally intensive. On-device GPUs, cloud connections, latency and energy consumption all add to the bill. Efficiency advances in both software and hardware are needed to turn pilots into products. Safety also remains an issue: mistakes in the physical world have consequences, and the deployment cycle takes longer than in pure software.
In addition, a world model alone is not enough. It needs perception, planning, control and language around it. Whether modular systems ("brain/body") or end-to-end approaches will ultimately prevail remains to be seen. The vision of bringing AI to the entire real economy is plausible - but the path is measured in years, not quarters. Capital is moving ahead; technology is catching up.
And what strategies are the leading companies pursuing? Google and Deepmind are integrating world models into the Gemini family. The leitmotif is "Reason before motion" - a planning model thinks, an execution model acts. The reuse of the same basic model across text, images and robotics promises generality, but access remains selective for the time being. At the same time, Google is investing in generative worlds (Genie 3) and evaluation benchmarks.
AI competitor Meta takes a two-pronged approach, relying on large language models and on Yann LeCun's world-model learning from video (V-JEPA). Organizationally, the top of the AI team has been reshuffled, which underlines the aim of pursuing both paths toward "superintelligence". Meta's advantage lies in devices (AR/VR) and scale; its disadvantage is the lack of its own robotics hardware - the focus is on software agents with an understanding of the world.
The AI chip company Nvidia, for its part, supplies the foundation - chips, Omniverse as a simulation and digital-twin platform, and standards such as OpenUSD. Nvidia's goal is to provide the workbench for "Physical AI" that others build on. The more training and testing takes place in high-quality digital twins, the more central Nvidia's stack becomes.
And then there are OpenAI, Tesla, Amazon and Apple: OpenAI is moving into robotics through investments in humanoid robotics (1X). Tesla is scaling road world models from car data and wants to transfer them to humanoids. Amazon is professionalizing logistics robotics and could integrate world models into devices and warehouses. Apple is AR-focused and committed to standards - a quiet candidate for spatial AI on devices.
In short: nobody wants to be left behind. The strengths are spread out - research (Google, Meta), platform (Nvidia), product ecosystems (Apple, Amazon), data (Tesla). The field is open enough that combinations will count: language models that can see and act, and world models that can explain.
The big promises lie in manufacturing, logistics, mobility, healthcare and the household sector. Before it gets there, AI is maturing in forgiving domains: in video games and movies (generative worlds and characters), in simulation (digital twins that pre-optimize factories), in semi-controlled robotics use cases (inspection drones, service robots in structured environments). A clear development is emerging: games serve as a bridging technology. Interactive worlds create dense behavioral data that trains memory, planning, and adaptation - exactly the skills real-world agents need. At the same time, the tool ecosystem is becoming more professional. Simulators are getting better, 3D formats are becoming standardized, and inference is becoming more efficient. Only when success rates noticeably increase and costs fall does the technology leave the sandbox. Until then: deliver where mistakes can be tolerated. [1]
[1] Marcus Schuler: "Welt-Modelle sind die nächste Front der KI". Frankfurter Allgemeine Zeitung, Frankfurt, 6 October 2025, p. 18.