"RT-2, our new vision-language-action model, helps robots
more easily understand and perform actions — in both familiar and new situations.
For decades, when people have imagined the distant future,
they’ve almost always included a starring role for robots. Robots have been
cast as dependable, helpful and even charming. Yet across those same decades,
the technology has remained elusive — stuck in the imagined realm of science
fiction.
Today, we’re introducing a new advancement in robotics that
brings us closer to a future of helpful robots. Robotics Transformer 2, or
RT-2, is a first-of-its-kind vision-language-action (VLA) model. A
Transformer-based model trained on text and images from the web, RT-2 can
directly output robotic actions. Just like language models are trained on text
from the web to learn general ideas and concepts, RT-2 transfers knowledge from
web data to inform robot behavior.
In other words, RT-2 can speak robot.
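As a concrete illustration of what "speaking robot" can mean, the sketch below shows one common way to cast actions as text: each dimension of a continuous robot command is discretized into a fixed number of bins, and the bin indices are emitted as ordinary tokens, so the same decoder that produces words can produce motor commands. This is a minimal sketch of the general idea, not the published RT-2 code; the bin count, action dimensions, and value ranges are assumptions chosen for the example.

```python
import numpy as np

# Minimal sketch (assumed values, not RT-2's actual scheme): discretize each
# continuous action dimension into NUM_BINS bins and emit the bin indices as
# text tokens, which a language-model decoder can generate like any other word.

NUM_BINS = 256  # assumed number of bins per action dimension

def action_to_tokens(action, low, high):
    """Map a continuous action vector to a string of integer tokens."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str, low, high):
    """Invert the mapping: decode integer tokens back to a continuous action."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return low + bins / (NUM_BINS - 1) * (high - low)

# Example: a 7-D end-effector command (xyz delta, roll-pitch-yaw delta,
# gripper closure); the value ranges below are illustrative assumptions.
low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
action = np.array([0.02, -0.01, 0.05, 0.0, 0.1, -0.2, 1.0])

tokens = action_to_tokens(action, low, high)
print(tokens)                               # seven integer tokens, one per dimension
print(tokens_to_action(tokens, low, high))  # approximately recovers the action
```

A model trained this way can be sampled like a language model, with its output mapped back to motor commands by the inverse function.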
The real-world challenges of robot learning
The pursuit of helpful robots has always been a herculean
effort, because a robot capable of doing general tasks in the world needs to be
able to handle complex, abstract tasks in highly variable environments — especially
ones it's never seen before.
Unlike chatbots, robots need “grounding” in the real world
and in their own abilities. Their training isn’t just about, say, learning everything
there is to know about an apple: how it grows, its physical properties, or even
that one purportedly landed on Sir Isaac Newton’s head. A robot needs to be
able to recognize an apple in context, distinguish it from a red ball,
understand what it looks like, and most importantly, know how to pick it up.
That’s historically required training robots on billions of
data points, firsthand, across every single object, environment, task and
situation in the physical world — a prospect so time consuming and costly as to
make it impractical for innovators. Learning is a challenging endeavor, and even
more so for robots.
A new approach with RT-2
Recent work has improved robots’ ability to reason, even
enabling them to use chain-of-thought prompting, a way to dissect multi-step
problems. The introduction of vision models, like PaLM-E, helped robots make
better sense of their surroundings. And RT-1 showed that Transformers, known
for their ability to generalize information across systems, could even help
different types of robots learn from each other.
But until now, robots ran on complex stacks of systems, with
high-level reasoning and low-level manipulation systems playing an imperfect
game of telephone to operate the robot. Imagine thinking about what you want to
do, and then having to tell those actions to the rest of your body to get it to
move. RT-2 removes that complexity and enables a single model to not only
perform the complex reasoning seen in foundation models, but also output robot
actions. Most importantly, it shows that with a small amount of robot training
data, the system is able to transfer concepts embedded in its language and
vision training data to direct robot actions — even for tasks it’s never been
trained to do.
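To make the contrast with the older two-stage stack concrete, here is a hypothetical closed-loop sketch in which a single model maps the current camera image and a natural-language instruction directly to a low-level command, with no separate high-level planner handing sub-goals to a low-level controller. The model, camera, and arm classes are placeholders invented for illustration, not RT-2's actual interfaces.

```python
import numpy as np

class DummyVLAModel:
    """Placeholder for an RT-2-style model; a real one would decode action
    tokens (as in the earlier sketch) from a Transformer's output."""
    def act(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Returns a 7-D command: xyz delta, roll-pitch-yaw delta, gripper closure.
        return np.array([0.02, 0.0, -0.03, 0.0, 0.0, 0.1, 1.0])

class DummyCamera:
    def capture(self) -> np.ndarray:
        return np.zeros((224, 224, 3), dtype=np.uint8)  # blank RGB frame

class DummyArm:
    def apply(self, action: np.ndarray) -> None:
        print("applying action:", action)

def control_loop(model, camera, arm, instruction: str, steps: int = 3) -> None:
    # The same network carries both the "what" (recognizing trash) and the
    # "how" (closing the gripper), emitting one action per control step.
    for _ in range(steps):
        image = camera.capture()
        arm.apply(model.act(image, instruction))

control_loop(DummyVLAModel(), DummyCamera(), DummyArm(),
             instruction="pick up the empty chip bag and throw it away")
```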
For example, if you wanted previous systems to be able to
throw away a piece of trash, you would have to explicitly train them to be able
to identify trash, as well as pick it up and throw it away. Because RT-2 is
able to transfer knowledge from a large corpus of web data, it already has an
idea of what trash is and can identify it without explicit training. It even
has an idea of how to throw away the trash, even though it’s never been trained
to take that action. And think about the abstract nature of trash — what was a
bag of chips or a banana peel becomes trash after you eat them. RT-2 is able to
make sense of that from its vision-language training data and do the job.
A brighter future for robotics
RT-2’s ability to transfer information to actions shows
promise for robots to more rapidly adapt to novel situations and environments.
In testing RT-2 models in more than 6,000 robotic trials, the team found that
RT-2 functioned as well as our previous model, RT-1, on tasks in its training
data, or “seen” tasks, and nearly doubled performance on novel, unseen
scenarios, improving from RT-1’s 32% to 62%.
In other words, with RT-2, robots are able to learn more
like we do — transferring learned concepts to new situations.
Not only does RT-2 show how advances in AI are cascading
rapidly into robotics, it shows enormous promise for more general-purpose
robots. While there is still a tremendous amount of work to be done to enable
helpful robots in human-centered environments, RT-2 shows us an exciting future
for robotics just within grasp." [1]
1. Robotic Transformer 2 (RT-2): The Vision-Language-Action Model. kyegomez/RT-2. https://github.com/kyegomez/RT-2