RT-2: Google’s New Breakthrough Explained

Aug 10, 2023

The Rise of Robots That Can Understand and Act

The field of robotics has made tremendous strides in recent years. Scientists have gone from building robots that perform simple factory tasks to creating ones that can perceive and navigate the world around them. Now, researchers at Google DeepMind have developed a model that takes robot intelligence a step further. Their latest creation, called Robotic Transformer 2 (RT-2), can understand language, see its environment, and act accordingly. This fusion of vision, language, and action is an exciting milestone on the road to more capable, general-purpose robots.

The Journey to RT-2

RT-2 builds on DeepMind’s previous work with Robotic Transformer 1 (RT-1). That model was trained on data collected over 17 months by having robots repeatedly perform tasks in a kitchen environment. This allowed RT-1 to learn combinations of objects and skills, so that it could follow new instructions involving familiar items. However, it struggled when it encountered new objects or situations outside its training data.

To improve on this, DeepMind built on large vision-language models (VLMs) such as PaLM-E and PaLI-X. These models are trained on massive datasets scraped from the internet, allowing them to recognize objects, understand natural…

Written by Cloud & Data Science