Google DeepMind's robotics team is teaching robots to learn the way a human intern would: by watching a video. The team has published a new paper demonstrating how Google's RT-2 robots, with the Gemini 1.5 Pro generative AI model built in, can absorb information from video to learn how to navigate a space and even carry out requests once they arrive.
Thanks to Gemini 1.5 Pro's long context window, which lets the AI process large amounts of information at once, it's possible to train a robot much as you would a new intern. Researchers film a video tour of a designated area, such as a home or office; the robot then watches the footage and learns the layout of the environment.
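DeepMind hasn't released the robot's software, but the general pattern it describes, uploading a long video and then asking a multimodal model questions about it, can be sketched with Google's public google-generativeai Python SDK. This is a minimal sketch under stated assumptions: the file name, model string, and prompt wording below are illustrative, not DeepMind's actual pipeline.

```python
# Minimal sketch: feed a video walkthrough to Gemini 1.5 Pro and ask a
# navigation question. Uses Google's public google-generativeai SDK; the
# video file and prompt are illustrative assumptions, not DeepMind's code.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; supply a real key

# Upload the recorded walkthrough. Video files are processed
# asynchronously, so poll until the upload is ready to use in a prompt.
tour = genai.upload_file(path="office_tour.mp4")  # hypothetical filename
while tour.state.name == "PROCESSING":
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# The 1 million-token context window lets the whole tour plus the user's
# instruction fit in a single prompt.
response = model.generate_content([
    tour,
    "You are guiding a robot through the space shown in this walkthrough. "
    "Where is the nearest whiteboard, and how would the robot reach it "
    "from the entrance?",
])
print(response.text)
```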
The detail captured in those walkthroughs allows the robot to complete tasks based on the knowledge it has acquired, responding to both verbal and visual cues. It's an impressive demonstration of robots interacting with their environment in ways reminiscent of human behavior. You can see how it works in the video below, along with examples of the different tasks the robot can carry out.
"The limited context length makes it challenging for many AI models to remember environments. 🌐 With 1.5 Pro's 1 million-token context length, our robots can use human instructions, video walkthroughs, and common-sense reasoning to successfully find their way through a space." pic.twitter.com/eIQbtjHCbW (July 11, 2024)
Putting robotic AI to the test
These demonstrations aren't rare flukes, either. In practical tests, Gemini-powered robots operated across a 9,000-square-foot area and successfully followed more than 50 different user instructions with a 90 percent success rate. That level of reliability opens up many real-world possibilities for AI-powered robots, from helping with chores at home to handling menial, and eventually more complex, tasks at work.
That's because one of the most notable aspects of the Gemini 1.5 Pro model is its ability to complete multi-step tasks. DeepMind's research found that a robot can work out how to answer a question such as whether a specific drink is available by navigating to the refrigerator, visually checking what's inside, and then returning to report the answer.
Planning and executing an entire sequence of actions like that demonstrates a level of understanding well beyond the single-step commands most robots are limited to today.
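To make that sequence concrete, here is a hypothetical Python sketch of the navigate, perceive, report loop. The Robot and VisionLanguageModel classes are invented stand-ins, not a real robot API; only the three-step structure follows DeepMind's description.

```python
# Hypothetical sketch of the navigate -> perceive -> report sequence.
# Robot and VisionLanguageModel are stand-ins invented for illustration;
# only the three-step structure follows DeepMind's description.

class Robot:
    """Hypothetical robot control interface."""

    def navigate_to(self, landmark: str) -> None:
        print(f"[robot] navigating to {landmark}")

    def capture_image(self) -> bytes:
        print("[robot] capturing image")
        return b"<jpeg bytes>"


class VisionLanguageModel:
    """Hypothetical wrapper around a multimodal model such as Gemini 1.5 Pro."""

    def ask(self, image: bytes, question: str) -> str:
        return "yes"  # stubbed answer for the sketch


def check_for_drink(robot: Robot, vlm: VisionLanguageModel, drink: str) -> str:
    robot.navigate_to("refrigerator")      # step 1: go where the answer is
    photo = robot.capture_image()          # step 2: look inside
    verdict = vlm.ask(photo, f"Is there a {drink} in this fridge? Yes or no.")
    robot.navigate_to("user")              # step 3: return and report
    return f"{drink} available: {verdict}"


print(check_for_drink(Robot(), VisionLanguageModel(), "Coke"))
```

The key point from the research is that the model plans this sequence itself from the user's question, rather than following a hand-coded script.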
Don't expect to see this robot on sale anytime soon, though. For one thing, it takes up to 30 seconds to process each instruction, which in most cases is far slower than just doing the task yourself. And the chaos of real-world homes and offices will be much harder for a robot to handle than a controlled environment, no matter how advanced the AI model is.
Still, integrating AI models like Gemini 1.5 Pro into robotics marks a major step forward for the field. Robots equipped with Gemini or its rivals could transform healthcare, transportation, and even cleaning tasks.