One of the biggest hurdles in teaching robots new skills is converting complex, high-dimensional data, such as images from onboard RGB cameras, into actions that achieve specific goals. Existing methods typically rely on 3D representations that require precise depth information, or on hierarchical predictions combined with motion planners or separate policies.
Researchers from Imperial College London and the Dyson Robot Learning Lab have unveiled a novel approach that could address this problem. The “Render and Diffuse” (R&D) method aims to bridge the gap between high-dimensional observations and low-level robotic actions, especially when demonstration data are scarce.
R&D, detailed in a paper posted to the arXiv preprint server, tackles the problem by using virtual renderings of a 3D model of the robot. By representing low-level actions within the observation space, the researchers were able to simplify the learning process.
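To make the idea of “representing actions within the observation space” concrete, here is a minimal, hypothetical sketch. Instead of rendering the robot’s full 3D model as the paper does, it simply projects a few 3D gripper keypoints, posed at a candidate action, into the camera image with a pinhole model; every function and parameter name is an illustrative assumption, not the authors’ code.

```python
# Assumed, simplified stand-in for rendering a candidate action into the observation.
import numpy as np

def render_action_in_image(rgb_image: np.ndarray,
                           gripper_keypoints_world: np.ndarray,  # (N, 3) keypoints posed at the candidate action
                           camera_extrinsics: np.ndarray,        # (4, 4) world -> camera transform
                           camera_intrinsics: np.ndarray) -> np.ndarray:  # (3, 3) pinhole matrix K
    """Project a candidate action (a few 3D gripper keypoints) into the RGB
    observation and mark the resulting pixels, so that action and observation
    live in the same image space."""
    image = rgb_image.copy()
    h, w = image.shape[:2]

    # Transform the keypoints into the camera frame.
    ones = np.ones((gripper_keypoints_world.shape[0], 1))
    pts_cam = (camera_extrinsics @ np.hstack([gripper_keypoints_world, ones]).T).T[:, :3]

    # Pinhole projection onto the image plane.
    pts_img = (camera_intrinsics @ pts_cam.T).T
    uv = pts_img[:, :2] / pts_img[:, 2:3]

    # Mark each projected keypoint with a small green square.
    for u, v in uv.astype(int):
        if 1 <= u < w - 1 and 1 <= v < h - 1:
            image[v - 1:v + 2, u - 1:u + 2] = (0, 255, 0)
    return image
```

A full implementation would render the whole robot mesh from the real camera’s viewpoint, but the principle is the same: the action becomes something the network can literally see in the image.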
Imagining their actions within an image.
One of the tasks the researchers applied the technique to was getting a robot to do something that human males supposedly find impossible, at least according to females: putting the toilet seat down. The challenge is to take a high-dimensional observation (seeing that the seat is up) and turn it into the low-level robotic actions that lower it.
As Tech Xplore explains: “Unlike most robotic systems, humans learning new manual skills do not perform extensive calculations to work out how far they should move their limbs. Instead, they typically try to imagine how their hands should move to tackle a specific task effectively.”
Vitalis Vosylius, a final-year Ph.D. student at Imperial College London and lead author of the paper, said: “Our method, Render and Diffuse, allows robots to do something similar: ‘imagine’ their actions within the image using virtual renders of their own embodiment. Representing robot actions and observations together as RGB images allows us to teach robots various tasks with fewer demonstrations, and to do so with improved spatial generalization capabilities.”
A key component of R&D is its learned diffusion process. This iteratively refines the virtual renderings, updating the robot's configuration until the actions closely align with those seen in the training data.
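The loop below is a heavily simplified, assumed sketch of that idea: starting from random actions, each iteration renders the current guess back into the observation and asks a learned denoiser to clean it up. The blending rule stands in for a proper diffusion noise schedule, and `predict_clean_actions` and `render_fn` are hypothetical placeholders rather than the authors’ implementation.

```python
# Assumed sketch of the iterative refinement loop; not the authors' implementation.
import numpy as np

def refine_actions(rgb_image: np.ndarray,
                   predict_clean_actions,   # learned denoiser: (rendered_image, noisy_actions, step) -> actions
                   render_fn,               # renders candidate actions into the image, e.g. a wrapper
                                            # around render_action_in_image from the sketch above
                   action_dim: int = 7,     # e.g. 6-DoF end-effector pose + gripper state
                   num_steps: int = 10,
                   seed: int = 0) -> np.ndarray:
    """Iteratively refine an action estimate, always reasoning about it inside the image."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal(action_dim)  # start from pure noise

    for step in reversed(range(num_steps)):
        # Represent the current action guess in the observation space.
        rendered = render_fn(rgb_image, actions)

        # The learned model predicts a cleaner action estimate from the rendered image.
        predicted = predict_clean_actions(rendered, actions, step)

        # Move towards the prediction; a real diffusion model would apply
        # the update rule dictated by its noise schedule instead.
        alpha = (num_steps - step) / num_steps
        actions = (1.0 - alpha) * actions + alpha * predicted

    return actions
```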
The researchers conducted extensive evaluations, testing several R&D variants in simulated environments and on six real-world tasks, including taking the lid off a saucepan, placing a phone on a charging base, opening a box, and sliding a block toward a target. The results were promising, and as the research progresses, the approach could become a cornerstone of developing smarter, more adaptable robots for everyday tasks.
“The ability to represent robot actions in images opens up interesting possibilities for future research,” Vosylius said. “I'm particularly excited about combining this approach with powerful image-based models trained on massive internet data. This could allow robots to take advantage of the general knowledge captured by these models while also being able to reason about low-level robot actions.”