Introduction
The vision of multitasking robots, like the iconic Rosie from "The Jetsons," has been a long-standing aspiration in robotics. However, training general-purpose robots remains a significant challenge. Traditionally, engineers collect data specific to one robot and one task in controlled environments, a process that is not only expensive and time-consuming but also leaves the robot ill-equipped to adapt to tasks or environments it did not encounter during training.
Drawing Inspiration from Large-Scale Language Models
MIT researchers have proposed a novel approach to overcome these challenges, drawing inspiration from large-scale language models like GPT-4. These models are pre-trained with massive amounts of diverse data and then fine-tuned with a small amount of task-specific data. This strategy allows the model to adapt and perform well on a variety of tasks, thanks to the vast knowledge gained during pre-training.
In robotics, data is highly heterogeneous, ranging from camera images to proprioceptive signals that track the position and velocity of a robotic arm. Furthermore, each robot has unique mechanical characteristics, such as a different number of arms, grippers, and sensors, and the environments where data is collected also vary widely. To handle this heterogeneity, an architecture is needed that can unify these diverse data types into a single format that one model can process.
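To make that heterogeneity concrete, the minimal Python sketch below shows what raw observations from two different robots might look like before any unification. The field names, image sizes, and joint counts are illustrative assumptions, not values from the MIT paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One timestep of raw robot data; shapes differ from robot to robot."""
    images: dict                 # camera name -> H x W x 3 uint8 array
    proprioception: np.ndarray   # joint positions/velocities, length varies

# A single-arm robot with one wrist camera and 7 joints (hypothetical example)
obs_arm = Observation(
    images={"wrist": np.zeros((224, 224, 3), dtype=np.uint8)},
    proprioception=np.zeros(14),   # 7 joint angles + 7 joint velocities
)

# A two-armed robot with two external cameras and more joints (hypothetical example)
obs_bimanual = Observation(
    images={
        "left": np.zeros((480, 640, 3), dtype=np.uint8),
        "right": np.zeros((480, 640, 3), dtype=np.uint8),
    },
    proprioception=np.zeros(32),   # two arms plus grippers
)
```

No single fixed-size input layer can consume both observations directly, which is exactly the gap the HPT architecture is designed to close.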
The Heterogeneous Pretrained Transformers (HPT) Architecture
The MIT team developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from multiple modalities and domains. At the heart of this architecture is a machine learning model known as a transformer, the same type that forms the basis of large language models.
The researchers aligned vision and proprioception data (the robot's internal sense of its own body: the position and movement of its joints, its posture, and its balance) into a common input type, called a "token," that the transformer can process. Each input is represented with the same fixed number of tokens, allowing the model to process information from different sources uniformly. The transformer then maps all inputs into a shared space, growing into a massive pre-trained model as it processes and learns from more data. The larger the transformer becomes, the better its performance.
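The sketch below illustrates this idea in PyTorch under stated assumptions: the module names, token counts, and layer sizes are hypothetical, not the authors' released implementation. Each modality passes through its own small "stem" that emits a fixed number of tokens, and a shared transformer trunk processes the concatenated tokens regardless of which robot produced them.

```python
import torch
import torch.nn as nn

class ProprioStem(nn.Module):
    """Projects a raw proprioceptive vector into a fixed number of tokens."""
    def __init__(self, state_dim: int, d_model: int = 256, n_tokens: int = 16):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.proj = nn.Linear(state_dim, n_tokens * d_model)

    def forward(self, state: torch.Tensor) -> torch.Tensor:      # (B, state_dim)
        return self.proj(state).view(-1, self.n_tokens, self.d_model)

class VisionStem(nn.Module):
    """Turns a 224x224 image into the same fixed number of tokens via patch embedding."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        # 224 / 56 = 4 patches per side -> 16 tokens, matching the proprio stem
        self.patchify = nn.Conv2d(3, d_model, kernel_size=56, stride=56)

    def forward(self, image: torch.Tensor) -> torch.Tensor:      # (B, 3, 224, 224)
        return self.patchify(image).flatten(2).transpose(1, 2)   # (B, 16, d_model)

class SharedTrunk(nn.Module):
    """Embodiment-agnostic transformer that consumes the concatenated tokens."""
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, vision_tokens, proprio_tokens):
        return self.encoder(torch.cat([vision_tokens, proprio_tokens], dim=1))
```

In this scheme each robot gets its own lightweight stems, while the trunk is shared and pre-trained across all of them; the fixed token budget is what lets the trunk treat every robot's data the same way.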
Advantages and Performance of HPT
One of the main advantages of this approach is that a user only needs to provide a small amount of data about the robot's design, configuration, and intended task. HPT transfers the knowledge gained during pre-training to learning the new task. This makes the training process faster and less expensive, as it requires much less task-specific data.
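Conceptually, adapting the pre-trained model to a new robot might look like the behavior-cloning sketch below. This is a hedged illustration, not the authors' fine-tuning recipe: the function name, the loss, and the choice to freeze the trunk are assumptions made for clarity, and the trunk is treated as a generic tokens-in, tokens-out module.

```python
import torch
import torch.nn as nn

def finetune_for_new_robot(trunk: nn.Module, new_stem: nn.Module, new_head: nn.Module,
                           demos, epochs: int = 10) -> None:
    """Adapt a pre-trained shared trunk to a new robot using a small demo dataset.

    `demos` yields (raw_observation, expert_action) tensor pairs; all names here
    are illustrative rather than the paper's API.
    """
    for p in trunk.parameters():
        p.requires_grad = False                       # keep pre-trained knowledge frozen

    optimizer = torch.optim.AdamW(
        list(new_stem.parameters()) + list(new_head.parameters()), lr=1e-4)
    loss_fn = nn.MSELoss()                            # behavior cloning on continuous actions

    for _ in range(epochs):
        for raw_obs, expert_action in demos:          # a few hundred demos, not millions
            tokens = new_stem(raw_obs)                # robot-specific tokens, fixed count
            features = trunk(tokens)                  # shared representation from pre-training
            pred_action = new_head(features.mean(dim=1))   # pool tokens, predict an action
            loss = loss_fn(pred_action, expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Because only the small robot-specific stem and head are trained, the amount of new data and compute needed per robot stays modest, which is the practical payoff the researchers describe.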
When tested, HPT improved robot performance by over 20% in both simulated and real-world tasks, compared to training from scratch each time. Even when the task was very different from the pre-training data, HPT still showed significant improvements. This indicates a remarkable generalization capability, crucial for robots that need to operate in unpredictable environments or perform previously unprogrammed tasks.
Challenges Faced
One of the biggest challenges in creating HPT was building the massive dataset needed to pre-train the transformer, which comprised 52 datasets with more than 200,000 robot trajectories across four categories, including videos of human demonstrations and simulations. The researchers also needed to develop an efficient way to transform raw proprioceptive signals from a variety of sensors into data the transformer could process.
"Proprioception is crucial for enabling many right-handed movements," explains Lirui Wang, lead author of the study. "Because the number of tokens is always the same in our architecture, we give equal importance to proprioception and vision."
The Future of Robotics with HPT
In the future, the researchers plan to study how data diversity can further improve HPT's performance. They also hope to enhance HPT so that it can process unlabeled data, following in the footsteps of large-scale language models like GPT-4. This could lead to a system where the robot continuously learns from new experiences, without the need for constant human intervention to label data.
"Our dream is to have a universal robotic brain that you can download and use on your robot without any training," says Wang. "While we're still in the early stages, we'll keep pushing forward and hope that scalability will lead to breakthroughs in robotics policy, just as it did with large language models."
Conclusion
The MIT research represents a significant advance in the quest for efficient and adaptable general-purpose robots. By combining large amounts of heterogeneous data into a unified architecture, the researchers have paved the way for robots that can learn a variety of tasks without requiring extensive training for each new situation. This approach has the potential to revolutionize robotics, enabling the development of more versatile robots capable of adapting to unfamiliar environments and tasks, bringing us ever closer to the vision of robots like Rosie from The Jetsons.
References
This work was funded in part by the Amazon Greater Boston Tech Initiative and the Toyota Research Institute. The research was presented at the Conference on Neural Information Processing Systems (NeurIPS) and is described in full in the paper "Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers."