Abstract
- We present a framework for building interactive, real-time, natural-language-instructible robots in the real world, and we open-source the related assets (dataset, environment, benchmark, and policies).
- Trained with behavioral cloning on a dataset of hundreds of thousands of language-annotated trajectories, the resulting policy can proficiently execute an order of magnitude more commands than prior work: specifically, we estimate a 93.5% success rate on a set of 87,000 unique natural language strings specifying raw end-to-end visuo-linguo-motor skills in the real world.
- We find that the same policy can be guided by a human via real-time language to accomplish a wide range of precise, long-horizon rearrangement goals, e.g., “Make a smiley face out of blocks”.
- The dataset we release comprises nearly 600,000 language-labeled trajectories, an order of magnitude larger than prior available datasets.
- We hope the demonstrated results and associated assets enable further advancement of helpful, capable, natural-language-interactable robots.
