Toolkit for Vision-Based Robot Learning Research

What are the essential tools? Connects back to the colab tutorial.

Methods:
- reinforcement learning
- imitation learning
- self-supervised learning
- just prompting a large language model to solve everything

Tools:
- understanding real-robot challenges
- collecting data
- environments and tasks
- parameterizing observations and actions

Methods end up in titles and abstracts, and exhaustively in methods sections; but tools explain why these things matter, explain how the idea came together, and empower [something].

Not so uncommon:
- a short 10-100 line published algorithm
- but everything else is 1k to 10k lines of code; lots of this is in the colab

Everything else: evaluation, training, simulation, data ...

Papers can look like a ton of effort went into running experiments; but with an evaluation harness set up, it can be simple and quick.

Toolkit for vision-based robot learning:
- Real robot opportunities + limitations
- "Big picture challenges" in robotics
- Environments and tasks
- [other things i missed....]

Real robots: the final metric; access is $-limited.
Simulated robots: high throughput; anyone can contribute.

It can seem hard to do something with a robot, so it's good to pick a task with a significant return.

What can robots easily do?
- Check videos: things that are easy to make robots do have been recorded; things that are hard to make them do are missing from the videos. Regularities in the setup and environment probably reveal limitations of the robot being recorded.
- Robot competitions also show this.
- The robotics industry has accomplished surprising things.
- [missed]

Big Picture Challenges (make your own list)
1. It can seem hard to get tons of data.
2. #1 seems worse across the embodiment x task-specification space: unlike, say, image recognition, robots and robot situations are very diverse.
3. It is hard to evaluate, in the real world, how something will behave over many trials.

One approach:
1. Motivate from real-world opportunities + limitations.
2. Make a relevant simulation; compare all the ideas there.
3. Validate and compare a small subset.
[The usual approach is to evaluate with things not present during development; this helps find more issues.]
.. [missed some things]

Robust and mature evaluation protocols are missing from robotics; other domains have these; contributions are requested.

Metrics
- success -> how often does it work? especially for new, or not previously possible, things
- proxies for success: MSE dynamics-prediction error, PCK@5 keypoint localization
- hard to capture qualitative behaviors in quantitative metrics -> videos are helpful

People argue over whether to have more videos or more tables. Both are needed.
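To make the "evaluation harness" and metrics points concrete, here is a minimal sketch of a success-rate harness plus a PCK@5-style proxy metric. The environment/policy interface, the `success` info key, and the thresholds are hypothetical placeholders I'm assuming, not anything specified in the talk.

```python
# Minimal evaluation-harness sketch (env/policy interface is a hypothetical assumption).
import numpy as np

def pck_at_k(pred_keypoints, true_keypoints, k_pixels=5):
    """Proxy metric: fraction of predicted keypoints within k pixels of ground truth."""
    dists = np.linalg.norm(pred_keypoints - true_keypoints, axis=-1)
    return float((dists <= k_pixels).mean())

def evaluate(env, policy, num_episodes=50, max_steps=200, seed=0):
    """Success rate over repeated trials: the 'how often does it work?' number."""
    successes = 0
    for ep in range(num_episodes):
        obs = env.reset(seed=seed + ep)            # vary the seed so trials differ
        for _ in range(max_steps):
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            if done:
                break
        successes += int(info.get("success", False))  # assumes the env reports task success
    return successes / num_episodes
```

With a harness like this checked in, "running all the experiments" mostly reduces to looping over policies and seeds.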
Environment and Task Selection

Both an art and a science. Good tasks?

A recent task is "hanging a mug on a rack":
- visually clear
- satisfying whether or not success happened (when success is not satisfying, it discourages people: visitors, researchers, investors)
- requires precision
- requires 6-DoF control (six degrees of freedom: 3 translational + 3 rotational)
- single rigid object, rigid target
- lends itself well to class-level generalization

"Sorting blue blocks from yellow blocks":
- visually clear and satisfying
- a handful of objects, not just one, all with desired configurations
- ordering doesn't matter -> a large amount of multi-modality
- table-top pushing with a single point of contact (rather than grabbing) -> feedback

Standard/benchmark tasks are helpful:
- D4RL (Adroit door example), Ravens (kitting example), RLBench, MetaWorld, Robomimic, ...
- Dig through the sim code for standard tasks.
- Watch videos to see what's qualitatively happening (don't just look at the numbers).

Task-conditioning:
- pose-conditioned
- image-conditioned
- "task id" conditioned
- demonstrated goal configurations
- language-conditioned

Thinking of large pretrained models like DALL-E, a goal is diverse natural-language-conditioned tasks. CLIPort's task steps move toward this.

Advice
- Understand what's already out there before building your own thing.
- Learn a simulation framework (or 5, or 10); be able to set up your own (try the colab).
- Make a learning agent in sim match what is available in the real world (don't let it cheat).
- Can be useful: visually clear + satisfying.
- Can be useful: minimal complexity to test the skills of interest.
- Might be necessary? diversity.
- [appreciate the work that goes into high-quality sim environments?]

Data is collected in an environment, under a policy.
- The environment may or may not differ from the evaluation environment.
- The policy: temporally sequential trajectories of observations + actions + rewards/labels/etc. Task-specific? On which task? Task-agnostic "play"? How was it provided? Is it scripted? A human demonstration? Somebody else's learned policy?

Observation and Action Representation (a rough parameterization sketch follows the asynchrony note below)
- agnostic to the choice of model
- at the level of the robot
- [...]

Observation Representation
- images
- robot measurements: joints, cartesian states, force, torque, tactile sensors, sound ...
- language
- multimodal
- [...]

Action Representation

What it is:
1. "Move until done" -- a pixel as the action, a pose in space; practical and common nowadays.
2. Continuous (or high-rate discrete in time) -- joints, pose; likely to become more popular down the road, though it can seem harder.

How it's represented:
- discrete vs continuous
- classification can make learning easier; discrete spaces
- in the real world, spaces are continuous
[...]

Back to the Real World

Everything is asynchronous. Sensors and updates: the world no longer has a constant update rate, but is made of physical components that update with delays that can be random.
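On the asynchrony point above, one common pattern (sketched here as an illustration, not something prescribed in the talk) is to timestamp every sensor reading and have the control loop consume the latest value, checking staleness rather than assuming a fixed update rate.

```python
# Sketch of timestamped, asynchronous sensor buffering (a common pattern, not from the talk).
import threading
import time

class LatestValueBuffer:
    """Each sensor writes at its own (possibly jittery) rate; readers take the newest sample."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._stamp = None

    def write(self, value):
        with self._lock:
            self._value, self._stamp = value, time.monotonic()

    def read(self):
        with self._lock:
            return self._value, self._stamp  # caller checks staleness via the timestamp

camera = LatestValueBuffer()

def control_step(policy, max_staleness_s=0.2):
    image, stamp = camera.read()
    if image is None or time.monotonic() - stamp > max_staleness_s:
        return None  # skip acting on missing or stale data
    return policy(image)
```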
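Tying together the observation/action parameterization notes above, here is a rough sketch of a multimodal observation dict and the two action styles ("move until done" vs continuous control). The keys, shapes, and clipping bounds are illustrative assumptions, not anything specified in the talk.

```python
# Illustrative observation/action parameterization sketch (names and shapes are assumptions).
import numpy as np

# Observation: a multimodal dict at the level of the robot, agnostic to the downstream model.
observation = {
    "image": np.zeros((224, 224, 3), dtype=np.uint8),     # camera frame
    "joint_positions": np.zeros(7, dtype=np.float32),      # proprioception
    "wrist_force_torque": np.zeros(6, dtype=np.float32),   # force/torque sensing
    "instruction": "hang the mug on the rack",              # language conditioning
}

# Action style 1: "move until done" -- classify a pixel (plus a rotation bin) as the target pose.
def discrete_pixel_action(logits_hw_theta):
    """logits over (H, W, num_rotations); argmax picks where/how to place the end effector."""
    h, w, theta = np.unravel_index(np.argmax(logits_hw_theta), logits_hw_theta.shape)
    return {"pixel": (int(h), int(w)), "rotation_bin": int(theta)}

# Action style 2: continuous, high-rate control -- e.g. a small delta pose per step.
def continuous_action(policy_output):
    """policy_output: 6-vector interpreted as a delta end-effector pose (xyz + rpy)."""
    return np.clip(policy_output, -0.05, 0.05)  # keep each step small and safe
```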
These all connect with the colab:
- Real robot opportunities + limitations
- "Big picture challenges"
- Environments and tasks
- Data
- Parameterizing observations/actions
- Back to Real Robots

---------------

Michael S Ryoo: Recipes from a Vision Researcher

Computer vision research task: object classification, detection ... [i did something else for most of this]

Reinforcement learning.

Self-supervised learning:
- mask part of the data, predict the rest
- predict mutations, like image rotation

Self-supervised + reinforcement learning:
- CURL
- soft actor-critic, one model for both
- [...]

Self-supervised losses:
- rotation, shuffling, reconstruction, RL context prediction
- some of these actually worsen performance
- carefully designed image augmentation helps
- an evolutionary algorithm improved ...

-----------------

No Training: Using Pretrained Zero-shot Models

The abstract for this presentation was generated using GPT-3's zero-shot capabilities.

Finetuning:
- can also be done by adding a linear layer to an existing model

Zero-shot:
- no training needed
- has downsides, but works surprisingly well

Works because of:
- _tons_ of data
- _huge_ models

Current big successes:
- language processing
- image classification

Language modeling
- predict the next word in a sentence
- results in reuse for most language tasks by providing prompts

CLIP
- images are aligned with captions
- expressions are considered similar if they caption similar images
- can then be used to classify images by how well they compare to phrase labels (see the classification sketch at the end of these notes)

Robots
- data is lacking, but developing rapidly
- existing pretrained models such as GPT and CLIP are adapted [finetuned or zero-shot]

CLIPort [skipped to the demo since the talk was behind time and CLIPort was described by others]: robots successfully "close a loop on a piece of string" or "pick up cherries" in response to instructions.

Finetuning can have drawbacks: losses in the original capabilities of the model.

Language models as planners: proper prompting gives high-level plans (see the loop sketch at the end of these notes). Prompt: "...Browse the internet. Step 1:" A huge language model yields "Walk to the home office." BERT is used to adjust this to instructions that are actually relevant. [missed part, maybe it selects from options] [i think the step was then appended to the prompt, and the cycle repeated] [...]

CLIP on Wheels (CoW) (Gadre et al., 2022)
- zero-shot models are used to compose the parts of an agent
- objects can be classified via language-, gradient-, or patch-based methods
- where to search next can be learning-based or frontier-based
- choice of explore or exploit
- Habitat, RoboTHOR

CLIP's vision and language encoders turn visuals into language. Then a gradient-based relevance method is used to select actions from language, using geometric relevance maps. Explore or exploit is selected based on confidence in the object's identity. No training needed.

Limitations
- biases are unmapped, expected to be big
- not as accurate as finetuning
- mainstream large models are pricey
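A rough sketch of the zero-shot classification idea from the CLIP section above: embed the image and a set of phrase labels, then pick the phrase whose embedding is most similar to the image embedding. The encoder functions and label phrases here are placeholders; any CLIP-style model with image/text encoders would slot in.

```python
# Zero-shot classification sketch (encoder functions are hypothetical stand-ins for a CLIP-style model).
import numpy as np

def classify_zero_shot(image, phrases, image_encoder, text_encoder):
    """Return the phrase whose text embedding is most similar (cosine) to the image embedding."""
    img_emb = image_encoder(image)                            # assumed shape (d,)
    txt_embs = np.stack([text_encoder(p) for p in phrases])   # assumed shape (num_phrases, d)

    img_emb = img_emb / np.linalg.norm(img_emb)
    txt_embs = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    scores = txt_embs @ img_emb                               # cosine similarity per phrase

    return phrases[int(np.argmax(scores))], scores

# Example phrasing often used with CLIP-style models:
# labels = [f"a photo of a {name}" for name in ["mug", "rack", "blue block", "yellow block"]]
```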
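And a sketch of the language-models-as-planners loop described above: prompt for the next step, snap the free-form step to the closest admissible instruction via sentence embeddings (the notes mention BERT being used for this matching), append it to the prompt, and repeat. The `generate`, `embed`, and `admissible_steps` names, the embedding-similarity grounding, and the "done" stopping convention are my assumptions about the mechanism, not a specific API.

```python
# Planner-loop sketch (LLM and embedding calls are hypothetical placeholders).
import numpy as np

def plan(task, generate, embed, admissible_steps, max_steps=10):
    """Iteratively prompt an LLM for steps, grounding each one to the closest admissible instruction."""
    prompt = f"Task: {task}\nStep 1:"
    plan_so_far = []
    step_embs = np.stack([embed(s) for s in admissible_steps])   # embed() assumed to return a vector
    step_embs /= np.linalg.norm(step_embs, axis=1, keepdims=True)

    for i in range(1, max_steps + 1):
        raw_step = generate(prompt)                   # e.g. "Walk to the home office."
        e = embed(raw_step)
        e /= np.linalg.norm(e)
        grounded = admissible_steps[int(np.argmax(step_embs @ e))]  # closest executable instruction
        plan_so_far.append(grounded)
        prompt += f" {grounded}\nStep {i + 1}:"        # append the step and repeat the cycle
        if grounded.lower().startswith("done"):        # assumed stopping convention
            break
    return plan_so_far
```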