[data science] CVPR 2022: Robot Learning Tutorial (Vision, Text)
Pick-and-place robotics is the primary thing missing for makerspaces to be able to provide systems that automatically produce prototype circuit boards. [for me and likely others on this list, building the robotics and establishing relationships with the makerspaces is the hard part there; this tutorial is on the software end] The simple tutorial moves blocks according to human language.

https://sites.google.com/view/cvpr2022-robot-learning

Conference on Computer Vision and Pattern Recognition 2022
Tutorial on Vision-based Robot Learning

When: June 20th, 9am to 12pm.
Where: Both in-person and online.

From:
- FlingBot
  - site: https://flingbot.cs.columbia.edu/
  - paper: https://arxiv.org/abs/2105.03655
  - code: https://github.com/columbia-ai-robotics/flingbot
- Implicit Behavioral Cloning
  - site: https://implicitbc.github.io/
  - paper: https://arxiv.org/abs/2109.00137
  - code: https://github.com/google-research/ibc
- NeRF-Supervision
  - site: http://yenchenlin.me/nerf-supervision/
  - paper: https://arxiv.org/abs/2203.01913
  - code: https://github.com/yenchenlin/nerf-supervision-public

Objective
(1) Introduce research on Computer Vision for Robotics.
(2) Provide hands-on experience for participants to directly run a simulated robotics environment via a Colab notebook in a browser.

Program / schedule (times are in New Orleans time -- CDT, UTC-05:00)
9:00 Introduction
9:15 Interactive Tutorial (with colab walkthrough!) -- Learning Visuomotor Policies Together with Language and Visual-Language Models
     Part 1: Introduction to models, environments, data, and begin training.
10:00 Coffee Break
10:15 Talk: "All the Good Stuff" -- A Toolkit for Vision-Based Robot Learning
10:45 Talk: "Memoirs from a Vision Researcher" -- What I Wish I Knew, Starting to Work on Robotics: Self-Supervised Learning and Reinforcement Learning
11:15 Talk: "We Don't Even Need to Train Anymore?" -- Zero-Shotting Robotics
11:45 Interactive Tutorial (with colab walkthrough!) -- Learning Visuomotor Policies Together with Language and Visual-Language Models
      Part 2: Deploy policies, interactive, Q&A
12:15 End!

Animated GIFs
Example Colab Tutorial Walkthrough: Interactive Visuomotor Policies with Language Models and Visual-Language Models
https://lh5.googleusercontent.com/UTKjqbzrZ8Oj_iMaKEcFahfdLeKgHNMi7nXc-gLAZs...
FlingBot
https://lh6.googleusercontent.com/S3PSGoE-xNaeTkYqaMVNUREM3Zf3yz_ZweTMbA40Zn...
Implicit Behavioral Cloning
https://lh5.googleusercontent.com/oBhhKBJ5fgRscdr2BRZwHix4YkOGEizkm11Is6QGyC...
NeRF-Supervision
https://lh6.googleusercontent.com/x5KGeK_3CH5jLlXewY9IGT_8DYrRCk1NxWPjTGGZeo...
Intro
Robotics and computer vision have been intertwined from the start. Lawrence Roberts wrote "Blocks World" in 1963, three years before the 1966 "Summer Vision Project". MIT built the Copy Demo in 1970, inspired by "Blocks World".
From the ensuing robot demos, the bottleneck was deemed to be edge detection, which inspired decades of computer vision work on edge detection.
For more: Steve Seitz, "History of 3D Computer Vision", 2011, YouTube.

Colab Tutorial Intro
You could get an entire PhD by adding lines of code to this colab.
socraticmodels.github.io ->
code source: https://github.com/google-research/google-research/tree/master/socraticmodel...
tutorial notebook: https://colab.research.google.com/drive/1jAyhumd7DTxJB2oZufob9crVxETAEKbV
Opens "Socratic Models: Robot Pick & Place".

The colab notebook is not needed: the code can be pasted onto a local machine with a GPU (likely one set up to work with pytorch). The tutorial uses OpenAI and requests that the user register for an API key. It downloads pretrained models but can be configured to train them instead. The GPU load can be shown by running !nvidia-smi ; it's good to do this regularly.

OpenAI's interface to GPT-3 is used to transform human instructions into steps. CLIP and ViLD, freely available pretrained models, are used to interpret the visual scene (i think). Pick-and-place is done via a pretrained model, publicly available for download, called CLIPort / Transporter Nets. This model is further trained. The lecturer opens up each code section and explains it (i am behind, was finding my openai key).

An openai gym-style reinforcement learning environment (a custom Env python class) is used; one was built to express the robot's interaction, though it does not precisely follow the Env spec. The pybullet library used can do its own inverse kinematics (a lot of geometry that lets the user specify where to move the endpoint of a limb without having to solve all the motor angles). They simulate both orthographic and perspective cameras; the two produce different kinds of model bias when trained on. The imagery from the cameras and the trained models is the only information the robot has on the world. The objects are placed randomly. (A minimal sketch of such an environment appears after the Training notes below.)

ViLD
The code is taken from ViLD's examples. ViLD produces a list of objects in the scene, and where they are, although the spatial information is not yet used.

Scripted Expert & Dataset
The scripted expert automatically produces demonstrations of picking and placing objects for the model to learn from. Another approach is to use a reinforcement learning agent. The demonstrations are put into a "dataset". There are also pregenerated datasets to download. If "load_pregenerated" is unchecked, the colab will show videos of the example data moving the simulated arm. To make real data, the code for the section should be modified to have a number greater than 2, such as 9999 or 100000. This takes more time, and trains a better model. During training, the environment is reset to place new objects in different places.
...

Training
Transporter Nets works visually. Heatmaps of likely pick (source) and place (destination) locations are produced, and the motion is discerned by convolving them together (a small sketch of this convolution appears after the notes on running the trained model). An image of this shows all the picking and placing spots that match the user's request. The approach accommodates rotation and translation in response to how an object was grasped (this could also be done analytically with matrices). Tensorboard is used to display training progress over time. To train your own model, a checkbox must be unchecked so that a pretrained model is not loaded. The setup takes quite a bit of memory. The training is conditioned on specific variations the developers configured earlier -- objects, colors, and locations -- and on sentence forms present in the training data: "pick the [] and place it on []".
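Here is a minimal sketch of such a gym-style environment wrapping pybullet and using its built-in inverse kinematics, as referenced above. This is my own illustration, not the colab's class: the KUKA arm URDF, camera pose, reward, and step count are placeholder assumptions.

import gym
import numpy as np
import pybullet as p
import pybullet_data

class PickPlaceEnv(gym.Env):
    """Gym-style wrapper around a pybullet scene (illustrative placeholder)."""

    def __init__(self):
        p.connect(p.DIRECT)                                  # headless physics server
        p.setAdditionalSearchPath(pybullet_data.getDataPath())
        p.setGravity(0, 0, -9.8)
        p.loadURDF("plane.urdf")
        # Stand-in arm shipped with pybullet_data; the colab uses its own robot model.
        self.robot = p.loadURDF("kuka_iiwa/model.urdf", useFixedBase=True)
        self.ee_link = 6                                     # end-effector link for this URDF

    def reset(self):
        # A fuller version would (re)place objects at random poses here.
        return self._get_obs()

    def step(self, target_xyz):
        # Action = desired end-effector position; pybullet solves the joint angles (IK).
        joints = p.calculateInverseKinematics(self.robot, self.ee_link, target_xyz)
        p.setJointMotorControlArray(self.robot, list(range(len(joints))),
                                    p.POSITION_CONTROL, targetPositions=joints)
        for _ in range(100):                                 # let the motors settle
            p.stepSimulation()
        return self._get_obs(), 0.0, False, {}               # reward/done left as placeholders

    def _get_obs(self):
        # Render one camera; the colab renders both orthographic and perspective views.
        view = p.computeViewMatrix([1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
        proj = p.computeProjectionMatrixFOV(60, 1.0, 0.01, 10)
        _, _, rgba, _, _ = p.getCameraImage(224, 224, view, proj)
        return np.reshape(rgba, (224, 224, 4))[:, :, :3]

env = PickPlaceEnv()
obs = env.reset()
obs, reward, done, info = env.step([0.4, 0.0, 0.5])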
The restriction to these sentence forms exists because the system isn't yet using a real language model like GPT-3 to process the instructions. After this training, using the model is as simple as providing a prompt such as "Pick the yellow block and place it on the blue bowl" and running it through the same simulator and model; the colab renders the result visually. I believe the lecturer recommends beginning to train a model now, which can take a few hours, so that the training may be done by the time of the next tutorial chunk, where more models are integrated.
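To make the "convolving them together" step from the Training notes concrete, here is a small PyTorch sketch of the Transporter-style placing idea: features cropped around the chosen pick point are used as a convolution kernel and cross-correlated against the scene's features, and the peak of the response is the place location. The shapes and random stand-in feature maps are assumptions, not the colab's networks; the real model also correlates rotated copies of the crop (to choose a place orientation) and uses separate feature networks for the crop and the scene.

import torch
import torch.nn.functional as F

C, H, W = 8, 64, 64
pick_logits = torch.randn(1, 1, H, W)   # stand-in for the pick network's heatmap
place_feats = torch.randn(1, C, H, W)   # stand-in for the place network's features

# Choose the pick location as the argmax of the pick heatmap.
idx = pick_logits.view(-1).argmax().item()
py, px = divmod(idx, W)

# Crop a feature patch centered on the pick point to use as the kernel.
k = 8                                   # half-size of the crop
padded = F.pad(place_feats, (k, k, k, k))
kernel = padded[:, :, py:py + 2 * k + 1, px:px + 2 * k + 1]   # (1, C, 2k+1, 2k+1)

# Cross-correlate the crop against the full feature map -> place heatmap.
place_logits = F.conv2d(place_feats, kernel, padding=k)        # (1, 1, H, W)
qidx = place_logits.view(-1).argmax().item()
qy, qx = divmod(qidx, place_logits.shape[-1])
print("pick at", (py, px), "place at", (qy, qx))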
Toolkit for Vision-Based Robot Learning Research
What are the essential tools? Connects back to the colab tutorial.

Methods:
- reinforcement learning
- imitation learning
- self-supervised learning
- just prompting a large language model to solve everything

Tools:
- understanding real-robot challenges
- collecting data
- environments and tasks
- parameterizing observations and actions

Methods end up in titles and abstracts, and exhaustively in methods sections; but tools explain why these things matter, explain how the idea came together, and empower [something]. Not so uncommon: a short 10-100 line published algorithm, but everything else is 1k to 10k lines of code (lots of this in the colab). "Everything else": evaluation, training, simulation, data ... Papers can look like a ton of effort went into running experiments; but with an evaluation harness set up, it can be simple and quick.

Toolkit for vision-based robot learning:
- Real robot opportunities + limitations
- "Big picture challenges" in robotics
- Environments and tasks
- [other things i missed....]

Real robots: the final metric, but $-limited access.
Simulated robots: high-throughput, anyone can contribute.

It can seem hard to do something with a robot, so it's good to pick a task with significant return. What can robots easily do?
- Check videos. Things that are easy to make robots do have been recorded; things that are hard to make them do are missing from the videos. Regularities in the setup and environment probably show limitations of the robot under recording.
- Robot competitions also show this.
- The robotics industry has accomplished surprising things.
- [missed]

Big Picture Challenges -- make your own list:
1. It can seem hard to get tons of data.
2. #1 seems worse across the embodiment x task-specification space: unlike, say, image recognition, robots and robot situations are very diverse.
3. It is hard to evaluate, in the real world, how something will behave over many trials.

One approach:
1. Motivate from real-world opportunities + limitations.
2. Make a relevant simulation, compare all the ideas.
3. Validate and compare a small subset.
[the usual approach is to evaluate with things not present in development. helps find more issues.]
.. [missed some things]

Robust and mature evaluation protocols are missing from robotics; other domains have these; contribution requested.

Metrics (a small sketch of these follows below):
- success -> how often does it work? especially for new, or not previously possible, things?
- proxies for success: MSE dynamics prediction error, PCK@5 keypoint localization
- hard to capture qualitative behaviors in quantitative metrics -> videos helpful

People argue over whether to have more videos, or more tables. Both are needed.
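As a concrete illustration of these metrics (my own sketch, not code from the talk), a success rate over repeated evaluation episodes and a PCK@5-style keypoint proxy can be computed like this:

import numpy as np

def success_rate(outcomes):
    """outcomes: list of booleans, one per evaluation episode."""
    return float(np.mean(outcomes))

def pck_at_k(pred_keypoints, true_keypoints, k=5.0):
    """Percentage of Correct Keypoints: fraction of predictions within k pixels
    of the ground truth. Both inputs have shape (N, 2) in pixel coordinates."""
    dists = np.linalg.norm(np.asarray(pred_keypoints, float) - np.asarray(true_keypoints, float), axis=1)
    return float(np.mean(dists <= k))

print(success_rate([True, True, False, True]))                 # 0.75
print(pck_at_k([[10, 10], [40, 42]], [[12, 11], [60, 60]]))    # 0.5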
Environment and Task Selection
Both an art and a science. Good tasks?

A recent task is "Hanging a mug on a rack":
- visually clear
- satisfying whether or not success happened (it discourages people -- visitors, researchers, investors -- when success is not satisfying)
- requires precision
- requires 6-dof (degrees of freedom, number of independent rotary motions)
- single, rigid object, rigid target
- lends itself well to class-level generalization

"Sorting blue blocks from yellow blocks":
- visually clear and satisfying
- a handful of objects, not just one, all with desired configurations
- ordering doesn't matter -> large amount of multi-modality
- table-top pushing with a single point of contact (rather than grabbing) -> feedback

Standard/benchmark tasks are helpful:
- D4RL (adroit door example), Ravens (kitting example), RLBench, MetaWorld, Robomimic, ...
- Dig through the sim code for standard tasks.
- Watch videos to see what's qualitatively happening (don't just look at numbers).

Task-conditioning:
- pose-conditioned
- image-conditioned
- "task id" conditioned
- demonstrated goal configurations
- language conditioned
Thinking of large pretrained models like DALL-E, a goal is diverse natural-language-conditioned tasks. CLIPort's task steps move toward this.

Advice
- understand what's already out there before building your own thing
- learn a simulation framework (or 5 or 10), be able to set up your own (try the colab)
- make a learning agent in sim match what is available in the real world (don't let it cheat)
- can be useful: visually clear + satisfying
- can be useful: minimal complexity to test skills
- might be necessary? diversity
- [appreciate the work that goes into high quality sim environments?]

Data is collected in an environment, under a policy.
- the environment may or may not differ from the evaluation environment
- policy: temporally sequential trajectories of objectives + actions + rewards/labels/etc; task-specific? on which task? task-agnostic? "play"? how was it provided? is it scripted? a human demonstration? somebody else's learned policy?

Observation and Action Representation (a small sketch follows at the end of this talk's notes)
- agnostic to the choice of model
- at the level of the robot
- [...]

Observation Representation
- images
- robot measurements: joints, cartesian states, force, torque, tactile sensors, sound ...
- language
- multimodal
- [...]

Action Representation
What it is:
1. Move until done -- pixel as action, pose in space; practical, common nowadays.
2. Continuous (time / high-rate discrete) -- joints, pose; likely to be more popular down the road, though it can seem harder.
How it's represented:
- discrete vs continuous
- classification can make things easy; discrete spaces
- in the real world, spaces are continuous
[...]

Back to the Real World
Everything is asynchronous. Sensors, updates: the world is no longer a constant update rate, but made of physical components that update with delays that can be random.

These all connect with the colab:
- Real robot opportunities + limitations
- "Big picture challenges"
- Environments and tasks
- Data
- Parameterizing observations/actions
- Back to Real Robots
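A small sketch of what these parameterizations can look like in code (my own illustration with made-up field names, not anything from the talk): an observation bundling camera images, robot measurements, and language, plus the two action styles mentioned, a "move until done" pick/place pose versus a continuous joint command.

from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray                  # (H, W, 3) camera image
    joint_positions: np.ndarray      # (num_joints,) measured joint angles
    instruction: str                 # optional language conditioning

@dataclass
class PickPlaceAction:
    """'Move until done': a pixel (or pose) for pick and one for place."""
    pick_pixel: Tuple[int, int]
    place_pixel: Tuple[int, int]
    place_rotation: float            # radians, to account for how the object was grasped

@dataclass
class JointVelocityAction:
    """Continuous, high-rate control: one command per control tick."""
    joint_velocities: np.ndarray     # (num_joints,) radians per second

obs = Observation(rgb=np.zeros((224, 224, 3), np.uint8),
                  joint_positions=np.zeros(7),
                  instruction="pick the yellow block and place it on the blue bowl")
act = PickPlaceAction(pick_pixel=(120, 96), place_pixel=(60, 180), place_rotation=0.0)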
---------------
Michael S. Ryoo
Recipes from a Vision Researcher
Computer vision research tasks: object classification, detection ... [i did something else for most of this]

reinforcement learning

self-supervised learning:
- masking part of the data, predicting the rest
- predicting mutations, like image rotation

self-supervised + reinforcement learning:
- CURL
- soft actor critic, one model for both
- [...]

self-supervised losses:
- rotation, shuffling, reconstruction, RL context prediction
- some of these actually worsen performance
- carefully designed image augmentation helps
- an evolutionary algorithm improved them

-----------------
No Training: Using Pretrained Zero-shot Models
The speaker generated the abstract for the presentation using GPT-3's zero-shot capabilities.

finetuning
- can also be done by adding a linear layer to an existing model
zero-shot
- no training needed
- has downsides, but works surprisingly well

works because:
- _tons_ of data
- _huge_ models
current big successes:
- language processing
- image classification

Language Modeling
- predict the next word in a sentence
- results in reuse for most language tasks by providing prompts

CLIP (a small zero-shot classification sketch follows this talk's notes)
- images are aligned with captions
- expressions are considered similar if they caption similar images
- can then be used to classify images by how well they compare to phrase labels

Robots
- data is lacking, but developing rapidly
- existing pretrained models such as GPT and CLIP are adapted [finetuned or zero-shot]

CLIPort [skipped to demo since behind time and CLIPort described by others]: robots successfully "close a loop on a piece of string" or "pick up cherries" in response to instructions. Finetuning can have drawbacks: losses in the original capabilities of the model.

Language Models as Planners: proper prompting gives high-level plans. Prompt: "...Browse the internet Step 1:" A huge language model yields "Walk to the home office." BERT is used to adjust this to instructions that are actually relevant. [missed part, maybe it selects from options] [i think then the step was appended to the prompt, and the cycle repeated] [...]

CLIP on Wheels (CoW) (Gadre et al, 2022)
- zero-shot models are used to compose the parts of an agent
- objects can be classified via language, gradient, patch
- where to search next could be learning-based or frontier-based
- choice of explore or exploit
- Habitat, RoboTHOR
CLIP vision and language turns visuals into language. A gradient-based relevance method is then used to select actions from language using geometric relevance maps. Explore or exploit is selected based on confidence in the object's identity. No training needed.

Limitations
- biases unmapped, expected to be big
- not as accurate as finetuning
- mainstream large models are pricey
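As a concrete example of the CLIP usage described in these notes, here is a minimal zero-shot classification sketch using the openai/CLIP package; the image file and candidate labels are made up.

# pip install git+https://github.com/openai/CLIP
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("tabletop.png")).unsqueeze(0).to(device)   # placeholder image
labels = ["a yellow block", "a blue bowl", "a robot arm"]                # placeholder phrase labels
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # similarity between the image embedding and each phrase embedding
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.3f}")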
Tutorial Part 2
Under Socratic Models, the code. To use a language model to select steps, the model is prompted with a sequence of instruction code: the instructions are comments, and a number of diverse examples are given. [this sequence looks to me like the result of manual prompt tuning: trying things and seeing what kinds of prompts produce good results] They did this tuning in the OpenAI Playground (available in the public openai beta). It's just language model prompting. (A toy sketch of this prompting pattern follows these notes.) [It's notable that they don't include examples of failure conditions, leaving that to other handling. This lets them stuff more examples into the relatively short prompt, which produces better performance.] Having already provided sets of named objects to the model, GPT can instruct the robot by passing words from its pretraining dataset to the robot's method functions.

- Recommends increasing training steps above the hardcoded default of 4000 in the colab, for better performance.

ViLD is used to generate descriptions of the objects automatically; this is fed as context [likely prompt material] to the language model. Their example actually generates the expected language instructions, with its control method, and passes them on to be reparsed. There's mention of improved performance. The demonstrator comes up with a novel prompt, "place the blocks into mismatched bowls", for which the language model crafts 2 steps that match the meaning; they wanted 3 steps, so something didn't go as they expected.

Limitations For Creative People to Address
- Scene representation is naive: more fine-grained attributes, more complex spatial relationships (it doesn't like "pick the thing that's on top of"), more complex objects?
- Bounded perception outputs: image -> rich description is wanted, instead of image -> object labels -> scores
- No closed-loop feedback to replan, or to handle when the models miss parts of the scenario
- Bounded planning outputs: how to learn new low-level skills? combinations of known primitives can seem limited

Says lots more things will come out in coming months. Slides are on the website. The talk was recorded. I likely won't note the rest of this this way right now.
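A toy sketch of the code-as-comments prompting pattern described in these notes. The few-shot example, object names, stop sequence, and engine choice are my own assumptions (written against the 2022-era openai Python completion API), not the colab's actual prompt.

import openai

openai.api_key = "YOUR_API_KEY"

SCENE = ["yellow block", "green block", "blue bowl", "green bowl"]  # e.g. object names from ViLD

# One illustrative few-shot example: the instruction is a comment, the action is code.
FEW_SHOT = """objects = ['red block', 'blue bowl']
# put the red block in the blue bowl.
robot.pick_and_place('red block', 'blue bowl')
"""

def plan_steps(instruction, scene_objects):
    prompt = (FEW_SHOT
              + f"objects = {scene_objects}\n"
              + f"# {instruction}.\n")
    response = openai.Completion.create(
        engine="text-davinci-002",   # assumed 2022-era engine name
        prompt=prompt,
        max_tokens=128,
        temperature=0,
        stop="objects =",            # stop before the model invents a new scene
    )
    return response["choices"][0]["text"]

print(plan_steps("place the blocks into mismatched bowls", SCENE))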