[data science] CVPR 2022: Robot Learning Tutorial (Vision, Text)

Mon Jun 20 09:55:54 PDT 2022

Toolkit for Vision-Based Robot Learning Research

What are the essential tools? Connects back to colab tutorial.

Methods:
- reinforcement learning
- imitation learning
- self-supervised learning
- just prompting a large language model to solve everything

Tools:
- understanding real-robot challenges
- collecting data
- environments and tasks
- parameterizing observations and actions

Methods end up in titles and abstracts, and exhaustively in methods
sections: but tools explain why these things matter, explain how the
idea came together, and empower [something].

Not so uncommon:
- a short 10-100 line published algorithm
- but everything else is 1k to 10k lines of code. lots of this in the colab

Everything else: evaluation, training, simulation, data ...

Papers can look like a ton of effort went into running experiments;
but with an evaluation harness set up, it can be simple and quick.
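
[sketch, not from the talk] A minimal idea of what an evaluation harness
could look like: one function that runs any policy for a number of
episodes and reports the success rate. ToyReachEnv, its reset/step
signature, and evaluate are all hypothetical names made up for
illustration; a real harness would wrap a simulator or a robot.

```python
class ToyReachEnv:
    """Hypothetical 1-D task: start at 0, reach `goal` within `horizon` steps."""

    def __init__(self, goal=3, horizon=5):
        self.goal, self.horizon = goal, horizon

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.pos += action
        self.t += 1
        success = self.pos == self.goal
        done = success or self.t >= self.horizon
        return self.pos, done, success


def evaluate(policy, env, episodes):
    """Run `policy` for `episodes` episodes and return the fraction that succeed."""
    successes = 0
    for _ in range(episodes):
        obs, done, success = env.reset(), False, False
        while not done:
            obs, done, success = env.step(policy(obs))
        successes += success
    return successes / episodes


# Once the harness exists, comparing two policies is one line each.
print(evaluate(lambda obs: 1, ToyReachEnv(), episodes=10))
print(evaluate(lambda obs: -1, ToyReachEnv(), episodes=10))
```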

Toolkit for vision-based robot-learning:
- Real robot opportunities + limitations
- "Big picture challenges" in robotics
- Environments and tasks
- [other things i missed....]

Real Robots: final metric, $-limited access
Simulating Robots: high-throughput, anyone can contribute

It can seem hard to do something with a robot, so it's good to pick a
task with significant return.

What can robots easily do?
- check videos. things that are easy to make robots do have been
recorded. things that are hard to make them do are missing from the
videos. regularities in the setup and environment probably show
limitations of the robot being recorded.
- robot competitions also show this
- robotics industry has accomplished surprising things
- [missed]

Big Picture Challenges

make your own list
1. can seem hard to get tons of data
2. #1 seems worse from embodiment x task-specification space. unlike,
say, image recognition, robots and robot situations are very diverse.
3. hard to evaluate, in the real world, how something will behave over
many trials
one approach:
 1. motivate from real world opportunities + limitations
 2. make a relevant simulation, compare all the ideas
 3. validate and compare a small subset

[the usual approach is to evaluate with things not present in
development. helps find more issues.]

.. [missed some things]

Robust and mature evaluation protocols are missing from robotics;
other domains have these; contribution requested.

Metrics

- success -> how often does it work? especially for new, or not
previously possible, things?

- proxies for success: MSE dynamics prediction error, PCK at 5 keypoint
localization

- hard to capture qualitative behaviors in quantitative metrics ->
videos helpful

People argue over whether to have more videos, or more tables. Both are needed.
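
[sketch, not from the talk] The three kinds of numbers mentioned above
can be computed in a few lines. This assumes "PCK at 5" means a keypoint
counts as correct when it lands within 5 pixels of the ground truth; the
function names are made up for illustration.

```python
import math


def success_rate(outcomes):
    """Fraction of trials that succeeded; `outcomes` is a list of bools."""
    return sum(outcomes) / len(outcomes)


def mse(predicted, actual):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def pck(predicted, actual, threshold):
    """Percentage of Correct Keypoints: share of predicted (x, y) points
    within `threshold` pixels of the matching ground-truth point."""
    hits = sum(math.dist(p, a) <= threshold for p, a in zip(predicted, actual))
    return hits / len(actual)


print(success_rate([True, False, True, True]))
print(mse([1.0, 2.0], [1.0, 4.0]))
print(pck([(0, 0), (10, 10)], [(3, 4), (10, 30)], threshold=5))
```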

Environment and Task Selection

Both an art and a science.

Good tasks?
A recent task is "Hanging a mug on a rack".
- visually clear
- satisfying whether or not success happened
discourages people (visitors, researchers, investors) when success is
not satisfying.
- requires precision
- requires 6-dof (degrees of freedom: 3 for position plus 3 for orientation)
- single, rigid object, rigid target
- lends well to class-level generalization

"Sorting blue blocks from yellow blocks"
- visually clear and satisfying
- handful of objects, not just one, all with desired configurations
- ordering doesn't matter -> large amount of multi-modality
- table-top pushing with single-point-of-contact (rather than
grabbing) -> feedback

Standard/benchmark tasks are helpful.
- D4RL (adroit door example), Ravens (kitting example), RLBench,
MetaWorld, Robomimic, ...
- Dig through sim code for standard tasks
- Watch videos to see what's qualitatively happening (don't just look
at numbers)

Task-conditioning:
- pose-conditioned
- image-conditioned
- "task id" conditioned
- demonstrated goal configurations
- language conditioned

Thinking of large pretrained models like DALL-E, one goal is diverse
natural-language-conditioned tasks. CLIPort's task steps move toward
this.

Advice
- understand what's already out there before building your own thing
- learn a simulation framework (or 5 or 10), be able to set up your
own (try the colab)
- make a learning agent in sim match what is available in the real
world (don't let it cheat)
- can be useful: visually clear + satisfying
- can be useful: minimal complexity to test skills
- might be necessary? diversity
- [appreciate the work that goes into high quality sim environments?]

Data
is collected in an environment, under a policy.
- environment may or may not differ from evaluation environment
- policy: temporally sequential, trajectories of
observations+actions+rewards/labels/etc; task-specific? on which task?
task-agnostic? "play"? how was it provided? is it scripted? human
demonstration? somebody else's learned policy?
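
[sketch, not from the talk] One way to keep the questions above attached
to the data is to store them as fields next to each trajectory. The
class and field names here are hypothetical, chosen only to mirror the
bullet points.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Step:
    observation: Any
    action: Any
    reward: Optional[float] = None  # may be absent for demonstrations or play


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    task: Optional[str] = None   # None for task-agnostic "play" data
    source: str = "scripted"     # e.g. "scripted", "human_demo", "learned_policy"
    environment: str = "sim"     # may differ from the evaluation environment

    def total_reward(self) -> float:
        """Sum of the rewards that were actually recorded."""
        return sum(s.reward for s in self.steps if s.reward is not None)


demo = Trajectory(task="hang mug on rack", source="human_demo")
demo.steps.append(Step(observation="image_0", action="move", reward=0.0))
demo.steps.append(Step(observation="image_1", action="release", reward=1.0))
print(demo.source, demo.total_reward())
```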

Observation and Action Representation
- agnostic to choosing model
- at level of robot
- [...]

Observation Representation
- images
- robot measurements: joints, cartesian states, force, torque, tactile
sensors, sound ...
- language
- multimodal
- [...]

Action Representation

What it is:
1. Move until done
- pixel is action, pose in space
- practical, common nowadays
2. Continuous (time / high rate discrete)
- joints, pose
- likely to become more popular down the road; for now it can seem harder

How it's represented:
- discrete vs continuous
- classification can make learning easier, but it needs discrete spaces
- in the real world, spaces are continuous
[...]
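
[sketch, not from the talk] Two small conversions that go with the
notes above: binning a continuous action so it can be treated as
classification, and turning a chosen pixel into a position, assuming a
top-down camera with a fixed meters-per-pixel scale. All names and the
camera assumption are mine.

```python
def discretize(value, low, high, num_bins):
    """Map a continuous value in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high falls into the last bin


def undiscretize(index, low, high, num_bins):
    """Map a bin index back to the continuous value at the bin center."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def pixel_to_xy(row, col, pixel_size, origin_xy):
    """Assumed top-down view: pixel (row, col) maps linearly to table (x, y) in meters."""
    return (origin_xy[0] + row * pixel_size, origin_xy[1] + col * pixel_size)


bin_index = discretize(0.0, low=-1.0, high=1.0, num_bins=4)
print(bin_index, undiscretize(bin_index, low=-1.0, high=1.0, num_bins=4))
print(pixel_to_xy(10, 20, pixel_size=0.005, origin_xy=(0.25, -0.5)))
```

The round trip loses precision (0.0 comes back as the bin center), which
is the usual price of making a continuous space discrete.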

Back to the Real World

Everything is asynchronous. Sensors, updates: the world no longer runs
at a constant update rate, but is made of physical components that
update with delays that can be random.
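
[sketch, not from the talk] A toy model of the delay problem: each
sensor reading has a capture time and a later, random arrival time, and
the controller can only act on what has arrived so far. The uniform
delay and all names are assumptions for illustration.

```python
import random


def simulate_sensor(period, max_delay, num_readings, rng):
    """Return (capture_time, arrival_time, value) tuples with random delays."""
    readings = []
    for i in range(num_readings):
        capture = i * period
        arrival = capture + rng.uniform(0.0, max_delay)
        readings.append((capture, arrival, f"frame_{i}"))
    return readings


def latest_available(readings, now):
    """Most recently captured reading that has already arrived by `now`, else None."""
    arrived = [r for r in readings if r[1] <= now]
    return max(arrived, key=lambda r: r[0]) if arrived else None


readings = simulate_sensor(period=0.1, max_delay=0.3, num_readings=5,
                           rng=random.Random(0))
reading = latest_available(readings, now=0.35)
if reading is not None:
    capture, arrival, value = reading
    # The controller sees data that is already (now - capture) seconds old.
    print(value, "age:", round(0.35 - capture, 3))
```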

These all connect with the colab.

Real robot opportunities + limitations
"Big picture challenges"
Environments and tasks
Data
Parameterizing observations/actions
Back to Real Robots

---------------
Michael S Ryoo
Recipes from a Vision Researcher

Computer Vision Research
Task: Object classification, detection ...

[i did something else for most of this]
reinforcement learning
self-supervised learning:
- masking part of data, predict the rest
- predict mutations, like image rotation
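
[sketch, not from the talk] Both pretext tasks above come down to
making (input, target) pairs from unlabeled data. A minimal version on
toy data, with images as nested lists and text as word lists; the
function names are made up.

```python
import random


def rotate90(image, k):
    """Rotate a 2D list-of-lists image clockwise by k quarter turns."""
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def rotation_example(image, rng):
    """Rotation pretext task: the label is the number of quarter turns applied."""
    k = rng.randrange(4)
    return rotate90(image, k), k


def mask_example(tokens, rng, mask_token="[MASK]"):
    """Masking pretext task: hide one token; the target is the hidden token."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + [mask_token] + tokens[i + 1:], tokens[i]


rng = random.Random(0)
print(rotation_example([[1, 2], [3, 4]], rng))
print(mask_example(["pick", "up", "the", "mug"], rng))
```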

self-supervised + reinforcement learning:
- CURL
- Soft Actor-Critic, one model for both
- [...]

self-supervised losses:
- rotation, shuffling, reconstruction, RL context prediction
- some of these actually worsen performance
- carefully designed image augmentation helps
- evolutionary algorithm improved
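
[sketch, not from the talk] The notes do not say which augmentation was
used. Random cropping is shown here only as one common example of an
image augmentation; the function name and toy image are made up.

```python
import random


def random_crop(image, out_h, out_w, rng):
    """Cut a random out_h x out_w window from a 2D list-of-lists image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - out_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - out_w)
    return [row[left:left + out_w] for row in image[top:top + out_h]]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in random_crop(image, 3, 3, random.Random(0)):
    print(row)
```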

-----------------
No Training Using Pretrained Zero-shot Models

generated abstract for presentation using gpt-3's zero-shot capabilities

finetuning
- can also be done by adding a linear layer to existing model
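
[sketch, not from the talk] "Adding a linear layer" can mean training
only a small linear head on top of features from a frozen model. Below,
frozen_features is a hypothetical stand-in for the pretrained model, and
the head is a logistic regression trained by plain gradient descent on
four toy points.

```python
import math


def frozen_features(x):
    """Stand-in for a pretrained model's embedding; never updated (hypothetical)."""
    return [x, x * x]


def train_linear_head(inputs, labels, epochs=500, lr=0.1):
    """Fit weights w and bias b of a logistic-regression head on frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            f = frozen_features(x)
            p = 1.0 / (1.0 + math.exp(-(w[0] * f[0] + w[1] * f[1] + b)))
            grad = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    f = frozen_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0


inputs, labels = [-2.0, -0.5, 0.5, 2.0], [1, 0, 0, 1]
w, b = train_linear_head(inputs, labels)
print([predict(w, b, x) for x in inputs])
```

Only w and b change during training; the feature function is left
alone, which is what keeps this cheap compared with updating the whole
model.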

zeroshot
- no training needed
- downsides but works surprisingly well

works because:
- _tons_ of data
- _huge_ models

current big successes:
- language processing
- image classification

Language Modeling
- predict the next word in a sentence
- results in reuse for most language tasks by providing prompts
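
[sketch, not from the talk] Next-word prediction in its simplest form
is counting which word follows which. Large language models are neural
networks trained on far more data, so this bigram counter is only a
stand-in to show the objective; the names are made up.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Most frequent follower of `word`, or None if the word was never seen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


counts = train_bigram("the robot picks the block and the robot moves")
print(predict_next(counts, "the"))
```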

CLIP
- images are aligned with captions
- expressions are considered similar if they caption similar images
- can then be used to classify images by how well they compare to phrase labels
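
[sketch, not from the talk] The classification step can be shown with
made-up embeddings: pick the phrase label whose vector has the highest
cosine similarity to the image vector. The three-number vectors below
are invented; a real system would get them from CLIP's image and text
encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the phrase label whose embedding best matches the image embedding."""
    return max(label_embeddings,
               key=lambda label: cosine(image_embedding, label_embeddings[label]))


# Invented embeddings, for illustration only.
labels = {
    "a photo of a mug": [1.0, 0.1, 0.0],
    "a photo of a block": [0.0, 1.0, 0.2],
}
print(zero_shot_classify([0.9, 0.2, 0.1], labels))
```

New classes are added by writing new phrases, with no retraining.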

Robots
- data is lacking, but developing rapidly
- existing pretrained models such as GPT and CLIP are adapted
[finetuned or zero-shot]

CLIPort [skipped to demo since behind time and CLIPort described by
others]: robots successfully "close a loop on a piece of string" or
"pick up cherries" in response to instructions.

Finetuning can have drawbacks; losses in original capabilities of model.

Language Models as Planners: proper prompting gives high-level plans.

Prompt: "...Browse the internet
Step 1:"
A huge language model yields "Walk to the home office."
BERT is used to adjust this to instructions that are actually
relevant. [missed part, maybe it selects from options] [i think then
the step was appended to the prompt, and the cycle repeated]
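
[sketch of my guess above, not confirmed by the talk] The loop as I
understood it: ask the model for the next step, snap the free-form
answer to the closest allowed action, append it to the prompt, repeat.
Word overlap stands in for the BERT similarity, and the "model" is a
canned list of answers; every name here is hypothetical.

```python
def words(text):
    """Lowercased word set with simple punctuation stripped."""
    return {w.strip(".,!?") for w in text.lower().split()}


def word_overlap(a, b):
    """Jaccard overlap of word sets; a crude stand-in for embedding similarity."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


def closest_admissible(proposal, admissible_actions):
    """Pick the allowed action most similar to the model's free-form proposal."""
    return max(admissible_actions, key=lambda act: word_overlap(proposal, act))


def plan(task, propose_step, admissible_actions, max_steps=5):
    """Build a plan step by step, feeding each chosen action back into the prompt."""
    prompt = f"Task: {task}\n"
    steps = []
    for i in range(1, max_steps + 1):
        prompt += f"Step {i}:"
        proposal = propose_step(prompt)
        if proposal is None:  # the model has nothing more to add
            break
        action = closest_admissible(proposal, admissible_actions)
        steps.append(action)
        prompt += f" {action}\n"
    return steps


def make_canned_model(responses):
    """Fake language model that replays fixed answers, then returns None."""
    it = iter(responses)
    return lambda prompt: next(it, None)


allowed = ["walk to home office", "sit on chair", "switch on computer"]
model = make_canned_model(["Walk to the home office.",
                           "Sit down on the chair",
                           "Turn on the computer"])
print(plan("Browse the internet", model, allowed))
```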

[...]

CLIP on Wheels (CoW) (Gadre et al., 2022)
- zeroshot models are used to compose the parts of an agent
- objects can be classified via language, gradient, patch
- where to search next could be learning-based or frontier-based
- choice of explore or exploit
- Habitat, RoboTHOR
CLIP vision and language turns visuals into language. Then a
gradient-based relevance method is used to select actions from language
using geometric relevance maps.
Explore or exploit selected based on confidence of object identity.
No training needed.
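
[sketch, not from the talk] Two of the pieces above in toy form: the
explore-or-exploit switch based on detection confidence, and
frontier-based exploration on a small grid map. The 0.5 threshold, the
grid symbols, and the function names are all my assumptions.

```python
def choose_mode(confidence, threshold=0.5):
    """Go to the object if the detector is confident enough, else keep exploring."""
    return "exploit" if confidence >= threshold else "explore"


def frontier_cells(grid):
    """Free cells ('.') that touch an unknown cell ('?'); '#' is an obstacle."""
    frontiers = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != ".":
                continue
            neighbors = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                   and grid[nr][nc] == "?" for nr, nc in neighbors):
                frontiers.append((r, c))
    return frontiers


grid = ["..?",
        ".#?",
        "..."]
print(choose_mode(0.2), frontier_cells(grid))
print(choose_mode(0.9))
```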

Limitations
- biases unmapped, expected to be big
- not as accurate as finetuning
- mainstream large models are pricey