[data science] CVPR 2022: Robot Learning Tutorial (Vision, Text)

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Mon Jun 20 08:00:41 PDT 2022


Intro

Robotics and computer vision have been intertwined from the start.

Lawrence Roberts wrote "Blocks World" in 1963, three years before the
1966 "Summer Vision Project".

MIT built the Copy Demo in 1970, inspired by "Blocks World".

From the ensuing robot demos, the bottleneck was deemed to be edge
detection, which inspired decades of computer vision work on edge
detection.

For more: Steve Seitz "History of 3D Computer Vision", 2011, YouTube.

Colab Tutorial Intro

You could get an entire PhD by adding lines of code to this colab.
socraticmodels.github.io -> code

source: https://github.com/google-research/google-research/tree/master/socraticmodels

tutorial notebook:
https://colab.research.google.com/drive/1jAyhumd7DTxJB2oZufob9crVxETAEKbV

Opens "Socratic Models: Robot Pick & Place" .

The colab notebook is not required: the code can be pasted onto a
local machine with a GPU (likely one already set up to work with
pytorch).

The tutorial uses OpenAI's API and asks the user to register for a key.

It downloads pretrained models but can be configured to train them instead.

The GPU load can be shown by running !nvidia-smi; it's good to do
this regularly.

OpenAI's interface to GPT-3 is used to transform human instructions into steps.
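
A rough sketch of what that call can look like; the prompt text and
engine name here are my guesses, not the tutorial's:

  import openai

  openai.api_key = "sk-..."  # the key the tutorial asks you to register for

  # Hypothetical prompt; the colab's actual prompt is more elaborate.
  prompt = (
      "Break the instruction into pick-and-place steps.\n"
      "Instruction: put the blocks in bowls of matching color.\n"
      "Steps:\n"
  )
  response = openai.Completion.create(
      engine="text-davinci-002",  # assumed engine name
      prompt=prompt,
      max_tokens=128,
      temperature=0,
  )
  print(response["choices"][0]["text"])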

The free pretrained models CLIP and ViLD are used to interpret the
visual scene (I think).
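
As a hedged illustration of the kind of image-text matching CLIP
provides (the image path and label list are placeholders; how the
colab actually wires CLIP in may differ):

  import clip
  import torch
  from PIL import Image

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model, preprocess = clip.load("ViT-B/32", device=device)

  image = preprocess(Image.open("scene.png")).unsqueeze(0).to(device)
  texts = clip.tokenize(["a yellow block", "a blue bowl"]).to(device)

  with torch.no_grad():
      logits_per_image, _ = model(image, texts)
      probs = logits_per_image.softmax(dim=-1)  # how well each label matches
  print(probs)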

Pick-and-place is done via a pretrained model, publicly available for
download, called CLIPort / Transporter Nets. This model is further
trained.

The lecturer opens up each code section and explains it (I am behind;
I was finding my OpenAI key).

An OpenAI Gym-style reinforcement learning environment (a custom Env
Python class) was built to express the robot's interactions. It does
not precisely follow the Env spec.
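
A minimal sketch of the shape of such a class, assuming placeholder
observation and action spaces (the colab's real ones differ):

  import gym
  import numpy as np

  class PickPlaceEnv(gym.Env):
      def __init__(self):
          # camera image in, pick (x, y) and place (x, y) out
          self.observation_space = gym.spaces.Box(
              0, 255, shape=(224, 224, 3), dtype=np.uint8)
          self.action_space = gym.spaces.Box(
              -1.0, 1.0, shape=(4,), dtype=np.float32)

      def reset(self):
          # respawn objects at random poses and return a camera image
          return np.zeros((224, 224, 3), dtype=np.uint8)

      def step(self, action):
          obs = np.zeros((224, 224, 3), dtype=np.uint8)
          reward, done, info = 0.0, False, {}
          return obs, reward, done, info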

The pybullet library used can do its own inverse kinematics (the
geometry that lets the user specify where to move the endpoint of a
limb without having to solve for all the joint angles).
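
A small example of asking pybullet for that, using a standard example
arm from pybullet_data rather than the colab's own robot:

  import pybullet as p
  import pybullet_data

  p.connect(p.DIRECT)
  p.setAdditionalSearchPath(pybullet_data.getDataPath())
  robot = p.loadURDF("kuka_iiwa/model.urdf")
  end_effector_link = 6

  # Joint angles that put the end effector at a target position.
  target_position = [0.5, 0.0, 0.3]
  joint_angles = p.calculateInverseKinematics(
      robot, end_effector_link, target_position)

  for joint_index, angle in enumerate(joint_angles):
      p.setJointMotorControl2(robot, joint_index, p.POSITION_CONTROL,
                              targetPosition=angle)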

They simulate both orthographic and perspective cameras. The two
camera types produce different kinds of model bias when trained on.
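
In pybullet the two camera types differ only in the projection matrix;
the poses and extents below are made-up values, not the colab's:

  import pybullet as p

  p.connect(p.DIRECT)
  width, height = 224, 224
  view_matrix = p.computeViewMatrix(
      cameraEyePosition=[0, -0.5, 0.7],
      cameraTargetPosition=[0, 0, 0],
      cameraUpVector=[0, 0, 1])

  # Perspective: field-of-view projection.
  persp = p.computeProjectionMatrixFOV(
      fov=60, aspect=width / height, nearVal=0.01, farVal=10)

  # Orthographic: parallel projection over a fixed workspace extent.
  ortho = p.computeProjectionMatrix(
      left=-0.5, right=0.5, bottom=-0.5, top=0.5, nearVal=0.01, farVal=10)

  _, _, rgb, depth, _ = p.getCameraImage(width, height, view_matrix, persp)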

The imagery from the cameras and the trained models is the only
information the robot has about the world. The objects are placed
randomly.

ViLD

The code is taken from ViLD's examples. ViLD produces a list of the
objects in the scene and where they are, although the spatial
information is not yet used.
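
A sketch of what is done with that output, with placeholder detections
standing in for ViLD's real results and an assumed score threshold:

  detections = [
      {"category": "yellow block", "box": (120, 80, 160, 120), "score": 0.71},
      {"category": "blue bowl",    "box": (60, 200, 140, 280), "score": 0.64},
      {"category": "green block",  "box": (30, 40, 70, 80),    "score": 0.22},
  ]

  score_threshold = 0.4   # assumed value, not the colab's
  found_objects = [d["category"] for d in detections
                   if d["score"] > score_threshold]
  print(found_objects)    # the boxes (spatial info) are ignored for now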

Scripted Expert & Dataset

The scripted expert automatically produces demonstrations of
picking-and-placing behavior for the model to learn from. Another
approach would be to use a reinforcement learning agent.

These demonstrations are put into a "dataset". There are also
pregenerated datasets to download. If "load_pregenerated" is
unchecked, the colab will show videos of the example data moving the
simulated arm. To make a real dataset, the code for the section should
be modified to have a number of demonstrations greater than 2, such as
9999 or 100000. This takes more time and trains a better model.
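
A hedged sketch of the collection loop, reusing the PickPlaceEnv
sketch above; the real oracle reads ground-truth object poses from the
simulator instead of sampling random actions:

  def scripted_expert(env):
      # stand-in for the oracle that computes real pick/place coordinates
      return env.action_space.sample()

  n_demos = 2          # raise this (e.g. 9999) for a real dataset
  dataset = []
  env = PickPlaceEnv()
  for _ in range(n_demos):
      obs = env.reset()                 # new random object placements
      action = scripted_expert(env)
      next_obs, reward, done, info = env.step(action)
      dataset.append((obs, action, next_obs, reward))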

During training, the environment is reset to place new objects in
different places.

...

Training

Transporter Nets work visually. Likely images of the source and
destination are processed, and the move is discerned by convolving
them together. An image of this shows all the picking and placing
spots that match the user's request. The approach accommodates the
rotation and translation needed in response to how an object was
grasped (this could also be done analytically with matrices).
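
A toy illustration of that convolution idea, not the real
architecture: a feature crop around the pick point is slid over the
scene's feature map, and the strongest response marks the place spot.

  import torch
  import torch.nn.functional as F

  scene_features = torch.randn(1, 8, 64, 64)  # placeholder scene feature map
  pick_crop = torch.randn(1, 8, 9, 9)         # placeholder crop at the pick point

  scores = F.conv2d(scene_features, pick_crop, padding=4)  # (1, 1, 64, 64) heatmap
  best = torch.argmax(scores)
  place_y, place_x = divmod(best.item(), scores.shape[-1])
  print(place_y, place_x)

  # Rotation is handled by repeating this with rotated copies of the crop.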

Tensorboard is used to display a chart tracking training over time.
To train your own model, a checkbox must be unchecked so that a
pretrained model is not loaded. The setup takes quite a bit of memory.
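
A minimal sketch of the kind of logging Tensorboard displays; the
colab's actual logging code and log directory may differ:

  from torch.utils.tensorboard import SummaryWriter

  writer = SummaryWriter("logs/train")
  for step in range(100):
      loss = 1.0 / (step + 1)          # placeholder for the training loss
      writer.add_scalar("loss", loss, step)
  writer.close()

  # In a notebook the dashboard is then shown with:
  #   %load_ext tensorboard
  #   %tensorboard --logdir logs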

The training is conditioned on specific variations configured earlier
by the dev (objects, colors, and locations) and on sentence forms that
were present in the training data: "pick the [] and place it on []".
This is because it isn't yet using a real language model like GPT-3 to
process the instructions.
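
A sketch of how such templated instructions can be generated; the
object and color lists here are illustrative, not the colab's exact
configuration:

  import random

  pick_objects  = ["yellow block", "green block", "blue block"]
  place_objects = ["yellow bowl", "green bowl", "blue bowl"]

  def sample_instruction():
      # only this sentence form appears in training
      return "pick the {} and place it on the {}".format(
          random.choice(pick_objects), random.choice(place_objects))

  print(sample_instruction())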

After this training, using the model is as simple as providing a
prompt such as "Pick the yellow block and place it on the blue bowl"
and running it through the same simulator and model. The colab renders
the result visually.

I believe the lecturer recommends beginning to train a model now,
which can take a few hours, so that the training may be done by the
time of the next tutorial chunk, where more models are integrated.

