Intro

Robotics and computer vision have been intertwined from the start. Lawrence Roberts wrote his "Blocks World" work in 1963, three years before the 1966 "Summer Vision Project". MIT built the Copy Demo in 1970, inspired by "Blocks World".
In the robot demos that followed, the bottleneck was deemed to be edge detection, which inspired decades of computer vision work on edge detection.
For more: Steve Seitz, "History of 3D Computer Vision", 2011, YouTube.

Colab Tutorial Intro

You could get an entire PhD by adding lines of code to this colab. socraticmodels.github.io -> code source: https://github.com/google-research/google-research/tree/master/socraticmodel... tutorial notebook: https://colab.research.google.com/drive/1jAyhumd7DTxJB2oZufob9crVxETAEKbV opens "Socratic Models: Robot Pick & Place".

The colab notebook itself is not required: the code can be pasted onto a local machine with a GPU (likely one already set up to work with PyTorch). The tutorial uses OpenAI and asks the user to register for an API key. It downloads pretrained models, but can be configured to train them instead. GPU load can be checked by running !nvidia-smi; it's good to do this regularly.

OpenAI's interface to GPT-3 is used to transform human instructions into steps (a rough sketch follows these notes). CLIP and ViLD, freely available pretrained models, are used to interpret the visual scene (I think). Pick-and-place is done with a pretrained model, publicly available for download, called CLIPort / Transporter Nets; this model is trained further. The lecturer opens up each code section and explains it (I am behind, was finding my OpenAI key).

An OpenAI Gym-style reinforcement learning environment (a custom Env Python class) was built to express the robot's interactions; it does not precisely follow the Env spec (sketched below). The pybullet library can do its own inverse kinematics, a lot of geometry that lets the user specify where to move the endpoint of a limb without having to solve for all the motor angles (sketched below). Both orthographic and perspective cameras are simulated; models trained on the two kinds of imagery pick up different kinds of bias (sketched below). The imagery from the cameras, and the trained models, are the only information the robot has about the world. The objects are placed randomly.

ViLD

The code is taken from ViLD's examples. ViLD produces a list of the objects in the scene and where they are, although the spatial information is not used yet.

Scripted Expert & Dataset

The scripted expert automatically produces demonstrations of picking and placing objects for the model to learn from; another approach would be to use a reinforcement learning agent. The demonstrations are collected into a "dataset". There are also pregenerated datasets to download. If "load_pregenerated" is unchecked, the colab will show videos of the example data moving the simulated arm. To make real data, the episode count in this section's code should be raised to a number much greater than 2, such as 9999 or 100000; this takes more time and trains a better model. During training, the environment is reset to place new objects in different positions. (A sketch of the scripted-expert idea is below.)

Training

Transporter Nets work visually. Images of the likely source and destination are processed, and the pick-to-place mapping is discerned by convolving them together (sketched below). An image of this shows all the picking and placing spots that match the user's request. The approach accommodates rotation and translation depending on how an object was grasped (this could also be done analytically with matrices). Tensorboard is used to display training progress over time. To train your own model, a checkbox must be unchecked so that a pretrained model is not loaded. The setup takes quite a bit of memory. Training is conditioned on the specific variations configured earlier by the developers (objects, colors, and locations) and on the sentence forms present in the training data: "pick the [] and place it on []".
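A minimal sketch of the GPT-3 step mentioned above: turning a free-form instruction into pick-and-place steps via the OpenAI completion API (legacy openai Python library). The model name, prompt wording, and parsing are my assumptions, not the colab's exact code.

```python
# Sketch: instruction -> numbered pick-and-place steps with GPT-3.
# Assumptions: legacy openai 0.x library, text-davinci-002, illustrative prompt.
import openai

openai.api_key = "YOUR_OPENAI_KEY"  # the tutorial asks you to register for a key

PROMPT = """Rewrite the instruction as numbered pick-and-place steps.

Instruction: tidy up, put the blocks in the bowls of matching color.
Steps:
1. Pick the yellow block and place it on the yellow bowl.
2. Pick the blue block and place it on the blue bowl.

Instruction: {instruction}
Steps:
"""

def instruction_to_steps(instruction):
    response = openai.Completion.create(
        engine="text-davinci-002",   # assumption: any GPT-3 completion model works here
        prompt=PROMPT.format(instruction=instruction),
        max_tokens=128,
        temperature=0.0,
    )
    text = response["choices"][0]["text"]
    # Keep only lines that look like "1. ..." steps.
    return [line.split(".", 1)[1].strip()
            for line in text.splitlines() if line.strip()[:1].isdigit()]

print(instruction_to_steps("move every block onto the brown bowl"))
```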
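A minimal sketch of what a gym-style Env for this task looks like. The class name, observation contents, and reward are illustrative; like the colab's class, this does not follow the official gym.Env spec exactly.

```python
# Sketch: the shape of a gym-style environment for the tabletop pick-and-place task.
import gym
import numpy as np

class PickPlaceEnv(gym.Env):
    def __init__(self, num_objects=4):
        self.num_objects = num_objects
        # In the colab, pybullet objects are (re)spawned at random poses here.

    def reset(self):
        # Randomly place objects on the table, then return camera observations.
        self.object_xy = np.random.uniform(-0.3, 0.3, size=(self.num_objects, 2))
        return self._get_obs()

    def step(self, action):
        # action: a dict of pick and place coordinates on the table plane.
        pick_xy, place_xy = action["pick"], action["place"]
        # The real Env runs pybullet IK + motor control to execute the motion.
        reward = float(self._closest_object(pick_xy) is not None)
        return self._get_obs(), reward, False, {}

    def _get_obs(self):
        # The real observation is rendered camera imagery; this is a placeholder.
        return {"image": np.zeros((224, 224, 3), dtype=np.uint8)}

    def _closest_object(self, xy):
        d = np.linalg.norm(self.object_xy - np.asarray(xy), axis=1)
        i = int(np.argmin(d))
        return i if d[i] < 0.05 else None
```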
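A small sketch of pybullet's built-in inverse kinematics, which lets you ask for joint angles that put the end effector at a target point instead of solving the geometry yourself. It uses the KUKA arm that ships with pybullet_data; the colab's robot differs, but the calls are the same.

```python
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())
robot = p.loadURDF("kuka_iiwa/model.urdf", useFixedBase=True)
ee_link = 6                    # end-effector link index for this URDF
target_pos = [0.5, 0.0, 0.4]   # desired end-effector position, in meters

# IK returns one angle per movable joint.
joint_angles = p.calculateInverseKinematics(robot, ee_link, target_pos)

# Drive the joints toward the IK solution and step the simulation.
for j, q in enumerate(joint_angles):
    p.setJointMotorControl2(robot, j, p.POSITION_CONTROL, targetPosition=q)
for _ in range(240):
    p.stepSimulation()
```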
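A sketch of the two kinds of simulated camera. pybullet can render with either a perspective or an orthographic projection; the specific matrices, viewpoint, and image size here are assumptions.

```python
import pybullet as p

p.connect(p.DIRECT)

# A camera looking straight down at the table.
view = p.computeViewMatrix(
    cameraEyePosition=[0, 0, 1.0],
    cameraTargetPosition=[0, 0, 0],
    cameraUpVector=[0, 1, 0])

# Perspective: field-of-view based projection (objects farther away look smaller).
proj_persp = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=10)

# Orthographic: a box-shaped viewing volume (no foreshortening).
# Arguments: left, right, bottom, top, near, far.
proj_ortho = p.computeProjectionMatrix(-0.5, 0.5, -0.5, 0.5, 0.01, 10)

w, h = 224, 224
_, _, rgb_persp, depth_persp, _ = p.getCameraImage(w, h, view, proj_persp)
_, _, rgb_ortho, depth_ortho, _ = p.getCameraImage(w, h, view, proj_ortho)
```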
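A sketch of the scripted-expert idea: because the simulator spawns the objects, their ground-truth poses are known, so demonstrations can be generated mechanically with no learned policy. The data layout and names are assumptions, not the colab's dataset format.

```python
import random
import numpy as np

def scripted_episode(object_poses, blocks, bowls):
    block, bowl = random.choice(blocks), random.choice(bowls)
    # Read the ground-truth poses straight out of the (simulated) scene.
    action = {"pick": object_poses[block], "place": object_poses[bowl]}
    text = f"pick the {block} and place it on the {bowl}"  # the trained sentence template
    # In the colab the action is executed in pybullet and camera images are recorded;
    # here the "observation" is just a placeholder.
    return {"obs": None, "action": action, "text": text}

blocks = ["yellow block", "green block"]
bowls = ["blue bowl", "red bowl"]
dataset = []
for _ in range(10000):  # the notes suggest raising the episode count well beyond 2
    # New random layout each episode, mirroring the environment reset between demos.
    poses = {name: np.random.uniform(-0.3, 0.3, size=2) for name in blocks + bowls}
    dataset.append(scripted_episode(poses, blocks, bowls))
```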
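A sketch of the convolution trick described in the Training notes: crop features around a candidate pick point and cross-correlate the crop against the whole scene feature map, so every candidate place location gets a score at once (rotations are handled by repeating this with rotated crops). Shapes and the random "features" are placeholders.

```python
import torch
import torch.nn.functional as F

C, H, W, K = 8, 64, 64, 16            # channels, feature map size, crop size
scene_feat = torch.randn(1, C, H, W)  # stand-in for the learned feature map
pick_yx = (30, 22)                    # candidate pick pixel

# Crop a K x K patch of features centered on the pick point.
y, x = pick_yx
crop = scene_feat[:, :, y - K // 2:y + K // 2, x - K // 2:x + K // 2]

# Cross-correlate the crop against the whole feature map: conv2d with the crop as
# the kernel scores every candidate place location in one pass.
place_scores = F.conv2d(scene_feat, crop, padding=K // 2)

best = torch.nonzero(place_scores[0, 0] == place_scores.max())[0]
print("best place pixel (y, x):", tuple(best.tolist()))
```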
The model only handles those template sentences because it isn't yet using a real language model like GPT-3 to process the instructions. After this training, using the model is as simple as providing a prompt such as "Pick the yellow block and place it on the blue bowl" and running it through the same simulator and model; the colab renders the result visually (sketched below). I believe the lecturer recommends starting to train a model now, which can take a few hours, so the training may be done by the time of the next tutorial chunk, where more models are integrated.
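A sketch of what running a prompt through the trained model amounts to at this stage: since instructions must match the trained template, a regex is enough to pull out the two object names, and the model's per-pixel scores choose the action. `transporter` and `env` stand in for the colab's trained model and environment; `transporter.predict` is a hypothetical call, not a real API.

```python
import re
import numpy as np

def run_prompt(prompt, transporter, env):
    m = re.match(r"pick the (.+) and place it on the (.+?)\.?$",
                 prompt.strip(), re.IGNORECASE)
    if m is None:
        raise ValueError("prompt must follow the trained template")
    pick_obj, place_obj = m.group(1), m.group(2)

    obs = env.reset()
    # Hypothetical model call: per-pixel scores for picking and placing these objects.
    pick_map, place_map = transporter.predict(obs["image"], pick_obj, place_obj)
    action = {
        "pick": np.unravel_index(np.argmax(pick_map), pick_map.shape),
        "place": np.unravel_index(np.argmax(place_map), place_map.shape),
    }
    return env.step(action)

# run_prompt("Pick the yellow block and place it on the blue bowl", transporter, env)
```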