Tutorial Part 2

Under Socratic Models, the code. To use a language model to select steps, the model is prompted with a sequence of instruction code: the instructions are comments, followed by a number of diverse examples. [This sequence looks to me like the result of manual prompt tuning: trying things and seeing what kinds of prompts produce good results.] They did this tuning in the OpenAI Playground (available in the public OpenAI beta). It's just language model prompting. [It's notable that they don't include examples of failure conditions, leaving those to other handling. This lets them fit more examples into the relatively short prompt, which produces better performance.]

Having already been given the set of named objects in the prompt, GPT can instruct the robot by passing those words, all within its pretraining vocabulary, to the robot's method functions. (Minimal sketches of this pipeline appear at the end of these notes.)

- Recommends increasing train steps above the hardcoded default of 4000 in the colab, for better performance.

ViLD is used to generate descriptions of the objects automatically; these are fed as context [likely prompt material] to the language model. Their example actually generates the expected language instructions with its control method and passes them on to be reparsed. There's mention of improved performance.

The demonstrator comes up with a novel prompt, "place the blocks into mismatched bowls", for which the language model crafts 2 steps that match the meaning; they wanted 3 steps, so something didn't go as they expected.

Limitations For Creative People to Address
- Scene representation is naive: more fine-grained attributes, more complex spatial relationships (doesn't like "pick the thing that's on top of"), more complex objects?
- Bounded perception outputs: image -> rich description, instead of image -> object labels -> scores
- No closed-loop feedback to replan, or to handle when the models miss parts of the scenario
- Bounded planning outputs: how to learn new low-level skills? Combinations of known primitives can seem limited.

Says lots more things will come out in coming months. Slides are on the website. The talk was recorded. I likely won't note the rest of this in this way right now.
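As a rough sketch of the perception step described above (image -> object labels -> scores): an open-vocabulary detector like ViLD scores candidate category names against detected regions, and something has to reduce that output to the flat object list the planner's prompt consumes. Everything here is an assumption for illustration; `detect`, the region dict keys, and the threshold are stand-ins, not ViLD's real interface.

```python
SCORE_THRESHOLD = 0.4  # assumed cutoff, not a value from the tutorial

def scene_objects(image, vocabulary, detect):
    """Reduce detector output (per-region label/score pairs) to the
    flat, de-duplicated list of object names the planning prompt
    expects. `detect` stands in for the real detector call."""
    objects = []
    for region in detect(image, vocabulary):
        label, score = region["label"], region["score"]
        if score >= SCORE_THRESHOLD and label not in objects:
            objects.append(label)
    return objects
```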
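The prompt structure the notes describe — instructions as comments, then diverse worked examples, with the detected object names spliced in as context — might look roughly like this. The object names, example task, and step format are hypothetical, not the tutorial's actual prompt.

```python
# Hypothetical few-shot planning prompt in the shape described above:
# commented instructions, worked examples, then the live scene.
PROMPT_TEMPLATE = """\
# Break the command into pick-and-place steps.
# Only use objects that appear in the objects list.

objects = ["blue block", "red block", "blue bowl", "red bowl"]
# command: put the blocks in the bowls with matching colors.
robot.pick_and_place("blue block", "blue bowl")
robot.pick_and_place("red block", "red bowl")
# done.

objects = [{objects}]
# command: {command}
"""

def build_prompt(objects, command):
    """Splice the detected object labels and the user's command into
    the few-shot prompt."""
    object_str = ", ".join(f'"{o}"' for o in objects)
    return PROMPT_TEMPLATE.format(objects=object_str, command=command)
```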
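The planning call itself is, as the notes say, just language model prompting. A sketch against the beta-era OpenAI Python client; the engine name, sampling parameters, and stop sequence are my assumptions, not the tutorial's settings.

```python
import openai

def plan_steps(prompt):
    """Request a completion; stopping at the example-terminating
    comment keeps the model from inventing a second scene."""
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed engine choice
        prompt=prompt,
        temperature=0,              # deterministic step selection
        max_tokens=128,
        stop="# done.",
    )
    return response["choices"][0]["text"]
```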
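Finally, the notes say the generated instructions are passed on to be reparsed into calls on the robot's method functions. Assuming the completion comes back as lines in the same `robot.pick_and_place(...)` format the examples use, a small parser could dispatch them; the `robot` object and the object-name sanity check are assumptions.

```python
import re

# Matches the assumed step format, e.g.:
#   robot.pick_and_place("blue block", "red bowl")
STEP_RE = re.compile(r'robot\.pick_and_place\("([^"]+)", "([^"]+)"\)')

def execute_completion(completion, robot, known_objects):
    """Parse each generated step and dispatch it to the robot,
    skipping malformed lines and steps that reference objects the
    detector never reported."""
    for line in completion.splitlines():
        m = STEP_RE.match(line.strip())
        if m is None:
            continue
        pick, place = m.groups()
        if pick in known_objects and place in known_objects:
            robot.pick_and_place(pick, place)
```

Chained together, the whole sketch reads: `objects = scene_objects(image, vocab, detect)`, then `prompt = build_prompt(objects, command)`, then `execute_completion(plan_steps(prompt), robot, objects)`.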