translation to amnesiatic english

On 7/5/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
my existing work is at https://github.com/xloem/rnlp .
i just got composer_nb_alter_2.py to work.
this is just an implementation of example code for a library called 'mosaicml composer' which i am trying out, since my system is low end. it theoretically provides a pluggable framework to add optimizing additions to model training, to require fewer resources. the example uses a research data set to finetune a model to detect whether a phrase would be subjectively judged as "positive" or "negative" by a human being. models are made of a network of numbers called weights. the probabilities a model outputs are stored in logarithmic space, so they are called log probs. training a model means running a bunch of data through it, comparing the output to what is correct, taking the derivative of the error, and adjusting every weight by a small proportion of its derivative, for a very long time, until all those log probs produce the results considered correct. finetuning means taking a trained model, and doing that a little bit more with different data that has a similarity, so that the model will respond to the different data. it is _much_ faster than training, but requires an existing pretrained model, of which there are many. the STaR paper goes a little beyond finetuning, in order to require even less new data. this is called data augmentation and is also studied.
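to make that concrete, here is a tiny sketch of one such training step in plain pytorch. the model, data, and labels are placeholders, not anything from my actual script; it just shows the forward pass, the comparison, the derivative, and the weight update described above.

import torch

model = torch.nn.Linear(768, 2)                  # stand-in for a real language model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(8, 768)                     # a batch of data
labels = torch.randint(0, 2, (8,))               # what is considered correct

logits = model(inputs)                           # run the data through the model
loss = loss_fn(logits, labels)                   # compare the output to what is correct
loss.backward()                                  # take the derivative of the error with respect to every weight
optimizer.step()                                 # adjust every weight by a small proportion of its derivative
optimizer.zero_grad()                            # reset and repeat, for a very long time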
i took an existing example training script for a masked language model, and made it work with a generative language model.
research models appear to usually be what's called "masked language models"; a huge one is called BERT. these are well-respected models that are trained to fill in missing words in text (the words are "masked"). the popular models, like GPT, i'm calling "generative" models here. they aren't trained to handle masking: instead they predict the next word. i think they're designed to train on more data faster than a normal model, hence the presence of large and powerful ones. these models are trained so that, as they are called repeatedly, they generate long paragraphs of text similar to their training data. the two kinds of models are quite similar, but the software interfaces to access them, which are generally designed around research norms rather than DRY or anything, are quite different; and finetuning would be required to use either one as the other form effectively.
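to show the interface difference i mean, here is a rough sketch using huggingface transformers. the model names are just common public examples, not necessarily what i used, and the calls are from memory rather than from my script.

from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoTokenizer

# masked language model: fill in a blanked-out word
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
inputs = mlm_tok("the movie was [MASK].", return_tensors="pt")
logits = mlm(**inputs).logits
mask_pos = (inputs.input_ids == mlm_tok.mask_token_id).nonzero()[0, 1]
print(mlm_tok.decode(logits[0, mask_pos].argmax()))   # the model's guess for the masked word

# generative model: predict the next word, over and over
gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = gen_tok("the movie was", return_tensors="pt")
output = gen.generate(**inputs, max_new_tokens=5)     # keeps appending its own predictions
print(gen_tok.decode(output[0]))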
the smallest pretrained generative models i found are bigger than the masked ones, so it takes forever to run on a cpu, since they won't fit on my gpu.
finetuning a pretrained model is a fast and effective operation.
i am just restating here how impressive it can be that you can download a model that took a datafarm to produce and, on a home gpu, quickly finetune it to accurately perform a task its creators never imagined.
we don't actually need to use a generative model for this. the STaR technique can be done with a classifier. the model doesn't need to write out the answer, just pick it.
here i am treating masked language models as classifiers. my software interface caused them to appear this way. a classifier is a model that can only output a selection of one among a number of set classes. technically, GPT looks like a classifier to me: its set of classes is simply its entire vocabulary. more information: these models output a log prob for every class. the class with the largest value is then described as the choice. this works well for propagating the derivatives backward during training, and it also lets the model be used to estimate a distribution of likelihood over the classes.
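as a small illustration of the "largest log prob is the choice" part, with made-up numbers:

import torch

logits = torch.tensor([1.2, -0.3, 0.5])         # one made-up score per class
log_probs = torch.log_softmax(logits, dim=-1)   # normalized log probabilities over the classes
choice = log_probs.argmax()                     # the largest one is reported as the model's choice
print(choice.item(), log_probs.exp())           # .exp() recovers the likelihood distribution

# for GPT the same thing happens, except the set of classes is its entire
# vocabulary: the logits have shape [batch, sequence, vocab_size], and the
# argmax over the last axis picks the next word.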
however, the authors of the paper used a generative model.
i'm thinking about trying to generalise my code to use either a generative model or a masked model. the two are very similar. i was likely planning to do this.
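a very rough sketch of what that generalisation might look like, with hypothetical names; just one flag deciding which kind of pretrained model gets loaded:

from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

def load_model(name, generative=False):
    # hypothetical helper: the rest of the code would then only need to know
    # which positions in the output logits to treat as its classes.
    tokenizer = AutoTokenizer.from_pretrained(name)
    if generative:
        model = AutoModelForCausalLM.from_pretrained(name)    # e.g. "gpt2"
    else:
        model = AutoModelForMaskedLM.from_pretrained(name)    # e.g. "bert-base-uncased"
    return tokenizer, model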