translation to amnesiatic english

On 7/5/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
my existing work is at https://github.com/xloem/rnlp .
i just got composer_nb_alter_2.py to work.
this is just an implementation of example code for a library called 'mosaicml composer' which i am trying out, since my system is low end. it theoretically provides a pluggable framework to add optimizing additions to model training, to require fewer resources. the example uses a research data set to finetune a model to detect whether a phrase would be subjectively judged as "positive" or "negative" by a human being. models are made of a network of numbers called weights. the probabilities a model outputs are stored in logarithmic space, so they are called log probs. training a model means running a bunch of data through it, comparing the output to what is correct, taking the derivative of the error, and adjusting every weight by a small proportion of its derivative, for a very long time, until all those log probs produce the results considered correct. finetuning means taking a trained model, and doing that a little bit more with different data that has a similarity, so that the model will respond to the different data. it is _much_ faster than training, but requires an existing pretrained model, of which there are many. the STaR paper goes a little beyond finetuning, in order to require even less new data. this is called data augmentation and is also studied.
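to make that concrete, here is a tiny sketch of one such training step in plain pytorch. the model, data, and labels are placeholders, not anything from my actual script; it just shows the forward pass, the comparison, the derivative, and the weight update described above.

import torch

model = torch.nn.Linear(768, 2)                  # stand-in for a real language model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(8, 768)                     # a batch of data
labels = torch.randint(0, 2, (8,))               # what is considered correct

logits = model(inputs)                           # run the data through the model
loss = loss_fn(logits, labels)                   # compare the output to what is correct
loss.backward()                                  # take the derivative of the error with respect to every weight
optimizer.step()                                 # adjust every weight by a small proportion of its derivative
optimizer.zero_grad()                            # reset and repeat, for a very long time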
i took an existing example training script for a masked language model, and made it work with a generative language model.
research models appear to usually be what's called "masked language models"; a huge one is called BERT. these are well-respected models that are trained to fill in missing words in text (the words are "masked"). the popular models, like GPT, i'm calling "generative" models here. they aren't trained to handle masking: instead they predict the next word. i think they're designed to train on more data faster than a normal model, hence the presence of large and powerful ones. these models are trained so that, as they are called repeatedly, they generate long paragraphs of text similar to their training data. the two kinds of models are quite similar, but the software interfaces to access them, which are generally designed around research norms rather than DRY or anything, are quite different; and finetuning would be required to use either one as the other form effectively.
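to show the interface difference i mean, here is a rough sketch using huggingface transformers. the model names are just common public examples, not necessarily what i used, and the calls are from memory rather than from my script.

from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoTokenizer

# masked language model: fill in a blanked-out word
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
inputs = mlm_tok("the movie was [MASK].", return_tensors="pt")
logits = mlm(**inputs).logits
mask_pos = (inputs.input_ids == mlm_tok.mask_token_id).nonzero()[0, 1]
print(mlm_tok.decode(logits[0, mask_pos].argmax()))   # the model's guess for the masked word

# generative model: predict the next word, over and over
gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = gen_tok("the movie was", return_tensors="pt")
output = gen.generate(**inputs, max_new_tokens=5)     # keeps appending its own predictions
print(gen_tok.decode(output[0]))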
the smallest pretrained generative models i found are bigger than the masked ones, so it takes forever to run on a cpu, since they won't fit on my gpu.
finetuning a pretrained model is a fast and effective operation.
i am just restating here how impressive it can be that you can download a model that took a datafarm to produce and, on a home gpu, quickly finetune it to accurately perform a task its creators never imagined.
we don't actually need to use a generative model for this. the STaR technique can be done with a classifier. the model doesn't need to write out the answer, just pick it.
here i am treating masked language models as classifiers. my software interface caused them to appear this way. a classifier is a model that can only output a selection of one among a number of set classes. technically, GPT looks like a classifier to me: its set of classes is simply its entire vocabulary. more information: these models output a log prob for every class. the class with the largest value is then described as the choice. this works well for propagating the derivatives backward during training, and it also lets the model be used to estimate a distribution of likelihood over the classes.
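as a small illustration of the "largest log prob is the choice" part, with made-up numbers:

import torch

logits = torch.tensor([1.2, -0.3, 0.5])         # one made-up score per class
log_probs = torch.log_softmax(logits, dim=-1)   # normalized log probabilities over the classes
choice = log_probs.argmax()                     # the largest one is reported as the model's choice
print(choice.item(), log_probs.exp())           # .exp() recovers the likelihood distribution

# for GPT the same thing happens, except the set of classes is its entire
# vocabulary: the logits have shape [batch, sequence, vocab_size], and the
# argmax over the last axis picks the next word.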
however, the authors of the paper used a generative model.
i'm thinking about trying to generalise my code to use either a generative model or a masked model. the two are very similar. i was likely planning to do this.
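a very rough sketch of what that generalisation might look like, with hypothetical names; just one flag deciding which kind of pretrained model gets loaded:

from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

def load_model(name, generative=False):
    # hypothetical helper: the rest of the code would then only need to know
    # which positions in the output logits to treat as its classes.
    tokenizer = AutoTokenizer.from_pretrained(name)
    if generative:
        model = AutoModelForCausalLM.from_pretrained(name)    # e.g. "gpt2"
    else:
        model = AutoModelForMaskedLM.from_pretrained(name)    # e.g. "bert-base-uncased"
    return tokenizer, model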