ok let's try to implement STaR a little tiny smidge. basically, you need to be able to finetune or train a language model. this means paying a few dollars for time, having a powerful GPU, compromising with a very small model, or implementing research algorithms. my plan is to compromise with a very small model, while considering paying a few dollars to train on somebody else's server.
my existing work is at https://github.com/xloem/rnlp . i just got composer_nb_alter_2.py to work. i took an existing training sample for a masked language model, and made it work with a generative language model. the smallest pretrained generative models i found are bigger, so it takes forever to run on a cpu since it won't fit on my gpu. finetuning a pretrained model is a fast and effective operation. we don't actually need to use a generative model for this. the STaR can be done with a classifier. the model doesn't need to write out the answer, just pick it. however, the authors of the paper used a generative model. i'm thinking about trying to generalise my code to use either a generative model, or a pretrained model. the two are very similar. i was likely planning to do this.
translation to amnesiatic english:
On 7/5/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
my existing work is at https://github.com/xloem/rnlp .
i just got composer_nb_alter_2.py to work.
this is just an implementation of example code for a library called 'mosaicml composer', which i am trying out since my system is low end. it theoretically provides a pluggable framework to add optimizing additions to model training, to require fewer resources. the example uses a research data set to finetune a model to detect whether a phrase would be subjectively judged as "positive" or "negative" by a human being. models are made of a network of numbers called weights. their outputs are probabilities stored in logarithmic space, so they are called log probs. training a model means running a bunch of data through it, comparing the output to what is correct, taking the derivative, and nudging every weight by a small proportion of that derivative, for a very long time, until the outputs produce the results considered correct. finetuning means taking a trained model and doing that a little bit more with different data that has a similarity, so that the model will respond to the different data. it is _much_ faster than training from scratch, but requires an existing pretrained model, of which there are many. the STaR paper goes a little beyond finetuning, in order to require even less new data. this is called data augmentation and is also studied.
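to make that concrete, here is a rough sketch of that loop in plain pytorch. it assumes a huggingface-style model that returns a .loss when given labels in the batch; the function name and the hyperparameters are just mine, not composer's.

    import torch

    def finetune(model, dataloader, epochs=1, lr=1e-5):
        # gradient descent: nudge every weight by a small proportion of the
        # derivative of the loss, over and over
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for batch in dataloader:
                outputs = model(**batch)   # run data through the model
                loss = outputs.loss        # compare output to what is correct
                loss.backward()            # take the derivative
                optimizer.step()           # adjust the weights
                optimizer.zero_grad()
        return model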
i took an existing training sample for a masked language model, and made it work with a generative language model.
research models appear to usually be what's called "masked language models"; a well-known one is called BERT. these are well-respected models that are trained to fill in missing words in text (the words are "masked"). the popular models, like GPT, i'm calling "generative" models here. they aren't trained to handle masking: instead they predict the next word. i think they're designed to train on more data faster than a normal model, hence the presence of large and powerful ones. these models are trained so that, as they are called repeatedly, they generate long paragraphs of text similar to their training data. the two kinds of models are quite similar, but the software interfaces to access them, which are generally designed around research norms rather than DRY or anything, are quite different; and finetuning would be required to use either one effectively in the other's role.
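to show what i mean by the interfaces being different, a tiny huggingface sketch, assuming the small stand-in models distilbert-base-uncased and distilgpt2 (not the ones i'm actually using):

    from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                              AutoTokenizer)

    # masked model: fill in a [MASK] token somewhere in the text
    mtok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")
    inputs = mtok("the movie was [MASK].", return_tensors="pt")
    logits = mlm(**inputs).logits
    mask_pos = (inputs.input_ids == mtok.mask_token_id).nonzero()[0, 1]
    print(mtok.decode(logits[0, mask_pos].argmax()))

    # generative model: predict the next word, called repeatedly
    gtok = AutoTokenizer.from_pretrained("distilgpt2")
    clm = AutoModelForCausalLM.from_pretrained("distilgpt2")
    ids = gtok("the movie was", return_tensors="pt").input_ids
    print(gtok.decode(clm.generate(ids, max_new_tokens=5)[0]))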
the smallest pretrained generative models i found are bigger, so it takes forever to run on a cpu since it won't fit on my gpu.
finetuning a pretrained model is a fast and effective operation.
i am just restating here how impressive it can be that you can download a model that took a datafarm to produce and, on a home gpu, quickly finetune it to accurately perform a task its creators never imagined.
we don't actually need to use a generative model for this. the STaR can be done with a classifier. the model doesn't need to write out the answer, just pick it.
here i am treating masked language models as classifiers. my software interface caused them to appear this way. a classifier is a model that can only output a selection of one among a number of set classes. technically, GPT looks like a classifier to me: its set of classes is simply its entire vocabulary. more information: these models output a log prob for every class. the one with the largest value is then described as the choice. this works well for propagating the derivatives backward during training, and also lets the model guess a distribution of likelihoods when used.
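in code, the classifier view is just "take the largest log prob". a sketch with a huggingface sentiment classifier (the model name is an example, not the one from the composer example):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased-finetuned-sst-2-english"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    logits = model(**tok("what a lovely mess", return_tensors="pt")).logits
    probs = logits.softmax(-1)         # a distribution of likelihood over the classes
    choice = logits.argmax(-1).item()  # the class with the largest value is "the choice"
    print(model.config.id2label[choice], probs.tolist())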
however, the authors of the paper used a generative model.
i'm thinking about trying to generalise my code to use either a generative model, or a pretrained model. the two are very similar. i was likely planning to do this. it is a norm in software development to factor similarities into reusable components.
i meant masked model. both are pretrained. restated below. i'm thinking about trying to generalise my code to use either a generative model, or a masked model. the two are very similar. i was likely planning to do this. it is a norm in software development to factor similarities into reusable components.
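as a picture of what that shared component might look like, a hypothetical helper i sketched; the function name, the single-token-answer assumption, and the leading-space trick are mine, not from any library:

    # score a set of short answer choices with either kind of model, so the
    # rest of the code doesn't need to care which one it was handed
    def pick_choice(model, tokenizer, prompt, choices):
        if tokenizer.mask_token is not None:
            # masked model: put the blank where the answer goes
            inputs = tokenizer(prompt + " " + tokenizer.mask_token,
                               return_tensors="pt")
            pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        else:
            # generative model: the answer is whatever comes next
            inputs = tokenizer(prompt, return_tensors="pt")
            pos = -1
        logits = model(**inputs).logits[0, pos]
        # compare the log prob of each choice's first token
        ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0]
               for c in choices]
        return choices[logits[ids].argmax().item()]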
present state: i'm bumping into a bug in composer where one of the acceleration algorithms tries to use the wrong symbol. likely this worked fine in an older version, or has been fixed in the development version. i'm working on this because i realised that i had not yet enabled composer's acceleration; it isn't enabled in the example i'm copying. i haven't found an example where it's enabled for language modeling, but their docs report extensively on tests running it with language modeling. i thought it would be quick and simple, as mosaicml describes it as being. it may still be.
i found it; the wrong symbol is in the documentation, and i had copied it over. a rare time when i think it was actually the documentation and not me. patch at https://github.com/mosaicml/composer/pull/1259 . the next issue is with the next acceleration algorithm: for it to start learning on shorter sequences, the code expects my class to inherit from a different base.
i ended up enabling the language modeling accelerations with my time. i'm not sure whether they help or not. it looks like the total time on my system using them, for the task that's coded in, is about 2 hours and 18.3 minutes. it looks like composer is mostly designed for vision models. it has only two text-based techniques implemented; they are mostly for GPT. still it is great to have a general framework that works with existing pretrained models, which is important for the task of finetuning. somewhere out there is a framework for mutating models between forms and representations, but i have not found it. i got distracted finding some of lucidrains' recent work; lucidrains has an x-transformers repository at https://github.com/lucidrains/x-transformers that lets one design a transformer with various features described by papers enabled or disabled. it's the most general thing i've run into so far; a sketch is below. the 'alibi' technique used for gpt models in composer, and available in lucidrains' repository, provides for extending the input length much longer, due to how it mutates attention. there are a number of other techniques for extending this, too. i was sad to not see anything in lucidrains' stuff yet for adding mixture of experts to models. this lets models specialise contextually and use less runtime by making on-the-fly decisions about which parts to use. maybe within the next year.
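for reference, toggling those paper-features in x-transformers looks roughly like this. i'm going from memory of the readme, so the exact keyword names like alibi_pos_bias are assumptions to double check:

    import torch
    from x_transformers import Decoder, TransformerWrapper

    model = TransformerWrapper(
        num_tokens=50257,
        max_seq_len=1024,
        attn_layers=Decoder(
            dim=512,
            depth=6,
            heads=8,
            alibi_pos_bias=True,   # the alibi attention tweak, as in composer's gpt recipe
            alibi_num_heads=4,     # apply it to only some of the heads
        ),
    )
    logits = model(torch.randint(0, 50257, (1, 1024)))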
---- there is value in setting up so that this code runs on something faster: some remote system where running these 150 epochs would cost a few cents or less per test and complete faster.
rwkv's tiny enwik8 model would likely finetune on my system. i also know the rwkv model architecture well enough now to quickly tweak it to run in smaller ram.
i found https://adapterhub.ml/ . it's a repository of little 3MB sets of weights that can be tacked onto existing pretrained models to give them totally new tasks. its logo is a drawing of 'bert' from sesame street, except bert has been 75% turned into a robot, leaving out the right half of their face, which is still normal. i imagine the right half of bert's face represents the 3MB of weights that users can specify, so as to have some control over what huge corporations have their machines do.
i'm finding it fun now to consider accessing _all_ of the adapterhub finetuning weights for a model and ensembling them, so as to make a supersmart customized model. i guess after training a finetuning ensemble you could actually store it as a new finetuning to share with others. but maybe it is pretty much the same thing as finetuning a new model. but not if you don't have any ram O_O
this looks like a good framework for this job, although of course i am usually wrong:
- it's designed for finetuning already
- it uses a mutated form of finetuning that seems likely to require less ram
- it has extensive finetuning documentation
[filled in a different reason here after i forgot something]
[planned to write something after this but forgot what, could be misremembering]
adapters is language model focused, and adds classes for new research, something along those lines. what's cool is, if i make a new pretrained adapter based on a paper, i could publish it on their hub and maybe become the king of the solar system -- i mean, of that model adapter -- if somebody else uses it. maybe like what happens on a software forge if you write code, but with less fanciness.
i ran the first adapter example at https://docs.adapterhub.ml/quickstart.html . it did the exact same thing as the model i am finetuning for 2 hours, and it ran _super fast_, at the same time as the ongoing training, with no ram issues. the next section in the quickstart is doing your own training. this is where it gets hard to continue.
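roughly what that quickstart boils down to, from memory; treat the adapter id "sentiment/sst-2@ukp" and the class names as things to verify against the page:

    from transformers import AutoModelWithHeads, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

    # a few MB of adapter weights from the hub, bolted onto frozen bert
    adapter = model.load_adapter("sentiment/sst-2@ukp")
    model.set_active_adapters(adapter)

    logits = model(**tok("this ran super fast", return_tensors="pt")).logits
    print(logits.argmax(-1))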
i should make a profile picture like the logo of adapter transformers where your whole head is turned into a banal expressionless robot, like, holding a military rifle except for like, part of one eye, and the corners of that eye are crinkled into a caring smile :D
the adapter transformers training documentation is at https://docs.adapterhub.ml/training.html . it's sparse. basically, you train it as if it's a huggingface model, except you add two lines to put an adapter in. then, in theory, it does the same thing but uses much less ram and happens much faster. dunno. the paper i'm trying to copy here, STaR, did not use a huggingface model, so there's more for me to figure out on my own. if i consider it, though, models are pretty much all trained the same way: researchers put their data in formats that are relatively normative, then write scripts to load those formats and run them through models. i haven't fully considered this yet, busy learning about adapter transformers.
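the "two lines" version i'm picturing looks roughly like this, untested; the adapter name "star" and the toy dataset are mine, and the docs may prefer their own AdapterTrainer class over the plain huggingface Trainer:

    from transformers import (AutoModelWithHeads, AutoTokenizer, Trainer,
                              TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

    model.add_adapter("star")                         # line one: a fresh adapter
    model.add_classification_head("star", num_labels=2)
    model.train_adapter("star")                       # line two: freeze bert, train only it

    # toy stand-in data; the real thing would be the STaR rationales
    train_dataset = [dict(tok(t), label=l) for t, l in [("great", 1), ("awful", 0)]]

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1),
        train_dataset=train_dataset,
        tokenizer=tok,
    )
    trainer.train()
    model.save_adapter("out/star", "star")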
honestly all the run_mlm, run_clm scripts used here really bug me, partly because it is so difficult to bidirectionally interoperate python code with shell code, especially in a reusable way, but secretly mostly (or also) because it takes so much more memory to use two things and move between more panes ... and other issues i have with shifting gears and writing implementations.
i like to extract the training loop and put it in a new .py. this makes more sense to me, since the scripts aren't apis. i want to build an api.
hi, i'm doing the things described in this thread. i want to implement the STaR paper so i can exchange dialogue with something that is logically consistent. i'm not usually able to do this, especially during psychotic breaks, or online.
one of the things i like/love about STaR is that the researchers chose to include reasons in their work. the model pattern means the model can learn to explain its reasons at arbitrary granularity.
On Tue, 5 Jul 2022 18:51:52 -0400 "Undiscussed Horrific Abuse, One Victim of Many" <gmkarl@gmail.com> wrote:
i ended up enabling the language modeling accelerations with my time
this fucking 'karl' asshole has to be banned. It makes no sense to allow hundreds of spam messages per week to be posted as if they were NOT spam.
oh and by the way karl. Now your messages get filtered directly to trash. Just like the messages of your coworker, jewnazi professor turd. Congratulations, asshole.
translation to amnesiatic english:
On 7/5/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
ok let's try to implement STaR a little tiny smidge
by STaR i mean this paper: https://arxiv.org/abs/2203.14465 . it is a way to use a generative language model to make logical common sense decisions that stay accurate when transferred to new domains, _without_ producing a ton of data or performing a ton of model training.
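my paraphrase of the loop in the paper, as a sketch rather than their code; generate and finetune are placeholder callables you would supply:

    # paraphrase of the STaR outer loop, not the authors' code.
    # generate(model, question, hint=None) -> (rationale, answer_guess)
    # finetune(base_model, examples) -> a newly finetuned model
    def star(base_model, problems, generate, finetune, iterations=10):
        model = base_model
        for _ in range(iterations):
            kept = []
            for question, answer in problems:
                # ask the model to write a rationale and then an answer
                rationale, guess = generate(model, question)
                if guess != answer:
                    # "rationalization": hint the correct answer and ask the
                    # model to justify it backwards
                    rationale, guess = generate(model, question, hint=answer)
                if guess == answer:
                    kept.append((question, rationale, answer))
            # finetune from the original pretrained weights each round,
            # on only the rationales that led to correct answers
            model = finetune(base_model, kept)
        return model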
basically, you need to be able to finetune or train a language model. this means paying a few dollars for time, having a powerful GPU,
time can be purchased on google cloud TPUs (this might mesh well with the paper; the model they used was made by people who used TPUs), or vast.ai, or many other places. it is common for language model services to provide finetuning as a paid service (maybe between 20% and 60% of the services i've found provide this). a powerful gpu means a lot of VRAM. the lowest end is the tesla K80. higher end gpus start with the letter A and then have a big number after them. nvidia has dominated this for a while but other corporations are stepping up. you can run gpus in parallel if one alone doesn't have enough gpu ram or speed, but this does mean learning more code or systems to interface with them.
compromising with a very small model, or implementing research algorithms.
i commonly daydream of research since i have trouble directing my body, so i have a lot of ways in my head to improve on things that are very hard for me to try out. i haven't seen much research, but i get the impression there is a lot of stuff out there that simply needs to be combined together across domains. a _lot_ of stuff. part of it may get pretty obvious if you look inside the source code of machine learning libraries: many things have seemed unoptimized to me. often huge popular papers are simply performing an algebraic operation to reduce system load, like everybody was doing 40 years ago to make anything run at all.
my plan is to compromise with a very small model, while considering paying a few dollars to train on somebody else's server.
using a very small model means it won't be able to hold as many concepts in parallel, or concepts as complex, since it won't have as many places to store separate information. so things work if the domain is made small.