i ran it again until colab logged me out. the loss dropped to around 0.7. apparently colab lets you have a gpu again if you disable background execution, so i'm running it some more, just to make effective use of the time.

i looked into how longt5 works. basically it locally contextualises regions of its input, but not regions of its output during generation (changing that would just be a flag, but it's how they pretrained it). so it is good at reading very long things and then outputting very short things that conclude from them. it is also documented as having a limit of 16k tokens, so it is not fully general.

while working i added a tokenizer hack to handle things like linebreaks. i haven't tested it yet, since the unsupervised training (i'm calling it grooming to help stuff) is effective whether or not the data is perfect. [unsupervised grooming is possibly a larger issue.] i also added stubs for other models: xnn and transformer-xl, which i found the repo for. unfortunately, transformer-xl uses a different kind of tokenizer that my hack doesn't quite work for. still, the time it takes to train another adapter would give space to figure out how to make that tokenizer work.

i think what makes sense next for me, after reviewing the details of this longt5 model i've spent a couple of days on, is to find a way to combine the different files of a commit into one input. this would make the model much more effective, since it could learn the relationships between files rather than memorising which files are in a repository, and then output specific updates for individual files. i also found that the huggingface interface to longt5 lets you 'prompt' the t5 model with initial decoder ids, so if the model accepted all the relevant files as input, you could prompt it with each separate file to produce the output for each one in smaller bundles, since it has a much smaller output window than its input window.
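
for reference, the linebreak hack is roughly along these lines (a minimal sketch with made-up names, not my actual code; the "<nl>" sentinel and the google/long-t5-tglobal-base checkpoint are placeholders). t5's sentencepiece tokenizer drops newlines, so the idea is to swap them for a dedicated token before encoding and swap them back after decoding:

```python
# minimal sketch of a linebreak-preserving tokenizer wrapper for longt5
# (assumes huggingface transformers; "<nl>" and the checkpoint name are placeholders)
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

NL = "<nl>"  # sentinel standing in for "\n", which sentencepiece would otherwise drop

tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
model = LongT5ForConditionalGeneration.from_pretrained("google/long-t5-tglobal-base")

# register the sentinel as a new token and grow the embedding table to match
tokenizer.add_tokens([NL])
model.resize_token_embeddings(len(tokenizer))

def encode(text, **kwargs):
    # replace real linebreaks with the sentinel before tokenizing
    return tokenizer(text.replace("\n", NL), return_tensors="pt", **kwargs)

def decode(token_ids):
    # undo the substitution so generated text gets its linebreaks back
    # (the decoder tends to put spaces around added tokens, so strip those too)
    text = tokenizer.decode(token_ids, skip_special_tokens=True)
    return text.replace(" " + NL + " ", "\n").replace(NL, "\n")
```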
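
combining the commit files could be as simple as concatenating them with per-file markers, so the model sees every file touched by a commit in one input. a rough sketch, where the "<msg>"/"<file>" markers and the data shape are my own assumptions:

```python
# minimal sketch of packing every file touched by a commit into one longt5 input
# (the "<msg>"/"<file>" markers and the input format are my own assumptions)
def pack_commit(commit_message, files):
    """files: dict mapping path -> file contents before the change."""
    parts = ["<msg> " + commit_message]
    for path, text in files.items():
        parts.append("<file> " + path + "\n" + text)
    # the linebreaks survive encoding thanks to the sentinel substitution above
    return "\n".join(parts)
```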
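
and the decoder 'prompting' would look something like the following: huggingface generate() accepts decoder_input_ids for encoder-decoder models, so you can seed the output with a per-file prefix and let the model continue with the update for just that file. a sketch reusing the names from the snippets above, untested:

```python
# minimal sketch of 'prompting' the decoder with one target file at a time
# (reuses tokenizer/model/encode/decode from the sketches above; untested)
import torch

def generate_for_file(packed_input, target_path, max_new_tokens=512):
    enc = encode(packed_input, truncation=True, max_length=16384)

    # seed the decoder with its start token plus a per-file prefix, so the
    # continuation it generates is the update for that one file
    prefix_ids = tokenizer("<file> " + target_path,
                           add_special_tokens=False,
                           return_tensors="pt").input_ids
    start = torch.tensor([[model.config.decoder_start_token_id]])
    decoder_input_ids = torch.cat([start, prefix_ids], dim=-1)

    out = model.generate(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        decoder_input_ids=decoder_input_ids,
        max_new_tokens=max_new_tokens,
    )
    # generate() returns the decoder prompt too, so slice it off
    return decode(out[0][decoder_input_ids.shape[-1]:])
```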