2022-10-07 0935+0000 notes while away:
- causal shape exception seems like a sequence length issue elsewhere, sounds fun
- exponential decay may need to be updated every epoch
- epochs should have multiple batches

1203 it actually looks like using a causal mask here would mean improving upon the research in the paper, or at least understanding the math more in depth. without a working example of anything at all yet, it would make sense not to use one at first, which roughly means classification or regression data rather than sequential generation data.

1229 i glanced through this gpt implementation and it seems to always have the causal mask enabled. i quickly copied the core changes in the gpt algorithm over to bert, untested, and pushed to git. excited to switch to bert. it was a hard decision to make.

1425 popping on between other things to note that causal masking can likely be done simply by changing the binding sum to an accumulation. this might broadcast the unbinding, slowing the kernel, but not changing the overall complexity. could be wrong. (a rough sketch of this idea is at the end of these notes.)

1636 i've patched cumsum into the gpt variant for causal masking and gotten my test to run. loss is actually reducing steadily on the first run O_O kind of dissociated and pushed to git quickly.

1727 in my tiny test (2M tokens of binary addition, sketched at the end of these notes), i see slightly better results w/out HRR, and faster times w/ HRR. i'm using the cumsum hack and an incredibly tiny gpt model with only 4 params. (i dropped the params to see whether the approaches would produce more different results; they didn't.) i'm not seeing order-of-magnitude differences with this tiny setup.

1755 i added a layer and separated the data-gen overhead from the actual model passes, and there is actually a 60% training speedup with comparable loss, at 8 params and with the cumsum slowdown. my interest has waned some. i'd like to port t5, since seq2seq models are supposed to perform much better, and organize my code, …

1812 https://github.com/mosaicml/composer/issues/1602

editor's note for visitors not experienced with machine learning: models can have billions of parameters and operate on very complex data, so my differing results with only a few params and unimaginably trivial data would not be considered relevant.
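
sketch for the 1425/1636 entries: a minimal, untested illustration of how HRR-style attention (binding key/value pairs with circular convolution and unbinding with the query) can be made causal by swapping the sum over positions for a running sum. the function names and shapes here are my own for illustration, not the repo's actual code.

import torch

def bind(a, b):
    # circular convolution via FFT: elementwise product in the frequency domain
    fa = torch.fft.rfft(a, dim=-1)
    fb = torch.fft.rfft(b, dim=-1)
    return torch.fft.irfft(fa * fb, n=a.shape[-1], dim=-1)

def unbind(s, q):
    # circular correlation: the same product with one operand conjugated
    fs = torch.fft.rfft(s, dim=-1)
    fq = torch.fft.rfft(q, dim=-1)
    return torch.fft.irfft(fs * fq.conj(), n=s.shape[-1], dim=-1)

def hrr_attention(q, k, v, causal=False):
    # q, k, v: (batch, seq, dim)
    pairs = bind(k, v)                           # one bound vector per position
    if causal:
        # running sum: position i only superposes bindings from positions <= i,
        # which stands in for a causal mask
        memory = torch.cumsum(pairs, dim=1)      # (batch, seq, dim)
    else:
        # original form: one shared superposition broadcast across every query
        memory = pairs.sum(dim=1, keepdim=True)  # (batch, 1, dim)
    return unbind(memory, q)                     # (batch, seq, dim)

the cumsum keeps the same overall complexity as the plain sum, but the unbind now runs against a per-position memory instead of a single broadcast vector, which is consistent with the slowdown noted in the 1425 and 1755 entries.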
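
for context on the 1727 test, here is a hypothetical sketch of what a toy binary-addition dataset could look like; the actual generator isn't in these notes, so the format and names are guesses.

import random

def addition_example(max_bits=8):
    a = random.randrange(2 ** max_bits)
    b = random.randrange(2 ** max_bits)
    # e.g. "00001011+00000110=10001", consumed as a character sequence
    # for next-token prediction
    return f"{a:0{max_bits}b}+{b:0{max_bits}b}={a + b:b}"

examples = [addition_example() for _ in range(1000)]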
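
the 60% figure at 1755 depends on timing only the model passes, not the data generation. a hedged sketch of that kind of split, with placeholder names (the model here is assumed to return its loss directly, which may not match the actual code):

import time

def timed_step(model, optimizer, make_batch):
    # time data generation separately from the forward/backward pass
    t0 = time.perf_counter()
    inputs, targets = make_batch()
    data_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    loss = model(inputs, targets)  # assumption: forward pass returns the loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    model_time = time.perf_counter() - t1

    return data_time, model_time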