[ot][spam][crazy][log] transformer?
https://discord.gg/797Hb2HR https://discord.com/channels/886589909299777576/987686110777966592/102737200... baffo32 2022-10-06 00:09 UTC There’s been a paper quietly kicking around that speeds up model training by up to 370x, flattens architectures to a single layer, drops memory requirements by 10x, and effectively gives long context: https://kdd-milets.github.io/milets2022/papers/MILETS_2022_paper_5942.pdf . It’s for the upcoming NeurIPS 2022. I’ve been kicking it around in my mind a smidge, and I’m thinking any implementation of this paper at all, by anybody at all, would be so useful and likely appreciated that it would be worth trying to do. The concept of the paper is not complex, but it’s quite hard for me to approach, as usual.
2022-10-06T08:29+00:00 Last night I started looking into the “alibi” algorithm within https://github.com/mosaicml/composer , which is a plugin-based way to mutate the component of transformers that the milets paper discusses. I’ve forked it to https://github.com/xloem/composer , presently into a placeholder branch named “wip”, and have started laying out a little space to put holographic representations in. I’m referring to the source of https://github.com/MahmudulAlam/Holographic-Reduced-Representations for simple and normative ways to implement the core function. With these resources open next to each other, it’s quite easy to see the effort as one of simple boilerplate. This morning part of my mind is badly misbehaving, and that’s scary for me. I think some of me is terrified about use of machine learning for oppression. But I think part of me still sees this as very useful. I have two appointments today that are physically difficult for me to prepare for.
0928 I’ve put hrr utility functions into the gpt2 attention replacement file, changed its attn matmul to be a call to hrr binding (about 1/3rd of the mutations needed), and pushed to git. I’m unsure whether to use 2d or 1d calls; 2d seems a little more logical, although the paper sums along the sequence dimension, so likely only 1d is needed. I’m also unsure yet how I will test my implementation: part of the reason for factoring the hrr calls out. Next step is to learn how gpt tracks the attention mask so as to make it work for hrr.
0942 I’ve realised I should use 1d binding rather than 2d so that token masks apply correctly. The distraction of this brief incorrectness is confusing me enough that I’m taking a break to stabilise some.
1204 I’ve started into the second half, where the attention output is produced, and pushed to git. I’m handling a lot of spasmodic thinking; it’s very hard to stabilise small concepts around the code in my working memory. I’m thinking I’ll step away to prepare for my day some.
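For reference, the core HRR operations that repo implements are circular convolution (binding) and circular correlation (approximate unbinding), both cheap in the Fourier domain. A minimal numpy sketch of the idea (function names are my own, not taken from either repo):

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution, computed as a pointwise product of FFTs."""
    return np.fft.ifft(np.fft.fft(a, axis=-1) * np.fft.fft(b, axis=-1), axis=-1).real

def unbind(trace, a):
    """Approximate unbinding: circular correlation with a (conjugated FFT).

    Recovers b from bind(a, b) up to noise when components are drawn ~N(0, 1/d);
    exact recovery needs unitary vectors, which the real implementations address."""
    return np.fft.ifft(np.fft.fft(trace, axis=-1) * np.fft.fft(a, axis=-1).conj(), axis=-1).real
```

With vectors drawn from N(0, 1/d) at a reasonable dimension, unbind(bind(a, b), a) is clearly correlated with b while staying near-orthogonal to unrelated vectors, which is what makes superposing many bindings in one trace workable.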
2022-10-06 1319+0000 I’m back for a bit after some things. I feel like looking at the code some more. I have a couple hours until my appointments, and am ready enough to go.
1349 I’ve gotten through the attention output. Torch has an existing function for calculating cosine similarity. Cosine similarity is simply the normalized dot product of the vectors: i.e. the cosine of the n-dimensional angle between them. I’ve pushed my draft to git. Next is organisation, testing, and then often a bit of recoding when I find my mistakes. Noting I still have time before my appointments.
1430 I’ve quickly organised the draft a little and tested that it imports and constructs without parse failures. Pushed to git. I’m spinning out a little. My appointments are soon. Next is relearning how to use this library to test my added algorithm.
1436 Since I copied an existing algorithm almost exactly, I can run all its tests on my copied code. The tests also include an example of using it, which seems the simplest way to learn this. I’m running into pytest crashing weirdly after updating a dependency.
1457 looks like i need to enable the [dev] extra on mosaicml composer to pull in the test dependencies. the tests run now! i have 5 mistakes to resolve.
1503 i pushed a change to git. the vanilla tests are segfaulting on my system, associated with tensorflow. my algorithm doesn’t use tensorflow, so i drill down into just it. the tests over the algorithm all pass — but having looked at them i know they don’t verify its output, just its structure. next is to try to use the code to do something simple, and compare a result against one without the change. my appt is at 1600; i’m in the habit of going a little early to protect against missing it from weird states of mind i can have.
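As I read the paper, the attention replacement superposes all key–value bindings into one trace, retrieves per query, and scores each retrieval with that cosine similarity. A hedged, untested numpy sketch of that shape (all names here are mine; the real code is torch inside the forked composer repo):

```python
import numpy as np

def bind(a, b):
    # HRR binding: circular convolution via FFT
    return np.fft.ifft(np.fft.fft(a, axis=-1) * np.fft.fft(b, axis=-1), axis=-1).real

def unbind(trace, a):
    # approximate inverse of bind: circular correlation
    return np.fft.ifft(np.fft.fft(trace, axis=-1) * np.fft.fft(a, axis=-1).conj(), axis=-1).real

def cosine_similarity(a, b, eps=1e-8):
    # normalized dot product: the cosine of the n-dimensional angle
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def hrr_attention(q, k, v):
    # q, k, v: (T, d); no attention matrix, just one superposed trace
    trace = bind(k, v).sum(axis=0)          # (d,) sum over the sequence dimension
    retrieved = unbind(trace[None, :], q)   # (T, d) per-query retrieval
    w = cosine_similarity(retrieved, v)     # (T,) retrieval quality per position
    return w[:, None] * v                   # weight each value by its retrieval quality
```

The point of the structure is that nothing here is quadratic in T, which is where the speed and memory claims come from.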
2022-10-06 2003+0000 I’m working on writing a test training loop. I’ve written a basic data generator, copied in some of the hyperparameters from the paper, and linked the algorithm to a toy gpt2 architecture from transformers. I’m running into strong psychological issues, which is normal for me when trying to write a training loop.
2020 i’ve drafted a training loop. i know it has missing parts i’ve glossed over, from how i get when i write them. an exception is throwing in code i thought i had avoided. i’m surprised to still be working. very tangible, the harsh multidirectional internal experiences. i usually take a break when things are this intense, to help retain my memories and thoughts and feelings a little.
2027 i’ve fixed the unexpected exception, which was due to mistakes within the mutations i made to library code, which look like they trigger when it is loaded. i’m confused that the tests did not catch them, but at this point the training loop will. i’m now engaging bugs in my test code. i’m doing somewhat better.
2045 i’ve fixed a number of mistakes in my training loop, and am now engaging bugs in the forward function of the algorithm. [i’m taking a break to at least find water, but still feel the multidirectional intensity and energy. i’ve pushed the current state to git.]
2100 i have water. i was working on a remote system so as to run bigger tests than my raspberry pi can hold. the internet has stopped working on the device, interrupting this. i am taking a break, i guess.
2022-10-07 0550 I’m having a sense of stronger cognitive issues than I want to provide for, and am planning to take a break for a few days. The next bug was that the causal mask did not have the right shape for the sequence length.
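For concreteness, the kind of basic data generator meant here, later described as binary addition: fixed-width addition problems rendered as token ids. This exact format and vocabulary are hypothetical illustration, not what is in the repo:

```python
import random

# tiny vocabulary for binary addition strings; '.' terminates an example
VOCAB = {c: i for i, c in enumerate("01+=.")}
INV = {i: c for c, i in VOCAB.items()}

def addition_example(bits=4, rng=random):
    # one problem like "0101+0011=01000." encoded as a list of token ids;
    # the sum is padded to bits+1 digits to cover the carry
    a = rng.randrange(2 ** bits)
    b = rng.randrange(2 ** bits)
    text = f"{a:0{bits}b}+{b:0{bits}b}={a + b:0{bits + 1}b}."
    return [VOCAB[c] for c in text]
```

A generator like this makes every example checkable by re-parsing it, which helps when the model and the data code are both suspect.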
2022-10-07 0935+0000 notes while away:
- the causal-shape exception seems like a sequence length issue elsewhere, sounds fun
- exponential decay may need to be updated every epoch
- epochs should have multiple batches
1203 it actually looks like using a causal mask here would mean improving upon the research in the paper, or at least understanding the math more in-depth. without already having a working example of anything at all, it would make sense not to use one at first, which kind of means classification or regression data rather than sequential generation data
1229 i glanced through this gpt implementation and it seems to always have the causal mask enabled. i quickly copied the core changes in the gpt algorithm over to bert, untested, and pushed to git. excited to switch to bert. it was hard to make the decision.
1425 popping on between stuff to note that causal masking can likely be done simplistically by changing the binding sum to an accumulation. this might broadcast the unbinding, slowing the kernel, but not changing the overall complexity. could be wrong.
1636 i’ve patched cumsum into the gpt variant for causal masking, and gotten my test to run. loss actually is reducing steadily on the first run O_O kind of dissociated and pushed to git quickly
1727 in my tiny test (2M tokens of binary addition), i see slightly better results without HRR, and faster time with HRR. i’m using the cumsum hack and an incredibly tiny gpt model with only 4 params. (i dropped the params to see if the approaches would get more different results; they didn’t.) i’m not seeing order of magnitude differences with this tiny setup.
1755 i added a layer and separated out the data gen overhead from the actual model passes, and there is actually a 60% training speedup with comparable loss, with 8 params and the cumsum slowdown. my interest has waned some.
i’d like to port t5 since seq2seq models are supposed to perform much better, and org my code, …
1812 https://github.com/mosaicml/composer/issues/1602
editor’s note for visitors not experienced with machine learning: models can have billions of parameters and operate on very complex data, so my differing results with only a few params and unimaginably trivial data would not be considered relevant
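The cumsum hack from the 1425 and 1636 notes above, sketched in numpy under the same hedges as before: replacing the single sum of bindings with a cumulative sum along the sequence axis means position t's trace only superposes positions up to t, which is exactly a causal mask.

```python
import numpy as np

def bind(a, b):
    # HRR binding: circular convolution via FFT
    return np.fft.ifft(np.fft.fft(a, axis=-1) * np.fft.fft(b, axis=-1), axis=-1).real

def causal_traces(k, v):
    # cumulative sum over the sequence axis: traces[t] contains bindings 0..t only,
    # so unbinding at position t cannot retrieve anything from later positions
    return np.cumsum(bind(k, v), axis=0)   # (T, d) instead of a single (d,) trace
```

This is where the broadcast cost comes from: each position now unbinds against its own trace rather than one shared trace, but the asymptotic complexity in sequence length is unchanged.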
participants (1)
-
Undescribed Horrific Abuse, One Victim & Survivor of Many