[spam] [personal] perceiver model notes
- perceiver was made by deepmind (google); it's a specific attention tweak to the encoder/decoder transformer, i think
- perceiver is now in huggingface in src/transformers/models/perceiver

some of the classes, for the second half of the pipeline, also preprocessors:
  PerceiverAbstractDecoder: decoder base class
  PerceiverMultimodalDecoder: decodes model results into many forms of simultaneous results
  PerceiverBasicVideoAutoencodingDecoder: for models that produce video
  PerceiverOpticalFlowDecoder: for models that produce optical flow information
  PerceiverClassificationDecoder: for models that produce labels from a fixed set
  PerceiverBasicDecoder: a basic decoder for producing data, using cross attention
  PerceiverProjectionDecoder: a decoder that does not use cross attention
  Conv2DDownsample: downsamples data 4x using torch.nn.Conv2d and padding
  PerceiverAbstractPositionEncoding: base class for position encoding
  PerceiverTrainablePositionEncoding: position encoding that is trained
  PerceiverFourierPositionEncoding: position encoding that produces normal (fourier sinusoid) position embeddings based on channel
    people say that without position embeddings, the channel on which data is received is not used as information
  AbstractPreprocessor: preprocessor base class
  PerceiverTextPreprocessor: an embedding encoder for perceiver
  PerceiverEmbeddingDecoder: an embedding decoder for perceiver
    embeddings are matrices that convert between integer ids (words) and n-dimensional vectors (points in meaning space)
  PerceiverMultimodalPreprocessor: converts many kinds of data to a single group of input
  PerceiverAudioPreprocessor: converts audio to transformer input
  PerceiverOneHotPreprocessor: adds a dummy index dimension to input
  PerceiverImagePreprocessor: converts an image to transformer input, performs significant transformation
  PerceiverMultimodalPostprocessor: unconverts data into different kinds of postprocessed data
  PerceiverProjectionPostprocessor: uses linear combination to downsample data [training prevents information loss]
  PerceiverAudioPostprocessor: downsampling for audio features
  PerceiverClassificationPostprocessor: downsampling for classification log probs to a set of labels
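as an aside, the masked language modeling stack explored below is apparently also published as a pretrained checkpoint on the huggingface hub. a minimal sketch of loading it, assuming the checkpoint name i recall ('deepmind/language-perceiver') is right:

  import transformers

  tokenizer = transformers.PerceiverTokenizer.from_pretrained('deepmind/language-perceiver')
  model = transformers.PerceiverForMaskedLM.from_pretrained('deepmind/language-perceiver')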
let's check out perceiver's masked language modeling architecture just a little bit

PerceiverForMaskedLM (in github.com/huggingface/transformers, file src/transformers/models/perceiver/modeling_perceiver.py)

i'm skipping down to the implementation of the forward() function, as this will give me a quick description of what the model does with its parts. summary in pseudocode:

  outputs = self.perceiver(inputs)
  logits = self.embedding_decoder(selects output part depending on flag)
  loss = crossentropyloss(logits, labels)
  return logits and loss

ok so it basically has a 'perceiver' member that does the bulk work, and an 'embedding_decoder' that postprocesses the data. so i flip up to __init__ where these objects are created:

  self.perceiver = PerceiverModel(
      input_preprocessor = PerceiverTextPreprocessor(),
      decoder = PerceiverBasicDecoder(lots of dimension information)
  )
  self.embedding_decoder = PerceiverEmbeddingDecoder()

so it looks like there are 4 parts here the information likely flows through.
1. Maybe PerceiverTextPreprocessor processes the input
2. Maybe PerceiverModel processes the preprocessed data
3. Maybe PerceiverBasicDecoder post-processes the data
4. Finally PerceiverEmbeddingDecoder likely converts high-dimensional outputs to simple byte probabilities

Next is to look at PerceiverModel's forward function to see how important the non-parameterised behavior of the class is.
forward function of perceivermodel, when conditioned with only input_preprocessor and decoder, pseudocode:

  # step 1
  inputs and dimension info = input_preprocessor(inputs)
  # step 2
  encoder_outputs = self.encoder(
      self.embeddings(batch_size),
      inputs
  )
  # step 3
  decoder_outputs = decoder(
      decoder.decoder_query(inputs),
      z=encoder_outputs[0]
  )
  # logits would be additionally processed through output_postprocessor if one were provided
  return decoder_outputs.logits
so a perceiver model is:

  input -> preprocessor -> encoder [required] -> decoder [processes encoded and unencoded input together] -> postprocessor -> output

let's see how the encoder is constructed in PerceiverModel.__init__:

  self.embeddings = PerceiverEmbeddings()
  self.encoder = PerceiverEncoder(kv_dim)

So most of the functionality of a perceiver model is in the PerceiverEncoder class.
I'm interested in going through each step and describing the data flow in human words. Or at least starting this. Could make substeps if the steps are complex.

Step 1: Input preprocessing

PerceiverTextPreprocessor collects token embeddings and position embeddings into one summed tensor. They are both simple trainable embeddings. An embedding converts tokens into vectors that relate to their use.

  Tokens or bytes -> PerceiverTextPreprocessor -> Embedding vectors
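a minimal sketch of that summing in plain torch, with made-up sizes rather than the transformers class:

  import torch

  vocab_size, max_positions, d_model = 262, 2048, 768        # made-up sizes
  token_embed = torch.nn.Embedding(vocab_size, d_model)
  pos_embed = torch.nn.Parameter(torch.randn(max_positions, d_model))

  token_ids = torch.randint(0, vocab_size, (1, 16))          # a batch of 16 byte ids
  seq_len = token_ids.shape[1]
  embeddings = token_embed(token_ids) + pos_embed[:seq_len]  # summed, shape (1, 16, d_model)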
Figuring out Step 2: Encoding

Here's a copy-paste of this call:

  embedding_output = self.embeddings(batch_size=batch_size)
  encoder_outputs = self.encoder(
      embedding_output,
      attention_mask=None,
      head_mask=head_mask,
      inputs=inputs,
      inputs_mask=extended_attention_mask,
      output_attentions=output_attentions,
      output_hidden_states=output_hidden_states,
      return_dict=return_dict,
  )

It collects the inputs with self.embeddings and produces a tensor. self.embeddings is a PerceiverEmbeddings, which is just a matrix of parameters that the training process can mutate.

self.encoder is a PerceiverEncoder. It's initialised with a kv_dim parameter that is taken from the input preprocessor:

  self.encoder = PerceiverEncoder(
      config,
      kv_dim=input_preprocessor.num_channels if input_preprocessor is not None else config.d_model
  )

Here's num_channels in PerceiverTextPreprocessor:

  @property
  def num_channels(self) -> int:
      return self.config.d_model

So kv_dim is the `d_model` model configuration parameter. Maybe a way to describe how complex the model is.

On to PerceiverEncoder:

## construction
  cross_attention = PerceiverLayer(is_cross_attention=True, qk_channels, v_channels, num_cross_attention_heads, q_dim=d_latents, kv_dim=d_model, cross_attention_widening_factor, use_query_residual)
  # i'm including the configuration parameters more to see what parts of the architecture the configuration relates to
  self_attends = list of PerceiverLayer(is_cross_attention=False, qk_channels, v_channels, num_self_attention_heads, q_dim=d_latents, kv_dim=d_latents, self_attention_widening_factor)

## forward func
  hidden_states = embeddings
  layer_outputs = cross_attention(hidden_states, attention_mask, inputs, inputs_mask)
  layer_outputs = self_attends(layer_outputs[0], attention_mask, layer_head_mask[i])
  return layer_outputs[0]

It looks very confusing with all the parameters, but all it's really doing is passing data into a single layer called "cross attention", then a stack of layers called "self attention", and outputting the result.

I'm interested in the masks. These will tell which computation parts to toggle based on what input data is unavailable and such.
- the attention mask applies to every layer
- the inputs mask is passed with the inputs to the cross attention layer
- the head masks are specific to each self attention layer

How were these passed in for masked language modeling?
From PerceiverModel above:
  attention_mask=None,
  head_mask=head_mask,
  inputs=inputs,
  inputs_mask=extended_attention_mask,

extended_attention_mask appears to be some transformation of the attention_mask passed in. They're passed down from PerceiverForMaskedLM:

  outputs = self.perceiver(
      inputs=inputs,
      attention_mask=attention_mask,
      head_mask=head_mask,
      output_attentions=output_attentions,
      output_hidden_states=output_hidden_states,
      return_dict=return_dict,
  )

There's no inputs mask. There is a head mask and an attention mask. So what do these masks provided to PerceiverForMaskedLM do?
- Maybe the attention mask disables columns of the stacks of layers, for input data that is not present.
- Maybe the head mask does this on a per-layer basis, as a vector.

The reason I'm looking at PerceiverForMaskedLM is to consider using perceiver for translation modeling. This would work much more simply on systems with radically disparate but related input and output tokenisations, like the ones I'm engaging in the other thread.

T5 has a strange thing where further data is engaged midway through the layer stack. I don't understand it yet, but maybe if it's important, the 'head mask' in PerceiverForMaskedLM could be appropriated to do something similar? I don't know. Maybe the decoder portion, later, is more appropriate.
I typed Step 2 almost fully out here, but the browser window left and it has disappeared.

Anyway, in Step 2 the input data Embedding Vectors are fed into PerceiverEncoder as "inputs". PerceiverEncoder mutates them using "cross attention" with the passed "hidden states", which appear to be the "embeddings" property of PerceiverModel, left to be trainable in PerceiverForMaskedLM, then passes them through a sequence of "self attention" layers.

Step 2: Encoding

  Embedding vectors + attention mask -> PerceiverEncoder.cross_attention -> loop: [hidden states -> PerceiverEncoder.self_attends -> hidden states] -> hidden states

Hidden states are just tensors, i.e. n-dimensional arrays of numbers.

Simplification:

  Masked embedding vectors -> PerceiverEncoder Cross Attention with Embedding parameters -> PerceiverEncoder Self Attention stacks -> Encoded hidden states

But really the masking is done inside PerceiverEncoder.
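a rough stand-in for this flow using torch.nn.MultiheadAttention (this is not the transformers implementation, just a sketch of the shapes, with made-up sizes):

  import torch

  d_latents, d_model, num_latents = 256, 768, 64
  latents = torch.nn.Parameter(torch.randn(num_latents, d_latents))          # like PerceiverEmbeddings
  cross_attn = torch.nn.MultiheadAttention(d_latents, num_heads=8,
                                           kdim=d_model, vdim=d_model, batch_first=True)
  self_attns = torch.nn.ModuleList([
      torch.nn.MultiheadAttention(d_latents, num_heads=8, batch_first=True) for _ in range(6)
  ])

  inputs = torch.randn(1, 2048, d_model)                              # the preprocessed embedding vectors
  hidden, _ = cross_attn(latents.expand(1, -1, -1), inputs, inputs)   # latents query the inputs
  for layer in self_attns:
      hidden, _ = layer(hidden, hidden, hidden)                       # the latents attend to themselves
  # hidden is now the encoded hidden states, shape (1, num_latents, d_latents)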
Figuring out Step 3: Decoding

Step 3 is interesting because it takes both the unencoded embedding vectors (inputs) and the encoded hidden states as input.

PerceiverBasicDecoder configuration parameters:
  output_num_channels = d_latents
  output_index_dims = max_position_embeddings
  num_channels = d_model
  v_channels = d_model
  qk_channels = 8 * 32
  num_heads = 8
  use_query_residual = False
  final_project = False

Step 3a: Forming the decoder query

Hopefully this is incredibly simple.

  # decoder_query(inputs):
  pos_emb = self.output_position_encodings(batch_size)
  pos_emb = self.positions_projection(pos_emb)
  pos_emb = [inputs_without_pos, pos_emb]   # conditioned on a flag i haven't checked
  return pos_emb

  # __init__
  output_position_encodings, positions_projection = build_position_encoding('trainable', **)

  # build_position_encoding
  return PerceiverTrainablePositionEncoding(), nn.Linear(channels, project_pos_dim)

PerceiverTrainablePositionEncoding is just some trainable parameters like before. So basically `decoder_query` takes inputs_without_pos and some trainable parameters and concatenates them together.

I'd been leaving inputs_without_pos out. Revisiting PerceiverTextPreprocessor, it looks like None is returned for this output. So `decoder_query` just returns some trainable parameters that are only conditioned on the input for dimension sizing.
Figuring out Step 3b: Decoding

  decoder_outputs = self.decoder(
      query=decoder_query,                  # looks like just trainable parameters
      z=sequence_output,                    # these are the encoded hidden states
      query_mask=extended_attention_mask,   # huh, the mask is maybe applied to the query
  )

  # PerceiverBasicDecoder.forward()
  layer_outputs = decoding_cross_attention(
      query,
      attention_mask=query_mask,
      inputs=z
  )
  logits = final_layer(layer_outputs[0])
  return logits

  # __init__
  decoding_cross_attention = PerceiverLayer(
      is_cross_attention=True,
      kv_dim=d_latents,
      **kwparams   # the rest of the dimensionality configuration is taken from the call constructing the BasicDecoder
  )
  final_layer = nn.Linear(num_channels, output_num_channels)

So, basically the decoder is a cross attention layer just like the first layer in the encoder. The "query" is used for the "hidden states" parameter, and the "inputs" are ferried along to the "inputs" parameter, as if it were an encoder. Just like the encoder, trainable parameters are used for the auxiliary data, and the "inputs" are passed along as the "inputs" data.

It would be helpful for me at some point to put time into learning the qkv terminology inside the attention layers. That would make these layers less confusing.

Step 3:

  Encoded hidden states -> PerceiverBasicDecoder cross attention with embedding parameters -> PerceiverBasicDecoder linear redimensioning -> Decoder outputs
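and a matching stand-in for the decode step, again using torch.nn.MultiheadAttention with made-up sizes rather than the real PerceiverLayer:

  import torch

  d_latents, d_model, num_latents, seq_len = 256, 768, 64, 2048
  decoder_query = torch.nn.Parameter(torch.randn(seq_len, d_model))   # trainable output positions
  decode_attn = torch.nn.MultiheadAttention(d_model, num_heads=8,
                                            kdim=d_latents, vdim=d_latents, batch_first=True)
  final_layer = torch.nn.Linear(d_model, d_latents)

  encoded = torch.randn(1, num_latents, d_latents)                    # z from the encoder
  decoded, _ = decode_attn(decoder_query.expand(1, -1, -1), encoded, encoded)  # one row per output position
  decoder_outputs = final_layer(decoded)                              # shape (1, seq_len, d_latents)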
Figuring out Step 4: Embedding Decoding

  logits = self.embedding_decoder(outputs, embedding_layer=perceiver.input_preprocessor.embeddings)

The .embeddings property of PerceiverTextPreprocessor (the input_preprocessor) is the matrix converting tokens or bytes to embedding vectors. Without the position data.

  # PerceiverEmbeddingDecoder.forward:
  def forward(self, hidden_states, embedding_layer):
      batch_size, seq_len, d_model = hidden_states.shape
      # Flatten batch dim
      output = torch.matmul(hidden_states.reshape([-1, d_model]), embedding_layer.weight.T)
      output = output + self.bias
      return output.reshape([batch_size, seq_len, self.vocab_size])

Basically, the embedding decoder multiplies its input by the transpose of the embeddings and adds a trainable bias.

I'm curious about how transposing the embedding weights undoes their indexing property, but it hopefully follows a mathematical meaning of log probability and matrix multiplication.

Step 4: Postprocessing

  Decoder outputs -> EmbeddingDecoder -> Matrix of log probability vectors for each possible output token
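a tiny check of what that transpose does: each logit is just the dot product of the hidden state with one row of the embedding matrix, so tokens whose embeddings point the same way as the hidden state get high scores (weight tying):

  import torch

  embed = torch.nn.Embedding(5, 3)               # tiny vocab of 5 tokens, embedding size 3
  h = torch.randn(1, 3)                          # a decoded hidden state vector
  logits = h @ embed.weight.T                    # shape (1, 5): one score per vocab entry
  # each logit is the dot product of h with that token's embedding row:
  assert torch.allclose(logits[0, 2], (h[0] * embed.weight[2]).sum())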
Draft summary of PerceiverForMaskedLM:

1. PerceiverTextPreprocessor: Inputs -> Embeddings
2A. PerceiverEncoder: Embeddings + Weights + Attention mask -> PerceiverAttention(is_cross_attention=True) -> Hidden states
2B. PerceiverEncoder: Hidden states + Attention mask -> layers of PerceiverAttention(is_cross_attention=False) -> Hidden states
3. PerceiverBasicDecoder: Hidden states + Weights + Attention mask -> PerceiverAttention(is_cross_attention=True) -> Decoded embeddings
   There's an additional Head mask that may be usable to alter properties of the model on a per-layer basis.
4. PerceiverEmbeddingDecoder: Decoded embeddings -> Log probabilities

This is a helpful summary of perceiver models after looking through the code, to have an idea where to go to engage parts further.
So, if we were to train a new perceiver model, we'd need some idea of what configuration parameters to provide. How big to make all the different dimensions. I also noted there were some hardcoded parameters in there, likely taken from an initial paper or such. Let's collect this stuff together.

1. PerceiverTextPreprocessor
  vocab_size -> # token embeddings
  max_position_embeddings -> # position embeddings
  d_model -> size of each embedding [d_model -> num_channels property]

2A. PerceiverEncoder Cross Attention
  qk_channels -> qk_channels
  v_channels -> v_channels
  num_cross_attention_heads -> num_heads
  d_latents -> q_dim
  d_model -> kv_dim
  cross_attention_widening_factor -> widening_factor
  use_query_residual -> use_query_residual

2B. PerceiverEncoder Self Attention
  qk_channels -> qk_channels
  v_channels -> v_channels
  num_self_attention_heads -> num_heads
  d_latents -> q_dim
  d_latents -> kv_dim
  self_attention_widening_factor -> widening_factor

3. PerceiverBasicDecoder
  8 * 32 -> qk_channels
  d_model -> v_channels
  8 -> num_heads
  d_model -> num_channels -> q_dim
  d_latents -> kv_dim
  1 [default] -> widening_factor
  False -> use_query_residual
  max_position_embeddings -> output_index_dims
  d_latents -> output_num_channels -> size of output if rescaling

4. PerceiverEmbeddingDecoder
  vocab_size -> vocab_size

The hardcoded values for PerceiverBasicDecoder are set in the PerceiverForMaskedLM class. It seems it would likely not be too hard to make them configurable.

Okay, so what are all the configuration values and what do they do?

## vocab_size ##
- length of the pre- and post-processor embedding matrices
- has no impact inside the model
- specifies how many different values it can represent in its input and output

## max_position_embeddings ##
- width of the position embedding matrix
- uncertain of the use inside the decoder
- has no impact inside the model
- this may be the number of values that can be input (or output?) while still including information on their position relative to each other.

## d_model ##
- size of each embedding
- kv_dim of the encoding cross attention
- v_channels and q_dim of the decoding cross attention [qk_channels and num_heads are fixed]

## qk_channels and v_channels ##
- qk_channels and v_channels of all encoding attentions
- qk_channels of decoding is fixed to 8*32 [v_channels comes from d_model]

## num_*_heads ##
- num_heads for encoding cross or self attentions
- num_heads for decoding is fixed to 8

## d_latents ##
- q_dim of encoding cross attention [kv_dim comes from d_model]
- q_dim and kv_dim of encoding self attentions
- kv_dim and output_num_channels of decoder [q_dim comes from d_model]

## *_attention_widening_factor ##
- widening factor for encoding cross or self attentions
- widening factor for decoding is fixed to 1

## use_query_residual ##
- passed to encoding cross attention
- fixed to True for encoding self attention
- fixed to False for decoding cross attention
## num_self_attends_per_block ##
- number of self attention layers

the configuration object is transformers.models.perceiver.configuration_perceiver.PerceiverConfig, and it briefly documents the parameters (and likely a few more). configs deriving from PerceiverConfig can simply set attributes to add more config values.

the perceiver classes are aliased into transformers at the top level:

  import transformers
  config = transformers.PerceiverConfig()
  model = transformers.PerceiverModel(config)
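a minimal sketch of setting the parameters discussed above explicitly (the sizes here are just illustrative, not recommendations):

  import transformers

  config = transformers.PerceiverConfig(
      vocab_size=262,                   # pre-/post-processor embedding rows (byte vocab)
      max_position_embeddings=2048,     # number of position embeddings
      d_model=768,                      # embedding size / encoder kv_dim
      d_latents=1280,                   # latent width used by the self attention stack
      num_latents=256,                  # number of latent vectors
      num_self_attends_per_block=26,    # depth of the self attention stack
      num_cross_attention_heads=8,
      num_self_attention_heads=8,
  )
  model = transformers.PerceiverForMaskedLM(config)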
i'm drafting an attempt to use perceiver. i thought i'd try to convert between numbers and the words for them.

the next step is, i need to figure out how to get the ends of the model to match that kind of data. that means comprehending the configs, preprocessor, and postprocessor/decoder, again. i'm thinking it would be nice to put processors on the ends of the model such that it handles either a single number, or a sequence of characters.

but maybe it would make sense for me to start with a simpler task: like a single number to another single number, maybe simplified further by having the numbers be full embedding vectors with 0s for their extra elements.
draft 2 of perceiver_play; it doesn't do anything yet (maybe a bug) but pretends it will. probably needs more data. next interesting parts of the architecture to review are multimodal setups and/or inside attention
this perceiver_play.py actually starts training: https://ipfs.io/ipfs/bafkreibseocnytskgmuyfelgsoeyipvg44w3rcnhhxmvkkhvur3o33...

note: you can upload things to ipfs without running a daemon using `npm install -g nftp` and signing in at https://nft.storage/

it crashes torch on my raspberry pi (need to rebuild torch or patch more operations in mempickle). it also crashed colab somehow, maybe a strange coincidence. hope to test something like it eventually.

the training of the pyc2py model in the other thread is halted for now, because colab won't let me connect right now.
https://docs.ipfs.io/how-to/mint-nfts-with-ipfs/ https://www.pinata.cloud/
changed model parameters to prevent crashing: https://ipfs.io/ipfs/bafkreifcigd3xjdjchszlggambxvlsprumwqi5kzxtm5jaue7gfkfx...

i reduced them by a lot, but i kind of suspect it's still swapping out while running on my raspberry pi. these models are intended to be used on huge machines. perceiver sets things up so you can shrink the data while processing, but it's still probably not intended to be run on a raspberry pi.
trains with reducing loss, even on the raspberry pi, but the outputs don't differentiate yet due to some problem. every output byte is the same value: https://ipfs.io/ipfs/bafkreig34njqveooj4yrx3qpyttrvimxyibvtysqegy73yi2e4vfej...
i looked at the perceiver running a bit. i tried reducing the learning rate, which is way too high. i'm not sure why the characters trend to all the same values (even at the start). reviewing the data flow might be helpful, but it might take a few reviews through different parts to eventually find the issue. experience in transformer models would likely help.

my indication that something is wired up wrongly is that the values trend toward 0 early, whereas if the attention mask were being respected there shouldn't be a pull toward 0, because most of the 0s are masked out.
this version has all the outputs fixed to a single value: https://bafkreic6ulo7ahnblyxdrklw7lcshmaob4yfawznnprrdis4nop7lw6nxu.ipfs.dwe...

it's quite clear that the model isn't differentiating between output positions. maybe this has to do with the decoder, which included output position encodings. the problem still seemed to occur when i used the [huggingface language model class rather than my own], which could mean i am using it wrongly. it could make sense to compare the decoder with google's original language model decoder, and see how they use it: https://github.com/deepmind/deepmind-research/tree/master/perceiver
I wrote up some notes about reviewing google's masked language modeling implementation, but lost the writing during a spasm, despite saving drafts. Anyway, I think the most useful approach will be to port google's example pretrained model to pytorch, and see how it runs. I glanced at huggingface's test for this use case, and the test doesn't actually verify the output data yet, so they could have a bug. Google's example model is at https://storage.googleapis.com/perceiver_io/language_perceiver_io_bytes.pick... I guess I'll build haiku for arm64, since the pickle uses haiku objects.
pip3 install git+https://github.com/deepmind/dm-haiku
import pickle; params = pickle.load(open('language_perceiver_io_bytes.pickle','rb')) params.keys()
the haiku weights look very similar to pytorch weights: a dictionary, where each key is a path to a tensor in the model. they likely map 1-to-1 if the architectures are the same.

I typed a bunch more but lost it in more spasms. The above text came back when I reopened the last closed tab.

Parameter name guesses from reviewing the apply_perceiver function that defines the model in masked_language_modelling.ipynb, using pip3 install jupytext:

huggingface's TextPreprocessor: 'embed' and 'trainable_position_encoding'
PerceiverEncoder: perceiver_encoder
BasicDecoder: basic_decoder
EmbeddingDecoder: embedding_decoder

so 5 objects at the root level that map to huggingface objects in order. i'll try to instantiate a huggingface model with the same configuration and compare the names.

unclear configuration values, unsure how to map:
  encoder.z_index_dim = 256
  encoder.num_z_channels = d_latents

config values needing setting from the transformers.PerceiverConfig() defaults:
  config.qk_channels = 8 * 32
  config.v_channels = config.d_latents

in google's example, the decoder is configurable:
  output_num_channels = config.d_latents
  position_encoding_type = 'trainable'
  output_index_dims = config.max_position_embeddings
  num_z_channels = config.d_latents
  qk_channels = 8 * 32
  v_channels = config.d_model
  num_heads = 8
  final_project = False
  trainable_position_encoding_kwargs = {num_channels: config.d_model}

huggingface likely copied google's code, so i may be able to just instantiate a model and have it look like google's example already
i wrote down some of the weight names to help me think. the haiku weights are in a nested structure and are only named based on their neural network module type. so matching them will mean reviewing more than their names, maybe their order of construction and use in google's source compared to the huggingface source.

  def haiku2torch(haiku_params):
      haiku_params = {**haiku_params}
      state_dict = {}
      state_dict['perceiver.input_preprocessor.embeddings.weight'] = haiku_params.pop('embed')
      state_dict['perceiver.input_preprocessor.position_embeddings.weight'] = haiku_params.pop('trainable_position_encoding')

      haiku_params['perceiver_encoder/~/cross_attention/attention/linear']['w'] ?

      state_dict['perceiver.encoder.cross_attention.attention.self.layernorm1.weight']
      state_dict['perceiver.encoder.cross_attention.attention.self.layernorm1.bias']
      state_dict['perceiver.encoder.cross_attention.attention.self.layernorm2.weight']
      state_dict['perceiver.encoder.cross_attention.attention.self.layernorm2.bias']
      state_dict['perceiver.encoder.cross_attention.attention.self.query.weight']
      state_dict['perceiver.encoder.cross_attention.attention.self.query.bias']
      state_dict['perceiver.encoder.cross_attention.attention.self.key.weight']
      state_dict['perceiver.encoder.cross_attention.attention.self.key.bias']
      state_dict['perceiver.encoder.cross_attention.attention.self.value.weight']
      state_dict['perceiver.encoder.cross_attention.attention.self.value.bias']
      state_dict['perceiver.encoder.cross_attention.attention.output.dense.weight']
      state_dict['perceiver.encoder.cross_attention.attention.output.dense.bias']
      state_dict['perceiver.encoder.cross_attention.attention.layernorm.weight']
      state_dict['perceiver.encoder.cross_attention.attention.layernorm.bias']
      state_dict['perceiver.encoder.cross_attention.attention.mlp.dense1.weight']
      state_dict['perceiver.encoder.cross_attention.attention.mlp.dense1.bias']
      state_dict['perceiver.encoder.cross_attention.attention.mlp.dense2.weight']
      state_dict['perceiver.encoder.cross_attention.attention.mlp.dense2.bias']

      state_dict['perceiver.embeddings.latents'] ?
- there's already a function to convert these models in the perceiver transformers subfolder
- once you get the tokenizer right (byte + 6), torch runs google's example fine
- there's some code in that file that sets two additional model parameters i wasn't setting (config.qk_channels = 8 * 32, config.v_channels = config.d_latents)
- i eventually went to google's original paper and looked at their training parameters for masked language modeling. their learning rate was 0.00125. mine was 0.0001.
- when i set my learning rate to 0.001, with the config change, the model now learns a constant output somewhat quickly. i think i also changed from the SGD optimizer to Adam. the paper used LAMB with a learning rate warmup, cosine cycle, and weight decay.
- i think i ended up trying with a large batch size (256 or 512) and small config parameters (depth=3, ~width=256), and the model got stuck around in the 2's, then suddenly burst up to 4, dropped down to 1.3, and was outputting numbers of roughly the right length with often the right first digit, but wouldn't proceed further. i didn't note these parameters and couldn't reproduce it.
- after fiddling with things a bit on colab's gpu, this is the first set of parameters i found to solve the problem, around step 2000: https://bafkreie7gyyy3alribjyl72hlm4pk4allyul7xem7yqmpl66yzcidumfnq.ipfs.dwe... i think it could do it faster.
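a minimal sketch of the config and optimizer changes described above (names and values as i described them, not a vetted recipe):

  import torch, transformers

  config = transformers.PerceiverConfig()
  config.qk_channels = 8 * 32               # the two extra settings the conversion script applies
  config.v_channels = config.d_latents
  model = transformers.PerceiverForMaskedLM(config)

  optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # instead of SGD at 0.0001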
regarding picking model size, i'm vaguely considering starting with small models, and then duplicating the data and doubling the size, putting the old data at the start. the assumption is that small models find rough information that could be useful as input to larger models, and that the doubled width will provide for new information to still progress down the model. could make sense to test that approach with this number problem. still not sure how to do learning rate effectively; i kind of want to train an optimizer out of a perceiver to have it handle the problem.
https://bafkreie7gyyy3alribjyl72hlm4pk4allyul7xem7yqmpl66yzcidumfnq.ipfs.dwe...
I modified this to save the trained model with a line at the end of "model.save_pretrained('words2nums')", which makes a folder and pickles the trained parameters into it, in a torch-specific zip format. I also changed the batch size to 128 when I ran it. I think it was performing well with a much smaller batch size. It went to step 3850 or so to go through all 500k data items. The loss doesn't drop below 0.08 or so - maybe it would perform better with a learning rate that reduced over time - but it gets the answers right, it just doesn't reach 100% certainty.

nftp has stopped working for me on colab. When I `npm install -g nftp`, a dependency error crops up and terminates the script.
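a minimal sketch of reloading that folder later (same path as the save_pretrained line above):

  import transformers

  model = transformers.PerceiverForMaskedLM.from_pretrained('words2nums')
  model.eval()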
i'm working on the matrix permutations inside huggingface perceiver and the major torch implementation of efficient attention. i have two scripts to call them, to step through and map the offsets. it's very hard for me to think about the axis permutations. scripts are attached.

perceiver_loader.py also functions as an interactive model loader for the model generated by the line in the previous email. using it on a trained model, one can see that the training fits many thousands of numbers but still fails on rare numbers, especially small numbers, which only have so many examples in the data. it also fails if the data input format changes, such as adding the word 'and' or a hyphen.
these are my incomplete model permutation notes, for inside the attention implementations. each axis is labeled with an einsum letter.

chunked:
  queries: ...qhd
  keys:    ...khd
  values:  ...vhd
  mask:    ...hqk -> ...qhk
  scores:  ...qhk

unchunked:
  queries: .hqd
  keys:    .hkd
  values:  .hvd
  mask:    .
  scores:  .
aaand ... the notes regard commit a021666abab736b7d98cd3d74712601bcf3aedf4 of https://github.com/xloem/transformers in the memory-efficient-attention branch, commit message 'wip'
the data comes out right up until it's consolidated at the end of the softmax.

i stepped through it carefully, and it turns out the attention values are being generated in a truncated manner. there are only 20 in the efficient_attention code, whereas there are 96 in the working code. so, i still got something wrong.

i'm guessing my test passed more easily because it had the same feature size for all of queries, keys, and values. that is not true in the perceiver_loader test i'm pursuing; i think it looks as if the values have a feature size of 20 whereas the keys have a feature size of 96. gotta review again to get that making 96 attention scores instead of 20, I guess. unsure.

here are the notes with the einsum letters fleshed out, unchecked:

chunked:
  queries: ...qhd
  keys:    ...khd
  values:  ...vhd
  scores:  ...qhk
  mask:    ...hqk -> ...qhk

unchunked:
  queries: .hqd
  keys:    .hkd
  values:  .hvd
  scores:  .hqk -> 1,8,256,96
  mask:    .hqk -> 1,1,1,96 -> needs extension to num_heads, num_queries

commit eb16dc63d2c617bfe708881f8bd5ba96be8b9f50 (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 11:43:45 2022 +0000

    wip efficient attention: dimensions pass but data is truncated
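a minimal sketch of what those letters mean in plain torch.einsum, using made-up sizes that match the shapes noted above (this follows the "chunked" ...qhd layout, not either library's actual code):

  import torch

  batch, heads, n_q, n_k, d_head = 1, 8, 256, 96, 32       # made-up sizes
  queries = torch.randn(batch, n_q, heads, d_head)          # ...qhd
  keys    = torch.randn(batch, n_k, heads, d_head)          # ...khd
  values  = torch.randn(batch, n_k, heads, d_head)          # ...vhd (v runs over the same axis as k)

  scores = torch.einsum('bqhd,bkhd->bqhk', queries, keys)   # ...qhk
  probs = torch.softmax(scores / d_head ** 0.5, dim=-1)
  out = torch.einsum('bqhk,bkhd->bqhd', probs, values)      # back to ...qhd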
i'm guessing the truncation happens in the chunking code. maybe i've stumbled upon a permutation of the tensors that produces the right results but with the wrong variable names.
the big issue wasn't truncation; it was that i had put the wrong block of code in an if/else condition.

the current challenge is that the efficient attention implementation doesn't provide for applying dropout (random zeroing of some weights during training) where the perceiver model applies it. i fudged something in, untested.

commit 7628f3e4f32ac25b11774d939f2e16a20dd2a8fd (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 12:38:13 2022 +0000

    wip efficient attention: organising separate parts to include dropout and application
this commit finally produces the correct perceiver output without bailing. i might like to do the gpt2 model next, since i ran into a lot of unexpected difficulties here.

commit ab72f4a6a2a9095587b02c262ae1b20801172315 (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 13:14:23 2022 +0000

    handle missing attention mask, add code for head_mask, comment out debugging break
i've forked memory-efficient-attention in an attempt to add a return_weights parameter. i think the torch implementation of this would be simplified by using a for loop rather than a scan function parameterised by a callback.

https://github.com/xloem/memory-efficient-attention/commits/return_weights

Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 14:50:32 2022 +0000

    wip: needs a change so return_weights output is migrated through scan()

the reason for this is that transformers has a return_weights configuration, where the pre-softmax weights of attention passes are returned to the user from the library. supporting that means getting inside attention somehow.

i experience pressure to cover less expanding work. ideas for reducing the steps for this part include:
- simply disabling return_weights in transformers if efficient attention is engaged
- writing a transformers-specific implementation of efficient attention

but i'll probably open an issue in the repository and plan to move forward on a pull request
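to make the scan-vs-for-loop question concrete, here's a toy chunked attention that uses a plain python for loop and optionally collects the weights. this is not the library's code (it only chunks the query axis, whereas the real memory-efficient attention also chunks keys with a streaming softmax); it's just a sketch of the shape of the return_weights problem:

  import torch

  def chunked_attention(q, k, v, chunk=64, return_weights=False):
      # toy: chunk over the query axis with a plain for loop,
      # optionally collecting the attention weights of each chunk
      outputs, weights = [], []
      scale = q.shape[-1] ** -0.5
      for start in range(0, q.shape[-2], chunk):
          q_chunk = q[..., start:start + chunk, :]
          w = torch.softmax(q_chunk @ k.transpose(-1, -2) * scale, dim=-1)
          outputs.append(w @ v)
          if return_weights:
              weights.append(w)
      out = torch.cat(outputs, dim=-2)
      return (out, torch.cat(weights, dim=-2)) if return_weights else out

  q = torch.randn(1, 8, 256, 32); k = torch.randn(1, 8, 96, 32); v = torch.randn(1, 8, 96, 32)
  out, w = chunked_attention(q, k, v, return_weights=True)   # out: (1,8,256,32), w: (1,8,256,96)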
https://github.com/AminRezaei0x443/memory-efficient-attention/issues/1

feat: return_weights #1
xloem opened this issue 36 minutes ago

I'm looking into hacking some of the models in the transformers library to use this library for attention, and I don't see a way to support return_weights yet. This is a flag passed in transformers, where the pre-softmax attention weights are preserved and returned to the user, if it is set.

I looked a little at implementing this in the torch backend, and I note the scan() function provides for only a single tensor return value. It seems to me that the scan() function would be most clearly replaced by a for loop, but it could also be modified to handle tuples, or return_weights could be handled via accessing nonlocal data in some way instead of returning them through the chunk scanner.

commit 63fb607e6e0ed41934fbc1e1e150993b261903eb (HEAD -> return_weights, origin/return_weights)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 15:37:09 2022 +0000

    reimplemented using an output parameter. not passing test yet
commit 899f4a6781537568b9b1b51250e7410c06716e9c
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 16:47:08 2022 +0000

    change dynamic_slice to reference passed data. return_weights test now passes on torch.

commit 8de9d4c305e1763b2d0e90928b68d155ed60426c (HEAD -> return_weights, origin/return_weights)
Author: xloem <0xloem@gmail.com>
Date:   Thu Jan 27 17:07:27 2022 +0000

    draft of a sibling jax implementation for return_weights. test does not pass.
edited the issue text; i think the flag is called 'output_attentions' or something, not 'return_weights':

feat: output_attentions #1

I'm looking into hacking some of the models in the transformers library to use this library for attention, and I don't see a way to support `output_attentions` yet. This is a flag passed in transformers, where the pre-softmax attention weights are preserved and returned to the user, if it is set.

I looked a little at implementing this in the torch backend, and I note the scan() function provides for only a single tensor return value. It seems to me that the scan() function would be most clearly replaced by a for loop, but it could also be modified to handle tuples, or return_weights could be handled via accessing nonlocal data in some way instead of returning them through the chunk scanner.

I'm also not sure how the output would best be passed to the user. I'm thinking it might make the most sense to provide for an optional output parameter, although I don't really know.
ok, the jax implementation will need a little rejiggering because jax arrays are immutable, so passing one as an output parameter does not provide for output. grarpamp's description of a global code repository was really inspiring. i'm wondering where and how to do work on it, like some censorship-resistant dev community. does gitcoin have a p2p interface? but maybe it doesn't matter. maybe we just need to keep trying until it happens.
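a quick illustration of that immutability point (plain jax, nothing from the library):

  import jax.numpy as jnp

  x = jnp.zeros(3)
  # x[0] = 1.0 would raise an error: jax arrays are immutable
  y = x.at[0].set(1.0)    # .at[...] returns a modified copy instead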
both tests passing, now to normalise things a bit. might be a little hard to rewrite the torch scan() shim to match the approach that works with jax.

commit b60bc067e16c717fc6632d862f1de275007aa47e (HEAD -> return_weights, origin/return_weights)
Date:   Fri Jan 28 05:43:46 2022 +0000

    jax return_weights working, commented draft statements left in source, implementations not normalised
I guess I'd better test using this in some way before opening a pull request, which would likely mean code in transformers that uses it. I was thinking of adding it to the gpt-j model instead of gpt2. It's more useful and the attention code actually appears much simpler.

https://github.com/AminRezaei0x443/memory-efficient-attention/compare/main.....

commit faba6371ac7faaa2040a2c26e15ae7ab87f94ce4 (HEAD -> return_weights, origin/return_weights)
Date:   Fri Jan 28 06:37:41 2022 +0000

    mostly normalised return_attentions implementations between backends, tests pass
The appointment I was delaying work on this for was canceled, but my parts are still quite scared to continue. Still, I'm looking at the gptj attention code this morning, with plan to work more. My parts relate that we experience further harm when updates are spread. We want to share badly, but we can experience being hurt when we do. [i find it easier when the sharing is automatic. some of the parts are practicing being strong.]
[this description of 'parts' was inaccurate and misleading. people don't know what we mean, they're not us.]
i hear we sure don't understand each other and that that's one of the most important things there is! i hope we don't get hurt!

this thread is tagged [spam] like pretty much all of my threads, because it [doesn't include what i really want to say] and [does include things i don't want to say]. it's tagged [personal] too of course, cause perceiver is a hobby goal i've held personally.

commit 53b6f54086121aaf6e6a54208ba5a9ff41141e88 (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <0xloem@gmail.com>
Date:   Fri Jan 28 14:00:11 2022 +0000

    drafted use of memory-efficient-attention for gptj. presently crashes when run due to incompleteness