## Figuring out Step 2: Encoding

Here's a copy-paste of this call:

```python
embedding_output = self.embeddings(batch_size=batch_size)

encoder_outputs = self.encoder(
    embedding_output,
    attention_mask=None,
    head_mask=head_mask,
    inputs=inputs,
    inputs_mask=extended_attention_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```

`self.embeddings` takes only the batch size and produces a tensor: it's a `PerceiverEmbeddings`, which is just a matrix of parameters (the latent array) that the training process can mutate.

`self.encoder` is a `PerceiverEncoder`. It's initialised with a `kv_dim` parameter that is taken from the input preprocessor:

```python
self.encoder = PerceiverEncoder(
    config, kv_dim=input_preprocessor.num_channels if input_preprocessor is not None else config.d_model
)
```

Here's `num_channels` in `PerceiverTextPreprocessor`:

```python
@property
def num_channels(self) -> int:
    return self.config.d_model
```

So `kv_dim` is the `d_model` model configuration parameter, i.e. the dimensionality of the preprocessed inputs. Maybe a way to describe how complex the model is.

On to `PerceiverEncoder`:

```
## construction
cross_attention = PerceiverLayer(is_cross_attention=True, qk_channels, v_channels,
                                 num_cross_attention_heads, q_dim=d_latents, kv_dim=d_model,
                                 cross_attention_widening_factor, use_query_residual)
# i'm including the configuration parameters more to see what parts of the
# architecture the configuration relates to
self_attends = list of PerceiverLayer(is_cross_attention=False, qk_channels, v_channels,
                                      num_self_attention_heads, q_dim=d_latents, kv_dim=d_latents,
                                      self_attention_widening_factor)

## forward func
hidden_states = embeddings
layer_outputs = cross_attention(hidden_states, attention_mask, inputs=inputs, inputs_mask=inputs_mask)
for i, layer in enumerate(self_attends):
    layer_outputs = layer(layer_outputs[0], attention_mask, head_mask[i])
return layer_outputs[0]
```

It looks very confusing with all the parameters, but all it's really doing is passing data through a single layer called "cross attention", then through a stack of layers called "self attention", and outputting the result.

I'm interested in the masks. These tell the layers which parts of the computation to skip, for example where input data is unavailable.

- the attention mask applies to every layer
- the inputs mask is passed with the inputs to the cross attention layer
- the head masks are specific to each self attention layer

How were these passed in for masked language modeling?
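Before tracing them through the model internals, it's worth noting where they come from at the call site. Here's a minimal usage sketch, assuming the `deepmind/language-perceiver` checkpoint and the generic tokenizer-to-model calling pattern (the text and padding choice are just illustrative, not something I've verified against the Perceiver docs in detail):

```python
from transformers import PerceiverTokenizer, PerceiverForMaskedLM

# Assumed checkpoint name; the flow below is the usual tokenizer -> model pattern.
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

encoding = tokenizer(
    "This is an incomplete sentence where some words are missing.",
    padding="max_length",   # pad out to the tokenizer's maximum length
    return_tensors="pt",
)

# encoding.attention_mask is 1 for real tokens and 0 for the padding added above;
# this is the attention mask that PerceiverForMaskedLM receives.
outputs = model(
    inputs=encoding.input_ids,
    attention_mask=encoding.attention_mask,
)
logits = outputs.logits
```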
From PerceiverModel above:
```python
attention_mask=None,
head_mask=head_mask,
inputs=inputs,
inputs_mask=extended_attention_mask,
```

`extended_attention_mask` appears to be some transformation of the `attention_mask` that was passed in. These are passed down from `PerceiverForMaskedLM`:

```python
outputs = self.perceiver(
    inputs=inputs,
    attention_mask=attention_mask,
    head_mask=head_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```

There's no inputs mask; there is a head mask and an attention mask. So what do the masks provided to `PerceiverForMaskedLM` do?

- Maybe the attention mask disables columns of the stack of layers, for input positions where data is absent.
- Maybe the head mask does the same on a per-layer basis, as a vector.

(My best guess at the actual mechanics is sketched at the end of this section.)

The reason I'm looking at `PerceiverForMaskedLM` is to consider using Perceiver for translation modeling. This would be much simpler for systems with radically disparate but related input and output tokenisations, like the ones I'm engaging with in the other thread.

T5 has a strange thing where further data is fed in midway through the layer stack. I don't understand it yet, but if that turns out to be important, maybe the `head_mask` in `PerceiverForMaskedLM` could be appropriated to do something similar? I don't know. Maybe the decoder portion, later, is more appropriate.
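As a note to self, here is my best guess at the mechanics of those two masks. This is based on the general transformers conventions rather than anything Perceiver-specific I've read yet, so the shapes and the exact transformation are assumptions: the attention mask would be turned into an additive mask over the input key positions (presumably that's what `extended_attention_mask` is), and the head mask would be a per-layer vector that multiplies attention probabilities to switch whole heads off.

```python
import torch

# Hedged sketch, not Perceiver's actual code: generic transformer mask handling.
batch, num_heads, num_latents, input_len = 2, 8, 256, 10

# attention_mask as produced by a tokenizer: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])  # (batch, input_len)

# Typical "extended" form: broadcastable additive mask, 0.0 where attending is
# allowed, a very large negative number where it is not.
extended_attention_mask = (
    (1.0 - attention_mask[:, None, None, :].float()) * torch.finfo(torch.float32).min
)  # (batch, 1, 1, input_len)

# Cross-attention scores: latents (queries) attend over the input tokens (keys).
scores = torch.randn(batch, num_heads, num_latents, input_len)
scores = scores + extended_attention_mask   # masked key columns become ~ -inf
probs = scores.softmax(dim=-1)              # padded inputs get ~0 attention weight

# head_mask: one 0/1 vector per layer; multiplying the attention probabilities
# by it zeroes out entire heads (here: disable heads 0 and 1 in this layer).
layer_head_mask = torch.tensor([0.0, 0.0] + [1.0] * (num_heads - 2))  # (num_heads,)
probs = probs * layer_head_mask[None, :, None, None]
```

If that's right, then the attention mask only matters in the cross-attention over the inputs (it arrives there as `inputs_mask`), since the latents themselves are never padded; that would explain why the encoder is called with `attention_mask=None` in the snippet above.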