## Figuring out Step 2: Encoding

Here's a copy-paste of this call:

```python
embedding_output = self.embeddings(batch_size=batch_size)

encoder_outputs = self.encoder(
    embedding_output,
    attention_mask=None,
    head_mask=head_mask,
    inputs=inputs,
    inputs_mask=extended_attention_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```

`self.embeddings` takes only the batch size and produces a tensor: it's a `PerceiverEmbeddings`, which is just a matrix of parameters (the latent array) that the training process can mutate.

`self.encoder` is a `PerceiverEncoder`. It's initialised with a `kv_dim` parameter that is taken from the input preprocessor:

```python
self.encoder = PerceiverEncoder(
    config, kv_dim=input_preprocessor.num_channels if input_preprocessor is not None else config.d_model
)
```

Here's `num_channels` in `PerceiverTextPreprocessor`:

```python
@property
def num_channels(self) -> int:
    return self.config.d_model
```

So `kv_dim` is the `d_model` model configuration parameter, i.e. the dimensionality of the preprocessed inputs. Maybe a way to describe how complex the model is.

On to `PerceiverEncoder`:

```
## construction
cross_attention = PerceiverLayer(is_cross_attention=True, qk_channels, v_channels,
                                 num_cross_attention_heads, q_dim=d_latents, kv_dim=d_model,
                                 cross_attention_widening_factor, use_query_residual)
# i'm including the configuration parameters more to see what parts of the
# architecture the configuration relates to
self_attends = list of PerceiverLayer(is_cross_attention=False, qk_channels, v_channels,
                                      num_self_attention_heads, q_dim=d_latents, kv_dim=d_latents,
                                      self_attention_widening_factor)

## forward func
hidden_states = embeddings
layer_outputs = cross_attention(hidden_states, attention_mask, inputs=inputs, inputs_mask=inputs_mask)
for i, layer in enumerate(self_attends):
    layer_outputs = layer(layer_outputs[0], attention_mask, head_mask[i])
return layer_outputs[0]
```

It looks very confusing with all the parameters, but all it's really doing is passing data through a single layer called "cross attention", then through a stack of layers called "self attention", and outputting the result.

I'm interested in the masks. These tell the layers which parts of the computation to skip, for example where input data is unavailable.

- the attention mask applies to every layer
- the inputs mask is passed with the inputs to the cross attention layer
- the head masks are specific to each self attention layer

How were these passed in for masked language modeling?
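Before tracing them through the model internals, it's worth noting where they come from at the call site. Here's a minimal usage sketch, assuming the `deepmind/language-perceiver` checkpoint and the generic tokenizer-to-model calling pattern (the text and padding choice are just illustrative, not something I've verified against the Perceiver docs in detail):

```python
from transformers import PerceiverTokenizer, PerceiverForMaskedLM

# Assumed checkpoint name; the flow below is the usual tokenizer -> model pattern.
tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

encoding = tokenizer(
    "This is an incomplete sentence where some words are missing.",
    padding="max_length",   # pad out to the tokenizer's maximum length
    return_tensors="pt",
)

# encoding.attention_mask is 1 for real tokens and 0 for the padding added above;
# this is the attention mask that PerceiverForMaskedLM receives.
outputs = model(
    inputs=encoding.input_ids,
    attention_mask=encoding.attention_mask,
)
logits = outputs.logits
```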
From PerceiverModel above:
```python
attention_mask=None,
head_mask=head_mask,
inputs=inputs,
inputs_mask=extended_attention_mask,
```

`extended_attention_mask` appears to be some transformation of the `attention_mask` that was passed in. These are passed down from `PerceiverForMaskedLM`:

```python
outputs = self.perceiver(
    inputs=inputs,
    attention_mask=attention_mask,
    head_mask=head_mask,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
```

There's no inputs mask; there is a head mask and an attention mask. So what do the masks provided to `PerceiverForMaskedLM` do?

- Maybe the attention mask disables columns of the stack of layers, for input positions where data is absent.
- Maybe the head mask does the same on a per-layer basis, as a vector.

(My best guess at the actual mechanics is sketched at the end of this section.)

The reason I'm looking at `PerceiverForMaskedLM` is to consider using Perceiver for translation modeling. This would be much simpler for systems with radically disparate but related input and output tokenisations, like the ones I'm engaging with in the other thread.

T5 has a strange thing where further data is fed in midway through the layer stack. I don't understand it yet, but if that turns out to be important, maybe the `head_mask` in `PerceiverForMaskedLM` could be appropriated to do something similar? I don't know. Maybe the decoder portion, later, is more appropriate.
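As a note to self, here is my best guess at the mechanics of those two masks. This is based on the general transformers conventions rather than anything Perceiver-specific I've read yet, so the shapes and the exact transformation are assumptions: the attention mask would be turned into an additive mask over the input key positions (presumably that's what `extended_attention_mask` is), and the head mask would be a per-layer vector that multiplies attention probabilities to switch whole heads off.

```python
import torch

# Hedged sketch, not Perceiver's actual code: generic transformer mask handling.
batch, num_heads, num_latents, input_len = 2, 8, 256, 10

# attention_mask as produced by a tokenizer: 1 = real token, 0 = padding.
attention_mask = torch.tensor([[1] * 10, [1] * 6 + [0] * 4])  # (batch, input_len)

# Typical "extended" form: broadcastable additive mask, 0.0 where attending is
# allowed, a very large negative number where it is not.
extended_attention_mask = (
    (1.0 - attention_mask[:, None, None, :].float()) * torch.finfo(torch.float32).min
)  # (batch, 1, 1, input_len)

# Cross-attention scores: latents (queries) attend over the input tokens (keys).
scores = torch.randn(batch, num_heads, num_latents, input_len)
scores = scores + extended_attention_mask   # masked key columns become ~ -inf
probs = scores.softmax(dim=-1)              # padded inputs get ~0 attention weight

# head_mask: one 0/1 vector per layer; multiplying the attention probabilities
# by it zeroes out entire heads (here: disable heads 0 and 1 in this layer).
layer_head_mask = torch.tensor([0.0, 0.0] + [1.0] * (num_heads - 2))  # (num_heads,)
probs = probs * layer_head_mask[None, :, None, None]
```

If that's right, then the attention mask only matters in the cross-attention over the inputs (it arrives there as `inputs_mask`), since the latents themselves are never padded; that would explain why the encoder is called with `attention_mask=None` in the snippet above.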