[spam] [personal] perceiver model notes

k gmkarl at gmail.com
Tue Jan 18 08:38:23 PST 2022


So, if we were to train a new perceiver model, we'd need some idea of
what configuration parameters to provide: in particular, how big to
make all the different dimensions.

I also noted that there are some hardcoded parameters in there, likely
taken from the original paper or the like.

Let's collect this stuff together.

1. PerceiverTextPreprocessor
vocab_size -> # token embeddings
max_position_embeddings -> # position embeddings
d_model -> size of each embedding
[d_model -> num_channels property]
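
As a sanity check on that mapping, here is a minimal sketch of what I
understand the text preprocessor to be doing (a simplified stand-in of
mine, not the actual huggingface code): token embeddings plus trainable
position embeddings, both d_model wide.

    import torch
    import torch.nn as nn

    class TextPreprocessorSketch(nn.Module):
        """Simplified stand-in for PerceiverTextPreprocessor (my reading of it)."""
        def __init__(self, vocab_size, max_position_embeddings, d_model):
            super().__init__()
            # vocab_size rows of token embeddings, each d_model wide
            self.token_embeddings = nn.Embedding(vocab_size, d_model)
            # max_position_embeddings rows of position embeddings, each d_model wide
            self.position_embeddings = nn.Embedding(max_position_embeddings, d_model)

        def forward(self, input_ids):
            positions = torch.arange(input_ids.shape[1], device=input_ids.device)
            # output width is d_model, which is what the num_channels property reports
            return self.token_embeddings(input_ids) + self.position_embeddings(positions)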

2A. PerceiverEncoder Cross Attention
qk_channels -> qk_channels
v_channels -> v_channels
num_cross_attention_heads -> num_heads
d_latents -> q_dim
d_model -> kv_dim
cross_attention_widening_factor -> widening_factor
use_query_residual -> use_query_residual

2B. PerceiverEncoder Self Attention
qk_channels -> qk_channels
v_channels -> v_channels
num_self_attention_heads -> num_heads
d_latents -> q_dim
d_latents -> kv_dim
self_attention_widening_factor -> widening_factor
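
To keep the dimension names in 2A and 2B straight, here is a rough
sketch (mine, not the huggingface implementation) of how q_dim, kv_dim,
qk_channels, v_channels and num_heads relate.  The encoder self
attention is the same shape of module with q_dim == kv_dim == d_latents.

    import torch
    import torch.nn as nn

    class CrossAttentionSketch(nn.Module):
        """Rough sketch of how the dimension names relate; hedged, not the HF code."""
        def __init__(self, q_dim, kv_dim, qk_channels, v_channels, num_heads):
            super().__init__()
            assert qk_channels % num_heads == 0 and v_channels % num_heads == 0
            self.num_heads = num_heads
            self.qk_head = qk_channels // num_heads
            self.v_head = v_channels // num_heads
            # queries come from the q_dim stream, keys/values from the kv_dim stream
            self.to_q = nn.Linear(q_dim, qk_channels)
            self.to_k = nn.Linear(kv_dim, qk_channels)
            self.to_v = nn.Linear(kv_dim, v_channels)
            # project back to the query width so a residual connection lines up
            self.to_out = nn.Linear(v_channels, q_dim)

        def forward(self, q_in, kv_in):
            b = q_in.shape[0]
            split = lambda x, d: x.view(b, -1, self.num_heads, d).transpose(1, 2)
            q = split(self.to_q(q_in), self.qk_head)
            k = split(self.to_k(kv_in), self.qk_head)
            v = split(self.to_v(kv_in), self.v_head)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.qk_head ** 0.5, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(b, -1, self.num_heads * self.v_head)
            return self.to_out(out)

    # 2A encoder cross attention: q_dim=d_latents, kv_dim=d_model
    # 2B encoder self attention:  q_dim=d_latents, kv_dim=d_latents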

3. PerceiverBasicDecoder
8 * 32 -> qk_channels
d_model -> v_channels
8 -> num_heads
d_model -> num_channels -> q_dim
d_latents -> kv_dim
1 [default] -> widening_factor -> widening_factor
False -> use_query_residual
max_position_embeddings -> output_index_dims
d_latents -> output_num_channels -> size of output if rescaling

4. PerceiverEmbeddingDecoder
vocab_size -> vocab_size
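
As far as I can tell the embedding decoder just maps the decoder output
back onto vocabulary logits with an embedding matrix (shared with the
preprocessor, I believe) plus a bias.  A hedged sketch of how I read it:

    import torch
    import torch.nn as nn

    class EmbeddingDecoderSketch(nn.Module):
        """My reading of PerceiverEmbeddingDecoder: hidden states -> vocab logits."""
        def __init__(self, vocab_size):
            super().__init__()
            self.bias = nn.Parameter(torch.zeros(vocab_size))

        def forward(self, hidden_states, embedding_weight):
            # embedding_weight: (vocab_size, d_model); hidden_states: (..., d_model)
            return hidden_states @ embedding_weight.T + self.bias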

The hardcoded values for PerceiverBasicDecoder are set in the
PerceiverForMaskedLM class.  It seems it would likely not be too hard
to make them configurable.
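
One quick way to check those hardcoded values is just to instantiate a
model and print the decoder (the .perceiver.decoder attribute path is
my assumption about the huggingface internals, worth double-checking):

    from transformers import PerceiverConfig, PerceiverForMaskedLM

    config = PerceiverConfig()              # library defaults
    model = PerceiverForMaskedLM(config)

    # Shows the decoder's cross attention and projection shapes, which should
    # reflect the 8 * 32 qk_channels, 8 heads, etc. noted above.
    print(model.perceiver.decoder)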

Okay, so what are all the configuration values and what do they do?

## vocab_size ##

- length of the pre- and post-processor embedding matrices

- has no impact inside the model
- specifies how many different values it can represent in its input and output

## max_position_embeddings ##

- length of the position embedding matrix
- uncertain of the use inside the decoder

- has no impact inside the model
- this may be the maximum number of tokens that can be input (or
output?) while still carrying information about their positions
relative to each other.
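
My guess at the decoder use (an assumption, not something I've verified
in the code): the decoder's queries are trainable position encodings,
one per output slot, so max_position_embeddings would bound how many
output positions can be decoded.

    import torch
    import torch.nn as nn

    max_position_embeddings, d_model = 2048, 768   # illustrative sizes

    # Hedged sketch: one trainable query vector per decodable output position.
    decoder_queries = nn.Parameter(torch.randn(max_position_embeddings, d_model))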

## d_model ##

- size of each embedding
- kv_dim of the encoding cross attention
- v_channels and q_dim of the decoding cross attention [qk_channels
and num_heads are fixed]

## qk_channels and v_channels ##

- qk_channels and v_channels of all encoding attentions
- qk_channels of decoding is fixed to 8*32 [v_channels comes from d_model]
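
My reading of what happens when these are left unset (worth
double-checking against modeling_perceiver.py): qk_channels falls back
to the query dimension, v_channels falls back to qk_channels, and both
have to divide evenly across the heads.

    # Hedged sketch of the fallback logic as I understand it, not a verbatim copy:
    def resolve_channels(qk_channels, v_channels, q_dim, num_heads):
        qk_channels = qk_channels if qk_channels is not None else q_dim
        v_channels = v_channels if v_channels is not None else qk_channels
        assert qk_channels % num_heads == 0 and v_channels % num_heads == 0
        return qk_channels // num_heads, v_channels // num_heads   # per-head widths

    print(resolve_channels(None, None, q_dim=1280, num_heads=8))   # -> (160, 160)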

## num_*_heads ##

- num_heads for encoding cross or self attentions
- num_heads for decoding is fixed to 8

## d_latents ##

- q_dim of encoding cross attention [kv_dim comes from d_model]
- q_dim and kv_dim of encoding self attentions
- kv_dim and output_num_channels of decoder [q_dim comes from d_model]
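
One thing these notes don't cover: the stream whose q_dim is d_latents
is the latent array itself, and its length is a separate config value
(num_latents, which I haven't traced through above).  Roughly:

    import torch
    import torch.nn as nn

    num_latents, d_latents = 256, 1280   # illustrative sizes

    # The encoder repeatedly attends from this (num_latents, d_latents) array,
    # which is why d_latents shows up as the q_dim on the latent side.
    latents = nn.Parameter(torch.randn(num_latents, d_latents))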

## *_attention_widening_factor ##

- widening factor for encoding cross or self attentions
- widening factor for decoding is fixed to 1
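
As I understand it the widening factor just sets how wide the
feed-forward block after each attention is, relative to its input; a
minimal sketch:

    import torch.nn as nn

    def mlp_sketch(num_channels, widening_factor):
        # hidden width = widening_factor * num_channels, then back down
        # (the activation choice here is illustrative, not checked against the code)
        return nn.Sequential(
            nn.Linear(num_channels, widening_factor * num_channels),
            nn.GELU(),
            nn.Linear(widening_factor * num_channels, num_channels),
        )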

## use_query_residual ##

- passed to encoding cross attention
- fixed to True for encoding self attention
- fixed to False for decoding cross attention
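
In other words (my paraphrase of the flag): it decides whether the
queries get added back onto the attention output before the MLP.

    # Hedged paraphrase of the query-residual behaviour:
    def apply_query_residual(attention_output, queries, use_query_residual):
        return attention_output + queries if use_query_residual else attention_output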


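Putting it together, this is how I'd expect these values to be set when
configuring a fresh model.  The keyword names should all exist in
PerceiverConfig as far as I know; the particular numbers are only
illustrative, not recommendations.

    from transformers import PerceiverConfig, PerceiverForMaskedLM

    config = PerceiverConfig(
        vocab_size=262,                      # pre-/post-processor embedding rows
        max_position_embeddings=2048,        # position embeddings / decoder output slots
        d_model=768,                         # embedding width, encoder kv_dim, decoder q_dim
        d_latents=1280,                      # latent width: encoder q_dim, decoder kv_dim
        qk_channels=None,                    # unset -> falls back (my reading) to q_dim
        v_channels=None,                     # unset -> falls back (my reading) to qk_channels
        num_cross_attention_heads=8,
        num_self_attention_heads=8,
        cross_attention_widening_factor=1,
        self_attention_widening_factor=1,
        use_query_residual=True,             # only reaches the encoder cross attention
    )

    model = PerceiverForMaskedLM(config)
    print(sum(p.numel() for p in model.parameters()))   # rough idea of model size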