So, if we were to train a new Perceiver model, we'd need some idea of what configuration parameters to provide: how big to make all the different dimensions. I also noted there were some hardcoded parameters in there, likely taken from the original paper or similar. Let's collect this together.

1. PerceiverTextPreprocessor
   - vocab_size -> # token embeddings
   - max_position_embeddings -> # position embeddings
   - d_model -> size of each embedding [d_model -> num_channels property]

2A. PerceiverEncoder Cross Attention
   - qk_channels -> qk_channels
   - v_channels -> v_channels
   - num_cross_attention_heads -> num_heads
   - d_latents -> q_dim
   - d_model -> kv_dim
   - cross_attention_widening_factor -> widening_factor
   - use_query_residual -> use_query_residual

2B. PerceiverEncoder Self Attention
   - qk_channels -> qk_channels
   - v_channels -> v_channels
   - num_self_attention_heads -> num_heads
   - d_latents -> q_dim
   - d_latents -> kv_dim
   - self_attention_widening_factor -> widening_factor

3. PerceiverBasicDecoder
   - 8 * 32 -> qk_channels
   - d_model -> v_channels
   - 8 -> num_heads
   - d_model -> num_channels -> q_dim
   - d_latents -> kv_dim
   - 1 [default] -> widening_factor
   - False -> use_query_residual
   - max_position_embeddings -> output_index_dims
   - d_latents -> output_num_channels -> size of output if rescaling

4. PerceiverEmbeddingDecoder
   - vocab_size -> vocab_size

The hardcoded values for PerceiverBasicDecoder are set in the PerceiverForMaskedLM class. It likely wouldn't be too hard to make them configurable (see the sketch at the end of this section).

Okay, so what are all the configuration values and what do they do?

## vocab_size ##

- length of the pre- and post-processor embedding matrices
- has no impact inside the model
- specifies how many different values the model can represent in its input and output

## max_position_embeddings ##

- width of the position embedding matrix
- uncertain of its use inside the decoder
- has no impact inside the model
- this may be the number of values that can be input (or output?) while still including information on their position relative to each other

## d_model ##

- size of each embedding
- kv_dim of the encoding cross attention
- v_channels and q_dim of the decoding cross attention [qk_channels and num_heads are fixed]

## qk_channels and v_channels ##

- qk_channels and v_channels of all encoding attentions
- qk_channels of the decoding attention is fixed to 8 * 32 [v_channels comes from d_model]

## num_*_heads ##

- num_heads for the encoding cross or self attentions
- num_heads for the decoding attention is fixed to 8

## d_latents ##

- q_dim of the encoding cross attention [kv_dim comes from d_model]
- q_dim and kv_dim of the encoding self attentions
- kv_dim and output_num_channels of the decoder [q_dim comes from d_model]

## *_attention_widening_factor ##

- widening factor for the encoding cross or self attentions
- widening factor for the decoding attention is fixed to 1

## use_query_residual ##

- passed to the encoding cross attention
- fixed to True for the encoding self attentions
- fixed to False for the decoding cross attention
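Putting those values together, here's a minimal sketch of building a small, untrained PerceiverForMaskedLM straight from a PerceiverConfig, just to watch the dimensions flow through. I'm assuming the Hugging Face transformers API here (PerceiverConfig, PerceiverForMaskedLM, and its `inputs` forward argument); the dimension values are arbitrary small numbers for illustration, not recommended settings, and a few config fields not covered above (num_latents, num_blocks, num_self_attends_per_block) are included because the model needs them.

```python
# A minimal sketch: wiring the configuration values above into a fresh,
# untrained PerceiverForMaskedLM. Dimension values are arbitrary small
# numbers for illustration only.
import torch
from transformers import PerceiverConfig, PerceiverForMaskedLM

config = PerceiverConfig(
    vocab_size=262,                    # length of pre-/post-processor embedding matrices
    max_position_embeddings=128,       # width of the position embedding matrix; decoder output_index_dims
    d_model=64,                        # embedding size; kv_dim of the encoding cross attention
    d_latents=96,                      # q_dim of encoding cross attention; q_dim/kv_dim of self attention
    num_latents=32,                    # number of latent vectors (not covered in the list above)
    num_blocks=1,
    num_self_attends_per_block=2,
    num_cross_attention_heads=4,
    num_self_attention_heads=4,
    qk_channels=None,                  # None -> derived from the relevant q/kv dims
    v_channels=None,
    cross_attention_widening_factor=1,
    self_attention_widening_factor=1,
    use_query_residual=True,
)

model = PerceiverForMaskedLM(config)

# Dummy batch of token ids, padded/cropped to max_position_embeddings.
inputs = torch.randint(0, config.vocab_size, (1, config.max_position_embeddings))
with torch.no_grad():
    outputs = model(inputs=inputs)

# Expect logits over the vocabulary for each output position:
# (batch, max_position_embeddings, vocab_size)
print(outputs.logits.shape)
```

The logits should come out as (batch, max_position_embeddings, vocab_size): the decoder queries come from max_position_embeddings trainable position encodings, and the embedding decoder maps each output vector back onto the vocabulary.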
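And on the earlier point about the hardcoded decoder values: this is the sketch referenced above, assembling roughly the same pieces PerceiverForMaskedLM builds internally but with qk_channels and num_heads exposed as arguments. The import path, constructor keywords, and the build_mlm_perceiver helper are my assumptions from reading the modeling_perceiver source, so they may not match every transformers version.

```python
# Rough sketch: the pieces PerceiverForMaskedLM wires up internally, with the
# decoder's hardcoded values exposed as arguments. Constructor kwargs and the
# import path are assumptions from reading modeling_perceiver and may differ
# between transformers versions.
from transformers import PerceiverConfig, PerceiverModel
from transformers.models.perceiver.modeling_perceiver import (
    PerceiverBasicDecoder,
    PerceiverEmbeddingDecoder,
    PerceiverTextPreprocessor,
)


def build_mlm_perceiver(config: PerceiverConfig, decoder_qk_channels=8 * 32, decoder_num_heads=8):
    preprocessor = PerceiverTextPreprocessor(config)
    decoder = PerceiverBasicDecoder(
        config,
        output_num_channels=config.d_latents,           # size of output if rescaling
        output_index_dims=config.max_position_embeddings,
        num_channels=config.d_model,                     # q_dim of the decoding cross attention
        qk_channels=decoder_qk_channels,                 # was hardcoded to 8 * 32
        v_channels=config.d_model,
        num_heads=decoder_num_heads,                     # was hardcoded to 8
        use_query_residual=False,
        final_project=False,
        trainable_position_encoding_kwargs=dict(
            num_channels=config.d_model,
            index_dims=config.max_position_embeddings,
        ),
    )
    model = PerceiverModel(config, input_preprocessor=preprocessor, decoder=decoder)
    # PerceiverForMaskedLM additionally applies a PerceiverEmbeddingDecoder to
    # map the decoder output back onto the vocabulary.
    embedding_decoder = PerceiverEmbeddingDecoder(config)
    return model, embedding_decoder
```

Nothing about the decoder forces 8 * 32 query/key channels or 8 heads; they only happen to be fixed at the PerceiverForMaskedLM level, so exposing them as extra config fields (or constructor arguments, as above) looks straightforward.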