## Figuring out Step 3: Decoding

Step 3 is interesting because it takes both the unencoded embedding vectors (the inputs) and the encoded hidden states as input.

`PerceiverBasicDecoder` configuration parameters:

- `output_num_channels = d_latents`
- `output_index_dims = max_position_embeddings`
- `num_channels = d_model`
- `v_channels = d_model`
- `qk_channels = 8 * 32`
- `num_heads = 8`
- `use_query_residual = False`
- `final_project = False`

### Step 3a: Forming the decoder query

Hopefully this is incredibly simple.

```python
# decoder_query(inputs):
pos_emb = self.output_position_encodings(batch_size)
pos_emb = self.positions_projection(pos_emb)
pos_emb = [inputs_without_pos, pos_emb]  # concatenated, conditioned on a flag I haven't checked
return pos_emb

# __init__:
output_position_encodings, positions_projection = build_position_encoding('trainable', **)  # kwargs elided

# build_position_encoding:
return PerceiverTrainablePositionEncoding(), nn.Linear(channels, project_pos_dim)
```

`PerceiverTrainablePositionEncoding` is just some trainable parameters, like before. So basically `decoder_query` takes `inputs_without_pos` and some trainable parameters and concatenates them together.

I'd been leaving `inputs_without_pos` out. Revisiting `PerceiverTextPreprocessor`, it looks like `None` is returned for that output. So `decoder_query` just returns some trainable parameters that are only conditioned on the input for dimension sizing (the batch size).
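
To make the shape story concrete, here's a minimal sketch of what `decoder_query` reduces to for text. This is my paraphrase, not the actual HF class: `TrainableDecoderQuery`, `index_dims`, and the illustrative values `index_dims=2048` / `num_channels=768` are stand-ins I chose for `output_index_dims` (`max_position_embeddings`) and `d_model`.

```python
from typing import Optional

import torch
import torch.nn as nn


class TrainableDecoderQuery(nn.Module):
    """Sketch of decoder-query formation (not the HF implementation)."""

    def __init__(self, index_dims: int, num_channels: int, project_pos_dim: int = -1):
        super().__init__()
        # One trainable embedding per output position — this is essentially
        # all PerceiverTrainablePositionEncoding is.
        self.position_embeddings = nn.Parameter(0.02 * torch.randn(index_dims, num_channels))
        # Optional projection of the position embeddings; identity when unused.
        self.positions_projection = (
            nn.Linear(num_channels, project_pos_dim) if project_pos_dim > 0 else nn.Identity()
        )

    def forward(
        self, batch_size: int, inputs_without_pos: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        # Broadcast the learned embeddings across the batch: (B, index_dims, C).
        pos_emb = self.position_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
        pos_emb = self.positions_projection(pos_emb)
        if inputs_without_pos is not None:
            # Concatenate with the raw inputs along the channel dimension; for
            # text the preprocessor returns None here, so this is skipped.
            pos_emb = torch.cat([inputs_without_pos, pos_emb], dim=-1)
        return pos_emb


# For text, the query is just the learned embeddings broadcast to the batch:
query = TrainableDecoderQuery(index_dims=2048, num_channels=768)(batch_size=4)
print(query.shape)  # torch.Size([4, 2048, 768])
```

The upshot: nothing about the input *content* reaches the decoder query; the input only determines the batch dimension of the broadcast.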