I typed Step 2 almost fully out here, but the browser window closed on me and it disappeared. Anyway: in Step 2, the input embedding vectors are fed into PerceiverEncoder as "inputs". PerceiverEncoder runs "cross attention" between them and the passed "hidden states", which appear to be the "embeddings" property of PerceiverModel, left trainable in PerceiverForMaskedLM. The hidden states act as the queries and the inputs as the keys and values, so it is the hidden states that get updated; they are then passed through a sequence of "self attention" layers.

Step 2: Encoding

Embedding vectors + attention mask
  => PerceiverEncoder.cross_attention
  -> loop: [hidden states -> PerceiverEncoder.self_attends -> hidden states]
  -> hidden states

Hidden states are just tensors, i.e. n-dimensional arrays of numbers.

Simplification:

Masked embedding vectors
  -> PerceiverEncoder cross attention with embedding parameters
  -> PerceiverEncoder self attention stack
  -> Encoded hidden states

But really the masking is done inside PerceiverEncoder, applied during the cross attention step.
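To make Step 2 concrete, here is a minimal sketch of that flow. This is not the real Hugging Face implementation: the Toy* class names, the single-head attention, and all the dimensions are made up for illustration, and the real PerceiverEncoder wraps the same skeleton in layer norms, MLP blocks, and multi-head attention. Only the shape of the computation mirrors the description above: trainable latents cross-attend to the (masked) inputs, then a self-attention loop refines them.

```python
# Minimal sketch of the Step 2 flow, NOT the real HF PerceiverEncoder:
# trainable latents ("hidden states") cross-attend to the input embeddings,
# then a stack of self-attention layers refines the latents.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCrossAttention(nn.Module):
    def __init__(self, latent_dim, input_dim):
        super().__init__()
        self.q = nn.Linear(latent_dim, latent_dim)  # queries from latents
        self.k = nn.Linear(input_dim, latent_dim)   # keys from inputs
        self.v = nn.Linear(input_dim, latent_dim)   # values from inputs

    def forward(self, latents, inputs, attention_mask=None):
        # latents: (batch, num_latents, latent_dim), inputs: (batch, seq_len, input_dim)
        q, k, v = self.q(latents), self.k(inputs), self.v(inputs)
        scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
        if attention_mask is not None:
            # mask padded input positions; the real encoder does this internally too
            scores = scores.masked_fill(attention_mask[:, None, :] == 0, float("-inf"))
        return F.softmax(scores, dim=-1) @ v  # updated hidden states


class ToySelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, latents):
        q, k, v = self.q(latents), self.k(latents), self.v(latents)
        scores = q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v


class ToyPerceiverEncoder(nn.Module):
    def __init__(self, num_latents=8, latent_dim=16, input_dim=32, num_self_attends=4):
        super().__init__()
        # trainable latents, standing in for PerceiverModel's "embeddings" property
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.cross_attention = ToyCrossAttention(latent_dim, input_dim)
        self.self_attends = nn.ModuleList(
            ToySelfAttention(latent_dim) for _ in range(num_self_attends)
        )

    def forward(self, inputs, attention_mask=None):
        hidden_states = self.latents.expand(inputs.shape[0], -1, -1)
        # cross attention: hidden states query the (masked) input embeddings
        hidden_states = self.cross_attention(hidden_states, inputs, attention_mask)
        # loop: hidden states -> self_attends -> hidden states
        for layer in self.self_attends:
            hidden_states = layer(hidden_states)
        return hidden_states  # encoded hidden states


encoder = ToyPerceiverEncoder()
inputs = torch.randn(2, 10, 32)            # batch of embedding vectors
mask = torch.ones(2, 10, dtype=torch.long) # attention mask (1 = keep)
mask[:, 7:] = 0                            # pretend the tail is padding
print(encoder(inputs, mask).shape)         # torch.Size([2, 8, 16])
```

Note that the output shape is (batch, num_latents, latent_dim) no matter how long the input sequence is; that bottleneck is the whole point of the cross attention step.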