write a transformer layer in vector pseudocode, doing out the arithmetic fully for all machine learning operators

https://arweave.net/DAvmRUl5yNsmv62NPTp3xR7S33s4RfgaGkWTWzVxq-c#minblockheight=1150056&minblockhash=cL5GTqicaEZmO9grHVniVFR8nmkIF7Odkc8RfcarNdY6TRy7afW6cXJGis3qNcWb

A transformer layer consists of a multi-head self-attention mechanism followed by a position-wise feed-forward network. Here's the vector pseudocode for a transformer layer:

1. Input: X (input matrix of shape [batch_size, seq_length, d_model])

2. Multi-head self-attention:
   a. Initialize per-head weight matrices W_Q_i, W_K_i, W_V_i of shape [d_model, d_k] for each head i, and a shared output projection W_O of shape [num_heads * d_k, d_model]
   b. For each head i in num_heads:
      i.   Compute query matrix Q_i = X * W_Q_i
      ii.  Compute key matrix K_i = X * W_K_i
      iii. Compute value matrix V_i = X * W_V_i
      iv.  Compute attention scores S_i = softmax(Q_i * K_i^T / sqrt(d_k)), with softmax applied row-wise over the key positions
      v.   Compute attention output A_i = S_i * V_i
   c. Concatenate all attention outputs along the feature dimension: A = [A_1, A_2, ..., A_num_heads]
   d. Compute final multi-head attention output: MHA = A * W_O

3. Add and normalize:
   a. Add input and multi-head attention output: Y = X + MHA
   b. Normalize Y: Y_norm = LayerNorm(Y)

4. Position-wise feed-forward network:
   a. Initialize weight matrices W_1 of shape [d_model, d_ff] and W_2 of shape [d_ff, d_model], and biases b_1, b_2
   b. Compute intermediate representation: Z = ReLU(Y_norm * W_1 + b_1)
   c. Compute feed-forward output: FF = Z * W_2 + b_2

5. Add and normalize:
   a. Add normalized input and feed-forward output: Y' = Y_norm + FF
   b. Normalize Y': Y'_norm = LayerNorm(Y')

6. Output: Y'_norm (output matrix of shape [batch_size, seq_length, d_model])

Note that this pseudocode describes a single transformer layer, parameterized by the number of attention heads (num_heads), the dimension of the input embeddings (d_model), the per-head dimension of the queries, keys, and values (d_k), and the hidden dimension of the position-wise feed-forward network (d_ff). In practice, multiple transformer layers are stacked to form the complete transformer architecture. A NumPy sketch of the individual operators, and of the full layer, follows below.
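To honor the "doing out the arithmetic fully" part of the request, here is a minimal NumPy sketch of the three operators the pseudocode invokes by name: softmax, LayerNorm, and ReLU. The max-subtraction inside softmax and the eps inside the square root are standard numerical-stability details that the pseudocode leaves implicit.

```python
import numpy as np

def softmax(x, axis=-1):
    # softmax(x)_j = exp(x_j) / sum_k exp(x_k); the row max is subtracted
    # first so the exponentials cannot overflow (a stability detail the
    # pseudocode leaves implicit)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    # LayerNorm(x) = gamma * (x - mean) / sqrt(var + eps) + beta,
    # with mean and variance taken over the last (feature) axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)
```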
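And here is a sketch of steps 1 through 6 assembled into one function, reusing the operators above. It follows the post-layer-norm ordering exactly as written in the list; the `params` dict layout, the gamma/beta names for the LayerNorm scale and shift, and the 0.02 initialization scale in the usage example are illustrative assumptions, not part of the original pseudocode.

```python
def transformer_layer(X, params, num_heads, d_k):
    # X: [batch_size, seq_length, d_model]
    # Step 2: multi-head self-attention, one head at a time
    heads = []
    for i in range(num_heads):
        Q = X @ params["W_Q"][i]                               # [B, T, d_k]
        K = X @ params["W_K"][i]                               # [B, T, d_k]
        V = X @ params["W_V"][i]                               # [B, T, d_k]
        S = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # [B, T, T]
        heads.append(S @ V)                                    # [B, T, d_k]
    A = np.concatenate(heads, axis=-1)       # [B, T, num_heads * d_k]
    MHA = A @ params["W_O"]                  # [B, T, d_model]

    # Step 3: residual connection, then LayerNorm
    Y = layer_norm(X + MHA, params["gamma1"], params["beta1"])

    # Step 4: position-wise feed-forward network
    Z = relu(Y @ params["W_1"] + params["b_1"])   # [B, T, d_ff]
    FF = Z @ params["W_2"] + params["b_2"]        # [B, T, d_model]

    # Step 5: second residual connection, then LayerNorm (step 6: output)
    return layer_norm(Y + FF, params["gamma2"], params["beta2"])


# Usage example with illustrative sizes and a random initialization
B, T, d_model, num_heads, d_k, d_ff = 2, 8, 64, 4, 16, 256
rng = np.random.default_rng(0)
params = {
    "W_Q": rng.normal(size=(num_heads, d_model, d_k)) * 0.02,
    "W_K": rng.normal(size=(num_heads, d_model, d_k)) * 0.02,
    "W_V": rng.normal(size=(num_heads, d_model, d_k)) * 0.02,
    "W_O": rng.normal(size=(num_heads * d_k, d_model)) * 0.02,
    "W_1": rng.normal(size=(d_model, d_ff)) * 0.02,
    "b_1": np.zeros(d_ff),
    "W_2": rng.normal(size=(d_ff, d_model)) * 0.02,
    "b_2": np.zeros(d_model),
    "gamma1": np.ones(d_model), "beta1": np.zeros(d_model),
    "gamma2": np.ones(d_model), "beta2": np.zeros(d_model),
}
out = transformer_layer(rng.normal(size=(B, T, d_model)), params, num_heads, d_k)
assert out.shape == (B, T, d_model)
```

A production implementation would fuse the per-head projections into single [d_model, num_heads * d_k] matrices rather than looping over heads, but the explicit loop mirrors step 2b of the pseudocode most directly.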