
- A large pretrained model that has a significant understanding of English logic and knowledge could be finetuned on bytes by training Perceiver-like cross-attention embedding/tokenization encoders and decoders to match the behavior of its original tokenizer and embeddings while accepting byte streams.
- The Perceiver masked-LM model builds its cross attention roughly as PerceiverLayer(config, is_cross_attention=True, qk_channels=qk_channels, v_channels=v_channels, num_heads=num_heads, q_dim=q_dim, kv_dim=kv_dim, widening_factor=config.widening_factor, use_query_residual=config.use_query_residual) and calls it roughly as cross_attention(trained_embeddings, attention_mask=None, head_mask=None, inputs=inputs, inputs_mask=inputs_mask). A hedged sketch of wiring this up as a byte encoder follows this list.
- I'm curious what memory and computation bounds the mainstream trained models place on input size. Could we feed an entire binary in? Could we feed an entire tarball in? A rough back-of-envelope estimate follows the encoder sketch below.
- I'm curious what the state of the large-input models is, like BigBird. Are they helpful here?
- I'd also like to run the current model on Colab by finding a workable trick to prevent the compilation crash, either by compiling in smaller chunks or by using a different framework, possibly without compilation.
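
Here is a minimal sketch of the encoder half of the first two bullets, assuming the internal PerceiverLayer/PerceiverConfig classes from Hugging Face transformers (transformers.models.perceiver.modeling_perceiver, an internal API that can shift between versions). The sizes BYTE_VOCAB, D_MODEL, and NUM_QUERIES are hypothetical placeholders, and I pass config.cross_attention_widening_factor for the widening factor, which is the attribute the library's own encoder uses. This is a sketch of the idea, not a verified finetuning recipe.

```python
# Sketch only: a byte-stream encoder built from one Perceiver cross-attention layer.
# Assumes transformers' internal PerceiverLayer / PerceiverConfig; all sizes are
# hypothetical placeholders, not values from any particular checkpoint.
import torch
import torch.nn as nn
from transformers import PerceiverConfig
from transformers.models.perceiver.modeling_perceiver import PerceiverLayer

BYTE_VOCAB = 256 + 2   # hypothetical: raw byte values plus pad/mask ids
D_MODEL = 768          # hypothetical: hidden size of the pretrained LM to match
NUM_QUERIES = 512      # hypothetical: number of "token slots" to hand the LM


class ByteCrossAttentionEncoder(nn.Module):
    """Cross-attend learned queries over byte embeddings to get LM-sized vectors."""

    def __init__(self):
        super().__init__()
        config = PerceiverConfig(d_model=D_MODEL, d_latents=D_MODEL,
                                 num_cross_attention_heads=8)
        self.byte_embed = nn.Embedding(BYTE_VOCAB, D_MODEL)
        # Learned queries play the role of the "trained_embeddings" latent array.
        self.queries = nn.Parameter(torch.randn(NUM_QUERIES, D_MODEL) * 0.02)
        self.cross_attention = PerceiverLayer(
            config,
            is_cross_attention=True,
            qk_channels=None,
            v_channels=None,
            num_heads=config.num_cross_attention_heads,
            q_dim=D_MODEL,   # dimension of the query/latent side
            kv_dim=D_MODEL,  # dimension of the byte-embedding (key/value) side
            widening_factor=config.cross_attention_widening_factor,
            use_query_residual=config.use_query_residual,
        )

    def forward(self, byte_ids):
        # byte_ids: (batch, n_bytes) integer tensor of raw byte values.
        kv = self.byte_embed(byte_ids)                      # (batch, n_bytes, D_MODEL)
        q = self.queries.unsqueeze(0).expand(byte_ids.size(0), -1, -1)
        # inputs_mask would need to be an additive mask like PerceiverModel builds
        # internally; omitted here for simplicity.
        layer_outputs = self.cross_attention(
            q, attention_mask=None, head_mask=None, inputs=kv, inputs_mask=None
        )
        return layer_outputs[0]                             # (batch, NUM_QUERIES, D_MODEL)


# Usage sketch: 4 KiB of pretend bytes -> 512 embeddings the pretrained LM could be
# finetuned to accept, trained to imitate its original tokenizer+embedding output.
encoder = ByteCrossAttentionEncoder()
fake_bytes = torch.randint(0, 256, (1, 4096))
print(encoder(fake_bytes).shape)  # torch.Size([1, 512, 768])
```

A matching decoder could run the same kind of cross attention in the other direction, with byte-position queries attending over the LM's output states; both halves could then be trained to reproduce the original tokenizer-plus-embedding behavior before touching the LM itself.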
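
On the input-size question, here is a rough back-of-envelope estimate; the numbers are my own arithmetic under assumed sizes (8 heads, fp16 scores, 512 latent queries, a 10 MiB input), not measurements of any real model. It only counts one layer's attention-score matrix, which is enough to show why dense self-attention over a whole binary is out of reach while Perceiver-style cross attention at least scales linearly in the number of input bytes.

```python
# Back-of-envelope sketch, not measured numbers: fp16 memory for one layer's
# attention-score matrix, dense self-attention vs. Perceiver-style cross attention
# with a fixed number of latent queries.
def attn_score_bytes(n_keys, n_queries, num_heads=8, bytes_per_el=2):
    return n_keys * n_queries * num_heads * bytes_per_el

MiB, GiB, TiB = 1024 ** 2, 1024 ** 3, 1024 ** 4
n_bytes = 10 * MiB  # hypothetical: a 10 MiB binary fed in as raw bytes

dense = attn_score_bytes(n_bytes, n_bytes)    # self-attention: n x n scores
perceiver = attn_score_bytes(n_bytes, 512)    # cross-attention: n x 512 scores

print(f"dense self-attention scores:      ~{dense / TiB:,.0f} TiB per layer")
print(f"perceiver cross-attention scores: ~{perceiver / GiB:,.0f} GiB")
```

This prints roughly 1,600 TiB for the dense case versus 80 GiB for the cross-attention case. Even the linear term adds up, though: the byte embeddings alone for 10 MiB of input at 768 fp16 channels are about 15 GiB before any activations, so an entire tarball would still seem to need chunking or streaming rather than a single forward pass.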