oh and for completing the vocab: I suppose you'd pass the whole recording through the original model and then add any words from its output that the tokenizer doesn't have yet. that should cover enough words I think.
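rough sketch of the vocab step, with heavy assumptions: the tokenizer vocab is treated as a plain `dict` from word to id with contiguous ids, `transcript` stands in for whatever the original model outputs on the recording, and `extend_vocab` is a made-up helper name. a real tokenizer would have its own way of adding tokens.

```python
import re

def extend_vocab(vocab, transcript):
    """Add every word in `transcript` that `vocab` lacks; return the new words in order."""
    added = []
    for word in re.findall(r"[A-Za-z0-9_']+", transcript.lower()):
        if word not in vocab:
            vocab[word] = len(vocab)  # next free id, assuming ids are 0..len-1
            added.append(word)
    return added

# hypothetical starting vocab and model output, just for illustration
vocab = {"<unk>": 0, "the": 1, "function": 2}
transcript = "the function returns the tuple"
new_words = extend_vocab(vocab, transcript)
```

whether one pass over the recording really surfaces enough words would need checking against the actual data.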
then the random noise stuff would preserve the non-code text
but more interesting to map state machines to logit matrices
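one way to read the state machine idea, as a sketch only: each FSM state gets a row in a matrix over the vocab, with 0.0 where a token is allowed and -inf where it isn't, and that row gets added to the model's logits before picking a token. the toy vocab, the transition table, and the names `fsm_to_logit_matrix` and `constrained_greedy_step` are all made up for illustration.

```python
import math

NEG_INF = -math.inf

def fsm_to_logit_matrix(num_states, vocab_size, transitions):
    """Rows are states, columns are token ids: 0.0 if allowed, -inf otherwise."""
    matrix = [[NEG_INF] * vocab_size for _ in range(num_states)]
    for (state, token_id) in transitions:
        matrix[state][token_id] = 0.0
    return matrix

def constrained_greedy_step(logits, state, matrix, transitions):
    """Add the state's mask row to the logits, take the argmax, advance the FSM."""
    masked = [l + m for l, m in zip(logits, matrix[state])]
    token_id = max(range(len(masked)), key=masked.__getitem__)
    return token_id, transitions[(state, token_id)]

# toy vocab: 0 = "(", 1 = "x", 2 = ")"
# toy FSM: state 0 expects "(", state 1 expects "x", state 2 expects ")",
# state 3 is terminal and has no outgoing transitions
transitions = {(0, 0): 1, (1, 1): 2, (2, 2): 3}
mask = fsm_to_logit_matrix(num_states=4, vocab_size=3, transitions=transitions)

logits = [0.1, 2.0, 0.5]  # unconstrained argmax would be token 1
token_id, next_state = constrained_greedy_step(logits, 0, mask, transitions)
```

in state 0 the mask leaves only "(" finite, so the step picks token 0 even though the raw logits prefer token 1, and the FSM moves to state 1.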