import inspect
print(inspect.getsource(s2t.model.generate))

It looks like this speech-to-text model generates output the same way as the text-based language models that got big around OpenAI: autoregressively, calculating each next token one after the other. This basically means it shouldn't be hard to convert it to arbitrary-length transcription simply by writing a new generation loop that shifts old input off as it goes. It also means that at every step it likely outputs a confidence (a logit) for every possible token it could emit. So if the right tokens are represented in the vocabulary, we could grow the parts of the model that produce them, by backpropagating a loss that targets those tokens. Rough sketches of both ideas are below.

This model isn't great for this overall approach, though, because its tokenizer is focused on conversational speech rather than the symbols and keywords of source code. That's mostly what you're going to find nowadays: models focused on conversational speech.
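A minimal sketch of the streaming loop, assuming a made-up model(features, tokens) call that returns next-token logits, plus a context window size and special token ids (none of these names come from the real model's API):

import torch

WINDOW = 1024    # assumed decoder context length
BOS, EOS = 0, 2  # assumed special token ids

def generate_streaming(model, features, max_tokens=100_000):
    # Decode autoregressively, shifting old tokens off the left edge of
    # the context once it fills up, so output length is unbounded.
    tokens = [BOS]
    out = []
    with torch.no_grad():
        for _ in range(max_tokens):
            context = torch.tensor(tokens[-WINDOW:])
            logits = model(features, context)  # (vocab_size,) next-token scores
            next_id = int(logits.argmax())
            if next_id == EOS:
                break
            tokens.append(next_id)
            out.append(next_id)
    return out

(A real version would also slide the audio features forward in step with the text, but the token side is the same idea.)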
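And a sketch of the second idea: backpropagating a loss on the per-token logits so the weights that produce the desired tokens get strengthened. Again the model(features, tokens) interface is a stand-in, here assumed to return logits for the whole sequence:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, features, target_tokens):
    # Teacher forcing: feed the targets shifted right by one position.
    inputs = target_tokens[:-1]
    targets = target_tokens[1:]
    logits = model(features, inputs)         # (seq_len, vocab_size)
    # Cross-entropy raises the logits of the correct tokens relative to
    # the rest, which is the "grow the parts that produce them" effect
    # described above.
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()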