In an attempt to address the length issue, I visited the model from the example; it looks like the helpful information is in its config files at https://huggingface.co/facebook/s2t-small-librispeech-asr/tree/main. In preprocessor_config.json we have feature_size = 80, and in config.json we have max_input_features = 6000 or something similar. I don't know for sure what those mean: what I need is the size of the model's input position embeddings and the downsampling ratio of the tokenizer, and I'm guessing those correspond to the max_input_features and feature_size configs (a quick sketch of reading them programmatically is at the end of this note).

When I change the chunksize from 128 * 1024 to 80 * 600, I get much better output. It does still begin making errors toward the end of the chunk, where the Logo code is likely to start:

["and here we want to show you that you can program a picture right along with us we'll use a single color some unorthodox functions and each line we'll put a bit of nature's masterpieces right here on our canvas to day we'll have them run all the functions across the stream right now that you need to program along with us starting with the simple one to dist colin why zero coal and one"]

Next: add a different model, hopefully one trained on different data.
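For future reference, here's a minimal sketch of pulling those two values programmatically with the transformers library, assuming the Speech2Text classes. (If I'm reading the repo right, the field in config.json is actually named max_source_positions, which matches the 6000 above; feature_size is the number of mel filterbank bins per feature frame.)

```python
from transformers import Speech2TextConfig, Speech2TextFeatureExtractor

model_id = "facebook/s2t-small-librispeech-asr"

# From preprocessor_config.json: mel filterbank bins per feature frame.
extractor = Speech2TextFeatureExtractor.from_pretrained(model_id)
print(extractor.feature_size)  # 80

# From config.json: how many feature frames the encoder's position
# embeddings can cover (the "6000 or something similar" above).
config = Speech2TextConfig.from_pretrained(model_id)
print(config.max_source_positions)  # 6000

# The chunk size that worked empirically: 80 * 600.
chunk_size = extractor.feature_size * 600
print(chunk_size)  # 48000
```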