To find models trained on other data, I used https://huggingface.co/datasets?task_ids=task_ids:automatic-speech-recognition&sort=downloads to browse the other speech-recognition datasets in the same repository. Here's a model finetuned on Common Voice: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english

I'm worried, though, because it's just a finetuning, on different data, of a model pretrained on the same data, so the two models aren't actually very different. If models with truly separate training sets are hard to find, an approach other than comparing two such models may be more interesting. I'm also worried that this model is much larger, which could really slow down development, although it's probably not hard to reproduce their finetuning on a smaller model. It would be good to get experience using more than one model for this, but I may not complete the challenge if I get distracted. It's unfortunate that my planned approach is looking difficult. I'll try this other model and see whether it's reasonable to use.

Here's the example from the model page:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
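Since I'm claiming it's probably not hard to redo their finetuning on a smaller model, here's a minimal sketch of what that could look like. This is my own guess, not from the model page: the smaller checkpoint ("facebook/wav2vec2-base-960h"), the train_step helper, and the bare optimizer loop are all assumptions, and a real reproduction would need a proper data collator, learning-rate schedule, and evaluation on top of this.

# Minimal finetuning sketch on a smaller checkpoint -- my assumptions, not the
# model page's recipe. Assumes batches of 16 kHz float arrays with matching
# transcripts already prepared.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

SMALL_MODEL_ID = "facebook/wav2vec2-base-960h"  # assumed smaller English checkpoint

processor = Wav2Vec2Processor.from_pretrained(SMALL_MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(SMALL_MODEL_ID)
model.freeze_feature_encoder()  # usual practice: keep the conv feature encoder frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(speech_arrays, sentences):
    # speech_arrays: list of 16 kHz float arrays; sentences: matching transcripts,
    # normalized (e.g., uppercased) to match this tokenizer's vocabulary
    inputs = processor(speech_arrays, sampling_rate=16_000,
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(sentences, return_tensors="pt", padding=True).input_ids
    # CTC loss should ignore padded label positions
    labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)
    loss = model(inputs.input_values,
                 attention_mask=inputs.get("attention_mask"),  # base models may not return one
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

If a step like that runs at a reasonable speed, it would address the model-size worry without giving up the two-model comparison.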