[ot][spam][crazy] Quickly autotranscribing xkcd 4/1 correctly

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Sat Apr 2 04:08:10 PDT 2022


To find models trained on other data, I used
https://huggingface.co/datasets?task_ids=task_ids:automatic-speech-recognition&sort=downloads
to browse other speech recognition datasets on the same hub.
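
The same search can also be done from code. Here's a sketch using the
huggingface_hub client; the filter string is my guess at the tag behind
the URL above, not something I've verified:

from huggingface_hub import HfApi

api = HfApi()
# datasets carrying the speech recognition task tag
datasets = api.list_datasets(filter="task_ids:automatic-speech-recognition")
for ds in list(datasets)[:10]:
    print(ds.id)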

Here's one that uses Common Voice:
https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english

I'm worried, because this model is just a finetuning, onto different
data, of a base model trained on the same data. That isn't very
different from what I already have. If genuinely independent models
are hard to find, an approach other than comparing two models with
different training sets may be more interesting. I'm also worried that
this model is much larger, which could really slow down development,
though it's probably not hard to reproduce their finetuning work on a
smaller model.
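
For a concrete sense of the size difference, the parameter counts can
be compared directly. A minimal sketch; facebook/wav2vec2-base-960h is
my pick as an example of a smaller checkpoint, not one mentioned above:

from transformers import Wav2Vec2ForCTC

# compare parameter counts of a base-sized and a large checkpoint
for name in ["facebook/wav2vec2-base-960h",
             "jonatasgrosman/wav2vec2-large-xlsr-53-english"]:
    model = Wav2Vec2ForCTC.from_pretrained(name)
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {params / 1e6:.0f}M parameters")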

It's good to develop the experience of using more than one model to
do this. But I may not end up completing this challenge if I get
distracted. It's unfortunate that my planned approach is turning out
to be difficult for me.

I'll try this other model and see whether it's reasonable to work with.
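
The quickest smoke test is probably the transformers pipeline API,
before committing to the full example below. A minimal sketch, where
"xkcd.mp3" is a hypothetical stand-in for the actual recording:

from transformers import pipeline

# wraps feature extraction, the model forward pass, and CTC decoding
asr = pipeline("automatic-speech-recognition",
               model="jonatasgrosman/wav2vec2-large-xlsr-53-english")
# passing a filename requires ffmpeg for audio decoding
print(asr("xkcd.mp3")["text"])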

Here's the example from the model page:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values,
                   attention_mask=inputs.attention_mask).logits

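# greedy CTC decoding: take the most likely token at each frame;
# batch_decode collapses repeats and strips the blank token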
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
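
To point this at the actual recording instead of Common Voice clips,
the loaded processor and model can take a single file directly (again,
"xkcd.mp3" is a hypothetical filename):

# transcribe one local file with the processor/model loaded above
speech, _ = librosa.load("xkcd.mp3", sr=16_000)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])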

