[ot][spam][crazy] Quickly autotranscribing xkcd 4/1 correctly
Undiscussed Horrific Abuse, One Victim of Many
gmkarl at gmail.com
Sat Apr 2 03:18:35 PDT 2022
Here's what I have right now. It decodes the text better than I
thought it would. I'm worried it might just output the result without
needing any further finetuning.
Shell commands:
$ wget -c https://xkcd.com/2601/radio.mp3
$ pip3 install transformers[speech,sentencepiece] datasets librosa soundfile
Python input:
print('importing libraries ...')
import torch
from transformers import (Speech2TextProcessor, Speech2TextForConditionalGeneration,
                          Wav2Vec2Tokenizer, Wav2Vec2ForCTC)
import librosa as lb
import numpy as np
class Data:
    def __init__(self, src='radio.mp3', chunksize=1024 * 128, sr=16_000, dtype=np.float32):
        self.src = src
        self.chunksize = chunksize
        self.sr = sr
        self.length = lb.get_duration(filename=self.src)
        self.dtype = dtype

    def read_one(self, offset, chunksize=None):
        if chunksize is None:
            chunksize = self.chunksize
        duration = chunksize / self.sr
        print(f'reading {duration}s at {offset}s ...')
        data, sr = lb.load(self.src, sr=self.sr, offset=offset,
                           duration=duration, dtype=self.dtype)
        return data
    def read_random(self, ct=1):
        duration = self.chunksize / self.sr
        return np.stack([self.read_one(np.random.random() * (self.length - duration))
                         for idx in range(ct)])
    def read_chunks(self, ct=1, offset=0):
        chunksize = self.chunksize
        data = self.read_one(offset, chunksize * ct)
        return data.reshape((ct, chunksize))

class S2T:
    def __init__(self, model = "facebook/s2t-small-librispeech-asr", sr = 16_000):
        self.sr = sr
        self.model = Speech2TextForConditionalGeneration.from_pretrained(model)
        self.processor = Speech2TextProcessor.from_pretrained(model)

    def tokenize(self, inputs):
        print('tokenizing ...')
        input_ids = self.processor(inputs, sampling_rate=self.sr,
                                   return_tensors='pt')
        return input_ids['input_features'], input_ids['attention_mask']

    def forward(self, feature_ids, attention_mask):
        print('passing data thru model ...')
        return self.model.generate(inputs=feature_ids,
                                   attention_mask=attention_mask)

    def detokenize(self, generated_ids):
        print('detokenizing output ...')
        return self.processor.batch_decode(generated_ids)
print('constructing structures...')
data = Data()
s2t = S2T()
feature_ids, attention_mask = s2t.tokenize(data.read_chunks(1)[0])
generated_ids = s2t.forward(feature_ids, attention_mask)
outputs = s2t.detokenize(generated_ids)
print(outputs)
Python output:
importing libraries ...
constructing structures...
reading 8.192s at 0s ...
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165:
UserWarning: PySoundFile failed. Trying audioread instead.
warnings.warn("PySoundFile failed. Trying audioread instead.")
tokenizing ...
passing data thru model ...
/usr/local/lib/python3.7/dist-packages/transformers/models/speech_to_text/modeling_speech_to_text.py:559:
UserWarning: __floordiv__ is deprecated, and its behavior will change
in a future version of pytorch. It currently rounds toward 0 (like the
'trunc' function NOT 'floor'). This results in incorrect rounding for
negative values. To keep the current behavior, use torch.div(a, b,
rounding_mode='trunc'), or for actual floor division, use torch.div(a,
b, rounding_mode='floor').
input_lengths = (input_lengths - 1) // 2 + 1
detokenizing output ...
["and here we want to show you that you can program a picture right
along with us we'll use a single color some unorthodox fine fun'll use
a single color some unorthodox fine fine fun'd to show you that you
can program a picture and here we want to show you that you that you
can program a picture right along to show you that you that you can
program of picture right along with us we want to show you that you
that you that you that you that you that you that you can program a
picture right along with a picture right along with us"]
Okay, it does indeed start messing up, as planned. Strangely, it
repeats the same text over and over: I'm guessing that means I'm giving it a
longer input sequence than it was trained on, so it loses track
of where it is in the input (the Fourier position
embeddings behave somewhat like bitwise overflow, wrapping
around).
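If that's the cause, one way around it (a minimal, untested sketch building
on the Data and S2T classes above; the 15-second window is only a guess at a
safe input length, not something checked against the model's training setup)
would be to transcribe the file in shorter windows and join the pieces:

def transcribe_windows(data, s2t, window_s=15.0):
    # transcribe window_s-second chunks so each input stays short,
    # then join the per-window transcripts into one string
    texts = []
    offset = 0.0
    while offset < data.length:
        audio = data.read_one(offset, int(window_s * data.sr))
        if len(audio) == 0:
            break
        feature_ids, attention_mask = s2t.tokenize(audio)
        generated_ids = s2t.forward(feature_ids, attention_mask)
        texts.extend(s2t.detokenize(generated_ids))
        offset += window_s
    return ' '.join(texts)

print(transcribe_windows(data, s2t))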