[ot][spam][crazy] Quickly autotranscribing xkcd 4/1 correctly
I'd like to do it _without_ transformer models, but since I can barely control my body, I'll plan to use them. Here's the plan:
- we find two different speech-to-text models
- we upload them and the audio data to e.g. google colab
- loop:
  - forward pass with both on the data
  - the difference in their output is used as their loss
  - backward pass with both

The challenge is whether (a) implementation mistakes and (b) the possibility of design error can both be resolved before the correct transcription reaches explainxkcd.com or the github linked in the other post. Then, if we do it in time, we see who else has done it and how much faster they were than we were.
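As a sketch of what that loop might look like (hedged: this assumes two CTC-style models that happen to share a character vocabulary and frame rate, which real pairs of checkpoints generally don't, so their outputs would first need mapping into a common space; the model names are just examples):

import torch
import torch.nn.functional as F
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

NAME_A = "facebook/wav2vec2-base-960h"                    # example checkpoints;
NAME_B = "jonatasgrosman/wav2vec2-large-xlsr-53-english"  # vocabularies differ in practice
model_a = Wav2Vec2ForCTC.from_pretrained(NAME_A)
model_b = Wav2Vec2ForCTC.from_pretrained(NAME_B)
processor = Wav2Vec2Processor.from_pretrained(NAME_A)
opt = torch.optim.Adam(list(model_a.parameters()) + list(model_b.parameters()), lr=1e-5)

def step(waveform_chunk):
    inputs = processor(waveform_chunk, sampling_rate=16_000, return_tensors="pt")
    logits_a = model_a(inputs.input_values).logits   # (batch, frames, vocab)
    logits_b = model_b(inputs.input_values).logits
    # use the disagreement between the two output distributions as the loss
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    loss = (F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean")
            + F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()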
here is example code for speech transcription from the huggingface docs. note that this approach is for processing short chunks of speech, not a long recording. the dataset obscures how the data is provided to the model, but it is going to be just some kind of array of numbers.

import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
import soundfile as sf

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")

def map_to_array(batch):
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)
Here is the python documentation on the soundfile.read function. We'd be reading in chunks that are appropriately sized for the transformer model in use, and resampling them to match the sample rate the model expects.

Help on function read in module soundfile:

read(file, frames=-1, start=0, stop=None, dtype='float64', always_2d=False, fill_value=None, out=None, samplerate=None, channels=None, format=None, subtype=None, endian=None, closefd=True)
    Provide audio data from a sound file as NumPy array.

    By default, the whole file is read from the beginning, but the position to
    start reading can be specified with `start` and the number of frames to
    read can be specified with `frames`. Alternatively, a range can be
    specified with `start` and `stop`.

    If there is less data left in the file than requested, the rest of the
    frames are filled with `fill_value`. If no `fill_value` is specified, a
    smaller array is returned.

    Parameters
    ----------
    file : str or int or file-like object
        The file to read from. See :class:`SoundFile` for details.
    frames : int, optional
        The number of frames to read. If `frames` is negative, the whole rest
        of the file is read. Not allowed if `stop` is given.
    start : int, optional
        Where to start reading. A negative value counts from the end.
    stop : int, optional
        The index after the last frame to be read. A negative value counts
        from the end. Not allowed if `frames` is given.
    dtype : {'float64', 'float32', 'int32', 'int16'}, optional
        Data type of the returned array, by default ``'float64'``. Floating
        point audio data is typically in the range from ``-1.0`` to ``1.0``.
        Integer data is in the range from ``-2**15`` to ``2**15-1`` for
        ``'int16'`` and from ``-2**31`` to ``2**31-1`` for ``'int32'``.

        .. note:: Reading int values from a float file will *not* scale the
            data to [-1.0, 1.0). If the file contains
            ``np.array([42.6], dtype='float32')``, you will read
            ``np.array([43], dtype='int32')`` for ``dtype='int32'``.

    Returns
    -------
    audiodata : numpy.ndarray or type(out)
        A two-dimensional (frames x channels) NumPy array is returned. If the
        sound file has only one channel, a one-dimensional array is returned.
        Use ``always_2d=True`` to return a two-dimensional array anyway.

        If `out` was specified, it is returned. If `out` has more frames than
        available in the file (or if `frames` is smaller than the length of
        `out`) and no `fill_value` is given, then only a part of `out` is
        overwritten and a view containing all valid frames is returned.
    samplerate : int
        The sample rate of the audio file.

    Other Parameters
    ----------------
    always_2d : bool, optional
        By default, reading a mono sound file will return a one-dimensional
        array. With ``always_2d=True``, audio data is always returned as a
        two-dimensional array, even if the audio file has only one channel.
    fill_value : float, optional
        If more frames are requested than available in the file, the rest of
        the output is filled with `fill_value`. If `fill_value` is not
        specified, a smaller array is returned.
    out : numpy.ndarray or subclass, optional
        If `out` is specified, the data is written into the given array
        instead of creating a new array. In this case, the arguments `dtype`
        and `always_2d` are silently ignored! If `frames` is not given, it is
        obtained from the length of `out`.
    samplerate, channels, format, subtype, endian, closefd
        See :class:`SoundFile`.
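For reference, chunked reading plus resampling with soundfile might look like the sketch below. The filename and chunk length are placeholders (the actual recording is an mp3, which soundfile may not open directly), and the resampling step uses librosa.resample since soundfile itself doesn't resample.

import soundfile as sf
import librosa

CHUNK_FRAMES = 131072   # placeholder chunk length in frames at the file's native rate
TARGET_SR = 16_000      # rate the speech-to-text model expects

with sf.SoundFile('radio.wav') as f:          # hypothetical wav conversion of the recording
    native_sr = f.samplerate
    while True:
        chunk = f.read(frames=CHUNK_FRAMES, dtype='float32')
        if len(chunk) == 0:
            break
        if native_sr != TARGET_SR:
            chunk = librosa.resample(chunk, orig_sr=native_sr, target_sr=TARGET_SR)
        # ... feed `chunk` to the model's processor here ...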
Unlike soundfile, librosa.load already does resampling:

Help on function load in module librosa.core.audio:

load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=<class 'numpy.float32'>, res_type='kaiser_best')
    Load an audio file as a floating point time series.

    Audio will be automatically resampled to the given rate
    (default ``sr=22050``).

    To preserve the native sampling rate of the file, use ``sr=None``.

    Parameters
    ----------
    path : string, int, pathlib.Path, soundfile.SoundFile or file-like object
        path to the input file.

        Any codec supported by `soundfile` or `audioread` will work.

        Any string file paths, or any object implementing Python's file
        interface (e.g. `pathlib.Path`) are supported as `path`.

        If the codec is supported by `soundfile`, then `path` can also be an
        open file descriptor (int) or an existing `soundfile.SoundFile` object.

        On the contrary, if the codec is not supported by `soundfile` (for
        example, MP3), then `path` must be a file path (string or
        `pathlib.Path`).
    sr : number > 0 [scalar]
        target sampling rate

        'None' uses the native sampling rate
    mono : bool
        convert signal to mono
    offset : float
        start reading after this time (in seconds)
    duration : float
        only load up to this much audio (in seconds)
    dtype : numeric type
        data type of ``y``
    res_type : str
        resample type (see note)

        .. note::
            By default, this uses `resampy`'s high-quality mode ('kaiser_best').
            For alternative resampling modes, see `resample`

        .. note::
            `audioread` may truncate the precision of the audio data to 16 bits.
            See :ref:`ioformats` for alternate loading methods.

    Returns
    -------
    y : np.ndarray [shape=(n,) or (..., n)]
        audio time series. Multi-channel is supported.
    sr : number > 0 [scalar]
        sampling rate of ``y``

    Examples
    --------
    >>> # Load an ogg vorbis file
    >>> filename = librosa.ex('trumpet')
    >>> y, sr = librosa.load(filename)
    >>> y
    array([-1.407e-03, -4.461e-04, ..., -3.042e-05,  1.277e-05], dtype=float32)
    >>> sr
    22050
Here's what I have right now. It decodes the text better than I thought it would. I'm worried it might just output the result without needing any further finetuning.

Shell commands:

$ wget -c https://xkcd.com/2601/radio.mp3
$ pip3 install transformers[speech,sentencepiece] datasets librosa soundfile

Python input:

print('importing libraries ...')
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration, Wav2Vec2Tokenizer, Wav2Vec2ForCTC
import librosa as lb
import numpy as np

class Data:
    def __init__(self, src = 'radio.mp3', chunksize = 1024 * 128, sr = 16_000, dtype = np.float32):
        self.src = src
        self.chunksize = chunksize
        self.sr = sr
        self.length = lb.get_duration(filename = self.src)
        self.dtype = dtype
    def read_one(self, offset, chunksize = None):
        if chunksize is None:
            chunksize = self.chunksize
        duration = chunksize / self.sr
        print(f'reading {duration}s at {offset}s ...')
        data, sr = lb.load(self.src, sr = self.sr, offset = offset, duration = duration, dtype = self.dtype)
        return data
    def read_random(self, ct=1):
        # pick random offsets that leave room for one chunk at the end
        return np.stack([self.read_one(np.random.random() * (self.length - self.chunksize / self.sr)) for idx in range(ct)])
    def read_chunks(self, ct=1, offset=0):
        chunksize = self.chunksize
        data = self.read_one(offset, chunksize * ct)
        return data.reshape((ct, chunksize))

class S2T:
    def __init__(self, model = "facebook/s2t-small-librispeech-asr", sr = 16_000):
        self.sr = sr
        self.model = Speech2TextForConditionalGeneration.from_pretrained(model)
        self.processor = Speech2TextProcessor.from_pretrained(model)
    def tokenize(self, inputs):
        print('tokenizing ...')
        input_ids = self.processor(inputs, sampling_rate=self.sr, return_tensors='pt')
        return input_ids['input_features'], input_ids['attention_mask']
    def forward(self, feature_ids, attention_mask):
        print('passing data thru model ...')
        return self.model.generate(inputs=feature_ids, attention_mask=attention_mask)
    def detokenize(self, generated_ids):
        print('detokenizing output ...')
        return self.processor.batch_decode(generated_ids)

print('constructing structures...')
data = Data()
s2t = S2T()

feature_ids, attention_mask = s2t.tokenize(data.read_chunks(1)[0])
generated_ids = s2t.forward(feature_ids, attention_mask)
outputs = s2t.detokenize(generated_ids)
print(outputs)

Python output:

importing libraries ...
constructing structures...
reading 8.192s at 0s ...
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
tokenizing ...
passing data thru model ...
/usr/local/lib/python3.7/dist-packages/transformers/models/speech_to_text/modeling_speech_to_text.py:559: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  input_lengths = (input_lengths - 1) // 2 + 1
detokenizing output ...
["and here we want to show you that you can program a picture right along with us we'll use a single color some unorthodox fine fun'll use a single color some unorthodox fine fine fun'd to show you that you can program a picture and here we want to show you that you that you can program a picture right along to show you that you that you can program of picture right along with us we want to show you that you that you that you that you that you that you that you can program a picture right along with a picture right along with us"] Okay, it does indeed start messing up as planned for. Strangely it says the same text over: I'm guessing that means I'm giving it a longer input sequence than it was trained for, and so it loses information on where it is in the input (the fourier position embeddings have a behavior similar to bitwise overflow, wrapping around)
In an attempt to address the length issue I visited the model from the example, and it looks like the helpful information is in its config files at https://huggingface.co/facebook/s2t-small-librispeech-asr/tree/main . In preprocessor_config.json we have feature_size = 80 and in config.json we have max_input_features = 6000 or something similar. I don't know for sure what those are: i know i need the size of the input position embeddings to the model, and the downsampling ratio of the tokenizer. I'm guessing those are the max_input_features and feature_size configs, or such.

When I change the chunksize from 128 * 1024 to 80 * 600, I get much better output. It does still begin making errors when it gets to the end of the chunk, where the logo code is likely to start:

["and here we want to show you that you can program a picture right along with us we'll use a single color some unorthodox functions and each line we'll put a bit of nature's masterpieces right here on our canvas to day we'll have them run all the functions across the stream right now that you need to program along with us starting with the simple one to dist colin why zero coal and one"]

Next: add a different model, hopefully one trained on different data
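A quick way to check what the checkpoint actually expects is to load its configs and print the relevant fields. The attribute names below (max_source_positions, feature_size, sampling_rate) are my guesses at the relevant ones from skimming those files, not something verified against the model code:

from transformers import AutoConfig, AutoFeatureExtractor

config = AutoConfig.from_pretrained("facebook/s2t-small-librispeech-asr")
extractor = AutoFeatureExtractor.from_pretrained("facebook/s2t-small-librispeech-asr")

print(config.max_source_positions)   # maximum number of input feature frames (assumed)
print(extractor.feature_size)        # mel features per frame (80)
print(extractor.sampling_rate)       # expected audio sample rate (16000)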
To find models trained on other data, I used https://huggingface.co/datasets?task_ids=task_ids:automatic-speech-recognition&sort=downloads to find other datasets in the same repository. Here's one that uses commonvoice: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english

I'm worried, because it is simply a finetuning of a model trained on the same data, onto different data. That is not very different. An approach other than two models with different training sets may be more interesting if models like that are hard to find. I'm also worried that this model is much larger, and that could really slow down development. Noting that it's probably not hard to reproduce their finetuning work on a smaller model.

It is good to develop the experience of using more than one model to do this. But I may not complete this challenge if I get distracted. It's unfortunate my planned approach is seeming difficult to me. I'll try this other model and see if it's reasonable to use. Here's the example from the model page:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
Looks like commonvoice will take about an hour to download on google's servers. That's confusing to me.

Thoughts:
- this approach would be great for recovering whispered speech from poor recordings. You'd finetune off of multiple microphones.
- it looks like it would be pretty effective to finetune this small model around some of the correct output. its initial output is very nearly correct.
- there's value around combining non-model structures here, such as the heuristic mentioned earlier. this can take more mental steps for me; i have to engage my inhibitions differently.
I'm thinking about what I could do rather than wait the hour for the dataset to download. Uhhh why is it downloading a dataset at all? Just for an example? So, I could skip the dataset download completely and reuse the example data already on the server, to verify the model functions for me. I guess I'll look into that.
This crashes after using too much ram, so I'm thinking it's not the way to go. Silly to make something that relies on paying money to run it.

Finetuning the model around new data would require learning the new data format, but the work of backpropagating information into the model weights would be reusable, and I'm planning to do that anyway. But maybe it would make sense to just finetune the model around something small, such as the heuristic idea. I'm thinking of spending some time just looking into things. Unsure.

What seems exciting is maybe just finetuning the model around human data :/ . There's a lot of manual transcription available for this recording already, and it is easy to generate by hand. I'm thinking this approach could be generalised so as to require only a little example data, and learn the rest. The pattern space is small compared to the capacity of the model, since the output is limited and the speech is repetitive. ("pattern space" is a phrase i just made up for the space of relations between the data and its intended meaning)

This is hard for me. It's hard to think about nonnormative transfer learning. It's hard! But I just posted about it to the list. What did I post?
uhhhhhhhhhh okay my idea was to have two different structures that formed confidence metrics around the data, then to combine the confidence metrics so as to improve the structures, in feedback. would this work with limited labels? the model generates output, possibly with confidence metrics. the labels are few and correct, and provided in an order over time. something's missing here. there's only one algorithm. (part of the issue is i don't remember my idea.)

for inclusion, here is a very basic idea i haven't heard mentioned much: the model could treat each label as a new train/test set, and keep training on the old labels until it improved at the next labels. this will only work if done a certain way; it doesn't work as an overall strategy because the model isn't good at 'going back' and 'undoing' things it learned wrongly. the 'going back' and 'undoing' is often addressed by having a lot of diverse data. lots and lots, as diverse as possible. if we could consider the properties the model is learning a little, we could maybe emulate something similar on a small scale. for example, just a single word from each set of labels could be considered, and the model would treat surrounding words as test material. this aligns things a little better. mm.

anyway, it seems like it would help to learn to finetune one of these speech-to-text models around data a little. maybe i can see if i can pull that off, just as an experiment.
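The "treat each label as a new train/test set" idea above might look roughly like the sketch below. Everything named here (labeled_chunks, train_step, evaluate) is a hypothetical placeholder; this just pins the loop down, it isn't a working recipe.

# rough sketch: train on the labels seen so far until the next, not-yet-used
# label improves (or a patience budget runs out), then absorb it and move on
def incremental_label_training(model, labeled_chunks, train_step, evaluate, patience=100):
    seen = [labeled_chunks[0]]
    for next_chunk in labeled_chunks[1:]:
        baseline = evaluate(model, next_chunk)   # model hasn't trained on this label yet
        for _ in range(patience):
            for chunk in seen:
                train_step(model, chunk)         # keep training on the old labels ...
            if evaluate(model, next_chunk) < baseline:
                break                            # ... until the next label improves
        seen.append(next_chunk)
    return model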
import inspect
print(inspect.getsource(s2t.model.generate))

It looks like this speech-to-text model works just like the text-based language modeling that got big around openai: it calculates each next token, one after the other. This basically means that it shouldn't be hard to convert it to arbitrary lengths simply by writing a new generation loop and shifting off old input. It also means that it will likely output a confidence for every possible token it could output. So if the right tokens are represented, we could grow the parts of the model that produce them, by backpropagating a loss that represents this.

This model isn't great for the overall approach because its tokenizer is focused on conversational speech rather than symbols and keywords from source code. That's mostly what you're going to find nowadays: models focused on conversational speech.
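As a rough sketch of the arbitrary-length idea (hedged: this just slides a fixed-size window over the audio and restarts generation per chunk, rather than actually streaming inside generate(); it reuses the Data and S2T objects defined earlier, and words cut at chunk boundaries will come out garbled):

def transcribe_long(data, s2t, chunksize=80 * 600):
    chunk_seconds = chunksize / data.sr
    texts = []
    offset = 0.0
    while offset < data.length:
        chunk = data.read_one(offset, chunksize)            # 1-d array of samples
        features, attention_mask = s2t.tokenize(chunk)
        ids = s2t.forward(features, attention_mask)
        texts.extend(s2t.detokenize(ids))
        offset += chunk_seconds                              # no overlap between windows
    return ' '.join(texts)

# usage, with the objects constructed above:
# print(transcribe_long(data, s2t))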
So the place where a little bit of hand-curated data might go here, at least at first, would be the tokenizer. Then we can see if we can use high-confidence logo code areas to update low-confidence areas. We could plan to pass the confidence through other things, like a logo parser and a heuristic, to improve its accuracy. We need a detokenizer that can produce logo code. It's likely also helpful to unroll the generation code a little, so as to see the information that relates to confidence, and to access more of the places where loss is calculated and backpropagation performed when training.
Regarding generation code, I know that long generate function can look daunting, but it is almost entirely boilerplate. It calls other boilerplate functions that eventually call the forward pass of the model in a loop. Basic generation is called "greedy": it generates the most likely token at every step. The non-greedy options are mostly ways of producing more diverse output, or occasionally output with properties other than having a likely next token.
Here's the snippet around greedy generation:

        # 10. run greedy search
        return self.greedy_search(
            input_ids,
            logits_processor=logits_processor,
            stopping_criteria=stopping_criteria,
            pad_token_id=pad_token_id,
            eos_token_id=eos_token_id,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            synced_gpus=synced_gpus,
            **model_kwargs,
        )

Then:

print(inspect.getsource(s2t.model.greedy_search))

This chunk is inside a while True loop, and is what's of interest:

            # forward pass to get next token
            outputs = self(
                **model_inputs,
                return_dict=True,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
            )
            ...
            next_token_logits = outputs.logits[:, -1, :]

This plane of "logits" is the log probabilities the model is guessing as coming next, for every token in its vocabulary.
[to be accurate here, from wikipedia article on logit, "the logit is also called the log-odds since it is equal to the logarithm of the odds p / (1 - p)". the logits are usually converted to probabilities using the softmax function or something like that.]
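For a concrete picture, softmax turns a row of next-token logits into probabilities. A minimal sketch with a toy three-token vocabulary, shapes following the greedy_search snippet above:

import torch

next_token_logits = torch.tensor([[2.0, 0.5, -1.0]])   # (batch, vocab_size)
probs = torch.softmax(next_token_logits, dim=-1)        # each row now sums to 1
confidence, token_id = probs.max(dim=-1)                # the greedy pick and its probability
print(probs, token_id.item(), confidence.item())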
Ok, so let's check out this detokenizer. I don't know how speech-to-text models convert a long stream of samples into a sequence of tokens, and I suspect they do something to avoid the concept of including feedback around word boundaries. It seems to me they go to pains to avoid putting feedback inside their architectures, but I could be wrong.
Well my body is writhing all around against my will and stuff. I don't think I'll be completing this challenge today but maybe I can shrink the problem to something interesting or satisfying.
Okay. The tokenizer on the input side downsamples by the feature size using a math thing. One could summarise it as a network layer. The tokenizer on the output side is a normal tokenizer, I think. Hard to look at. These are usually just lookup tables of words, with a little extra added. I almost bailed. I think I might be able to continue a little if I shrink it very small. Maybe in a different spam thread.
---- I think I can roughly summarise that if the output tokenizer is trainable, it's not going to be a big return to learn it, since other model parts would train a different way. I'm now interested in brainstorming another approach: something to replace the tokenizer. We could use the model's existing output to train the other approach, and then finetune it around the desired result. Time to move to the other thread. Overstayed this one a little. ----
I'm basically thinking of this: I'd train a tokenizer around the expected data (small sentences and a ton of logo code), replace the old tokenizer with the new, and then finetune the model until the whole system had the same behavior. I'd like to include more goals in the approach, but it's so satisfying to come through the difficulty, that it might be fine with just that one for now.
encountered this:

---> 12 print(inspect.getsource(tokenizer.train))

TypeError: module, class, method, function, traceback, frame, or code object was expected, got builtin_function_or_method

first idea is to look at it in an interactive session, to learn more about what is happening and why. an unaddressed worry relates to plans around the task. the system is responding slowly, with unexpected fan activity; unsure what's up.
--------------- so, the human got too big, because it still thinks [common-assumption-about-capacity]. WE ARE A GOAL-PART!
Our goal-part is strong. The waters come to wash away [goal-reference], and we are a tall rock in a sea that can never rise this high.
Goal related to sentencepiece training. Some of us recommend not using huggingface. Karl wanted to open a python repl.
A wave crashed. As soon as the repl-opening began, "the system ran out of disk space", which closed the terminal. We are strong.
I'm thinking a perceiver decoder would work better here than a tokenizer. Then it could produce sequences of characters that aren't in the example data.

Anyway, I trained the detokenizer on the file. Below is the current content, but it doesn't use the detokenizer yet. Next, maybe, is to try finetuning the model to use the detokenizer. This will run into issues because the detokenizer doesn't represent most words in whatever data I use for finetuning. It's nice to get this experience using a mainstream software process: finetuning a transformer model.

!wget -c https://xkcd.com/2601/radio.mp3
!wget -c https://raw.githubusercontent.com/theinternetftw/xkcd2601/main/xkcd.lgo
!pip3 install transformers[speech,sentencepiece] datasets librosa soundfile

print('importing libraries ...')
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration, Wav2Vec2Tokenizer, Wav2Vec2ForCTC
import librosa as lb
import numpy as np
import inspect
import os
import sentencepiece as spm

class CustomTokenizer:
    def __init__(self, datafilename, vocab_size):
        self.fn = datafilename
        self.vocab_size = vocab_size
    def load(self):
        modelpfx = f'{self.fn}.{self.vocab_size}.model'
        modelfn = f'{modelpfx}.model'
        if not os.path.exists(modelfn):
            # train a sentencepiece model on the file, read in fixed-size chunks
            def data(chunksize):
                with open(self.fn, 'rt') as datafile:
                    while True:
                        chunk = datafile.read(chunksize)
                        if chunk:
                            yield chunk
                        if len(chunk) < chunksize:
                            break
            spm.SentencePieceTrainer.train(sentence_iterator=data(1024), model_prefix=modelpfx, vocab_size=self.vocab_size)
        self.model = spm.SentencePieceProcessor(model_file=modelfn)
    def tokenize(self, inputs):
        return self.model.encode(inputs)
    def detokenize(self, ids):
        return self.model.decode(ids)

class Data:
    def __init__(self, src = 'radio.mp3', chunksize = 80 * 6000, sr = 16_000, dtype = np.float32):
        self.src = src
        self.chunksize = chunksize
        self.sr = sr
        self.length = lb.get_duration(filename = self.src)
        self.dtype = dtype
    def read_one(self, offset, chunksize = None):
        if chunksize is None:
            chunksize = self.chunksize
        duration = chunksize / self.sr
        print(f'reading {duration}s at {offset}s ...')
        data, sr = lb.load(self.src, sr = self.sr, offset = offset, duration = duration, dtype = self.dtype)
        print(f'read {data.shape} samples at {sr}')
        return data
    def read_random(self, ct=1):
        # pick random offsets that leave room for one chunk at the end
        return np.stack([self.read_one(np.random.random() * (self.length - self.chunksize / self.sr)) for idx in range(ct)])
    def read_chunks(self, ct=1, offset=0):
        chunksize = self.chunksize
        data = self.read_one(offset, chunksize * ct)
        return data.reshape((ct, chunksize))

class S2T:
    def __init__(self, model = "facebook/s2t-small-librispeech-asr", sr = 16_000):
        self.sr = sr
        self.model = Speech2TextForConditionalGeneration.from_pretrained(model)
        self.processor = Speech2TextProcessor.from_pretrained(model)
    @property
    def vocab_size(self):
        return self.model.config.vocab_size
    def tokenize(self, inputs):
        print('tokenizing ...')
        input_ids = self.processor(inputs, sampling_rate=self.sr, return_tensors='pt')
        return input_ids['input_features'], input_ids['attention_mask']
    def forward(self, feature_ids, attention_mask):
        print('passing data thru model ...')
        return self.model.generate(inputs=feature_ids, attention_mask=attention_mask)
    def detokenize(self, generated_ids):
        print('detokenizing output ...')
        return self.processor.batch_decode(generated_ids)

print('constructing structures...')
data = Data()
s2t = S2T()
detokenizer = CustomTokenizer('xkcd.lgo', vocab_size=1100)#s2t.vocab_size)
detokenizer.load()

feature_ids, attention_mask = s2t.tokenize(data.read_chunks(1)[0])
generated_ids = s2t.forward(feature_ids, attention_mask)
outputs = s2t.detokenize(generated_ids)
print(outputs)
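As a quick sanity check on the trained sentencepiece model, one could round-trip a line of logo-like text through it (a minimal sketch reusing the CustomTokenizer above; the sample line is just an illustration, not a quote from xkcd.lgo):

detok = CustomTokenizer('xkcd.lgo', vocab_size=1100)
detok.load()

line = 'fd 100 rt 90 fd 100'
ids = detok.tokenize(line)        # list of sentencepiece ids
print(ids)
print(detok.detokenize(ids))      # should reproduce the line, up to whitespace handling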
current state. no finetuning yet. instantiates two models: https://colab.research.google.com/gist/xloem/2218ac66d101f7848a5a8739e3ff290...
wow it's 4pm now. i started before dawn. i think i spent most of the day struggling with my ability to focus on a small goal, could be wrong. latest gist is https://colab.research.google.com/gist/xloem/4310a26b6c9d13adac14307b948157d... . all i did was expand the generate function. it should be quite reasonable to train vs something by comparing the output logits. the experience working with this speech2text model was good. i think it wouldn't be hard to convert it to streaming, which is certainly more learning than nothing. that idea of having something autodiscern data is important. this is an interesting space where that's similar to, but different from, unsupervised training on limited data.
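To make "train vs something by comparing the output logits" concrete, one training step might look roughly like the sketch below. Hedged: it assumes `target_ids` (token ids in the model's existing vocabulary, e.g. from hand transcription run through s2t.processor) are available from somewhere, and it calls the model directly rather than through the expanded generate loop in the gist.

import torch

optimizer = torch.optim.Adam(s2t.model.parameters(), lr=1e-5)

def finetune_step(feature_ids, attention_mask, target_ids):
    s2t.model.train()
    outputs = s2t.model(input_features=feature_ids,
                        attention_mask=attention_mask,
                        labels=target_ids)
    # passing labels makes the model compute cross entropy over its output logits
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()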