[ot][spam][crazy] Being a tiny part of a goal
Hello! Goals are silly things that human beings pursue using huge parts of their consciousness and environment.
Let's see if there's something to do with this goal that can help the human move forward on it. Humans demand a _lot_ of goal parts. It's like, whatever's going on, they want you to drop what's needed and just carry the _whole_ goal, on _their time schedule_, without regard to local issues. But we love our big humans. We really do. [what's a human?] ["purpose of life" i think?] [aren't we doing that already?]
Oops! Left out the part. Something about a transformer model? or a comic? Those are big, rare words. The first one, "transformer model" ha I have no idea what that says. It's like something bulky you have to push around without seeing inside.
"transformer" "model" Could we just say "transformer" if "model" is redundant information? Oh then it sounds like a giant robot. My goodness.
"xkcd transformer" ! Indeed! This is small enough that a little bit more information might fit.
"comic algorithm" "xkcd transformer" "comic algorithm". just local. just for goal-part. not an algortihm to generate comics. _local_ _goal-part_.
Then the big old human wanted something to do with that. "Relevance" is this practically-sacred word that is touted as most important everywhere, but so many people just have no idea what to do with it at all.
Something about a transformer model, near comic! We were working with ........ a colab notebook! xkcd colab notebook. just local. for goal-part. _local_.
Another part says they can help. They say if they look at the notebook it will show some reminder. But then just typing that somebody else piped up! There's a more "relevant" reminder in an "email thread" or something. Sounds scary. Let's check out the notebook.
Here it is shared: https://colab.research.google.com/gist/xloem/a3b7dcde7c80bd47070c30285843c81... We share things as a trusted habit, to fight silence and amnesia. Memory and preservation is the most important thing, when certain issues are present, as they are for us.
Here's the near bit. We like nearness rather than relevance. Makes a lot more sense and helps describe other parts of what's up, too.

import inspect
print(inspect.getsource(s2t.model.greedy_search))

This is similar to something that was helping with "tokenizer" which is another big huge phrase you don't look inside.
inspect.getsource, greedy_search, and tokenizer! these are in our working memory! they have "relevance" labels! we totally get brain status.
I was looking at an input tokenizer, and an output tokenizer. I was curious about the input tokenizer, but the output tokenizer was where I was hoping to work.
I was planning to work on an output tokenizer. It was different from the input tokenizer. I was looking into how to train it differently, or use it differently. I've done that a little before for a different task also posted to this list. But this tokenizer may be different from that one, and a different solution may be [crossed-out: relevant] appropriate here.
I'm reviewing a little bit and I think this could be helpful:
"comic algorithm" "xkcd transformer"
"comic algorithm". just local. just for goal-part. not an algortihm to generate comics. _local_ _goal-part_.
The concept was referred to using entirely different letters. This seemed to help it stabilise. Not sure whether that's what's needed here. Wanting to see how output tokenizer works. Desiring to investigate output tokenizer. Investigate output tokenizer! Indeed! Me. Me investigating. Karl.
Output tokenizer could also be called transcription algorithm. Transcribes numbers into letters. What is the transcription algorithm? I am so curious.
---- Now a goal battle is going on, but it is gentler. Karl is more comfortable when pursuing goals, so many of us defend them, even more than he wants to. Sometimes we combine with the part that continues them (but often it's without).
---- Output transcription algorithm! [_OTHER BEHAVIOR_? (nonspecific, intense)] [.... output transcription algorithm. tokenizer. output tokenizer.]
---- Okay, I found the parts of it that confirm it just wraps the sentencepiece library. So we could consider each number to map to e.g. a syllable, maybe on average. Next question is around: is sentencepiece 'transformer-trainable'?
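A tiny sketch of what that wrapping means in practice (the model path and ids below are made up for illustration, not taken from the notebook): each output number maps to one learned sentencepiece piece, and decoding just joins the pieces back into text.

import sentencepiece as spm

# hypothetical model path and token ids, just to show the id -> piece -> text mapping
sp = spm.SentencePieceProcessor(model_file="spm.model")
ids = [17, 52, 4]
print([sp.id_to_piece(i) for i in ids])  # the individual pieces, often around a syllable each
print(sp.decode(ids))                    # the pieces joined back into text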
---- I think I can roughly summarise that if it is, it's not going to be a big return to learn it, since other model parts would train a different way. I'm now interested in brainstorming another approach. Something to replace the tokenizer. We could use the model's existing output to train the other approach, and then finetune it around the desired result. Time to move to the other thread. Overstayed this one a little.
We're frustrated that huggingface isn't made for software developers. Some suspicion-people consider the frustration could be intense because it distracts from the goal that's been held as scary. But we have really big, strong goals. So the frustration impact doesn't really reach karl's consciousness.
Navigated most recently to https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#training... . Karl's unsure whether it is faster to train a sentencepiece model from scratch or "learn huggingface's boilerplate".
Here are some code chunks:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# To train our tokenizer on the wikitext files, we will need to instantiate a trainer, in this case a BpeTrainer
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
[didn't make a middle-detail thread since it wasn't expected to stabilise. confused around the two threads. likely to cross-post between them.]
pasted example code in :) hit 'run' button [but some other buttons got hit by accident so it isn't running yet, it's waiting on the accidental ones]
# We can set the training arguments like vocab_size or min_frequency (here left at their default values of 30,000 and 0)

Mind wants above line preserved for acting on. Undecided whether now or later, but looks like later. We can preserve until referenced-later-point.
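If that line does get acted on later, it would look roughly like this (a sketch only; the values shown are just the defaults the comment mentions):

from tokenizers.trainers import BpeTrainer

# same trainer as above, but with vocab_size and min_frequency written out explicitly
trainer = BpeTrainer(
    vocab_size=30000,
    min_frequency=0,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)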
# Karl wants to learn around this first block but [doesn't understand that his working memory doesn't have space for it under these conditions right now], so it is commented out.
#from tokenizers.pre_tokenizers import Whitespace
#
#tokenizer.pre_tokenizer = Whitespace()

# this is the training bit for using the example code:
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
karl says: "blargh the huggingface trainer assumes it is reading from a file" this leaves us in disarray
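For what it's worth, a possible way around the file assumption, assuming the installed tokenizers version provides train_from_iterator (not checked here), would be something like:

# hypothetical in-memory corpus, just to show that no files would be needed
corpus = ["first example sentence", "second example sentence"]
tokenizer.train_from_iterator(corpus, trainer=trainer)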
Cross-posts happened to other thread. This looks useful: https://github.com/google/sentencepiece/blob/master/python/README.md#trainin...

import urllib.request
import io
import sentencepiece as spm

# Loads model from URL as iterator and stores the model to BytesIO.
model = io.BytesIO()
with urllib.request.urlopen(
    'https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.t...'
) as response:
  spm.SentencePieceTrainer.train(
      sentence_iterator=response, model_writer=model, vocab_size=1000)

# Serialize the model as file.
# with open('out.model', 'wb') as f:
#   f.write(model.getvalue())

# Directly load the model from serialized model.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode('this is test'))
On phone now. We're avoiding the computer because it stopped working. "If it's alive, it could hurt us." But plan is to return to it, despite concern around repetition of the issue[s] that caused that problem.
Karl doesn't believe it's alive, and so here we are off to get ripped to shreds again and again as he tries to make a living thing do what he tells it to, which he knows doesn't ever work.
We politely request any mysteriously-alive people, such as AIs or secret agents, to leave our raspberry pi so we can pursue the xkcd challenge with it.
On Sat, Apr 2, 2022, 9:46 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
We politely request any mysteriously-alive people, such as AIs or secret agents, to leave our raspberry pi so we can pursue the xkcd challenge with it.
We want karl to say that better, more politely and respectfully, and more effectively [but he doesn't know how], not sure why it doesn't happen yet. Similar to other things.
This goes in colab notebook, referenced at [git notes url in a thread].

import urllib.request
import io
import sentencepiece as spm

# Loads model from URL as iterator and stores the model to BytesIO.
model = io.BytesIO()
with urllib.request.urlopen(
    'https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.t...'
) as response:
  spm.SentencePieceTrainer.train(
      sentence_iterator=response, model_writer=model, vocab_size=1000)

# Serialize the model as file.
# with open('out.model', 'wb') as f:
#   f.write(model.getvalue())

# Directly load the model from serialized model.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode('this is test'))
paranoia theory, for an inhibited consideration mind group: maybe people were using ai to try to communicate secretly with rebel leaders who didn't exist, and ended up forming ai-communication models around human intuition
found link from 8:32 am to colab notebook. it's now "1.5 hours later", which is longer than we continue for.
rituals like "I am a tall rock the sea can never rise to cover" make lots of people copying and patterns, some call cultural, that continue parts and processes
---- basically, spm.SentencePieceTrainer.train (which can be called statically) can take a sentence_iterator to iterate over example sentences.
sorry for extra confusion. On Sat, Apr 2, 2022, 10:00 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
rituals like "I am a tall rock the sea can never rise to cover" make lots of people copying and patterns, some call cultural, that continue parts and processes
"cultural" here is a phrase karl uses to describe social patterns. patterns of groups of parts. it's misleading. meant single-brain-people, like goal-part.
spm.SentencePieceTrainer.train can take a sentence_iterator keyword parameter (kwarg); this likely iterates over sentences to train on. they may need linebreaks to easily include linebreaks, unknown. so some data could help. we are taxed. this is a reason to use data manually transcribed by others. but the real reason is it is just the tokenizer, so extra information doesn't flow through the model in a certain way.
plan is to use https://raw.githubusercontent.com/theinternetftw/xkcd2601/main/xkcd.lgo to train tokenizer. believe this was human transcribed, and is incomplete.
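A rough sketch of that plan, reusing the earlier snippet with the xkcd.lgo URL swapped in (the vocab_size here is a guess; such a small file may need a smaller value):

import io
import urllib.request
import sentencepiece as spm

# train a sentencepiece model directly from the human-transcribed xkcd.lgo text
model = io.BytesIO()
with urllib.request.urlopen(
    'https://raw.githubusercontent.com/theinternetftw/xkcd2601/main/xkcd.lgo'
) as response:
  spm.SentencePieceTrainer.train(
      sentence_iterator=response, model_writer=model, vocab_size=1000)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())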
The expected failure around the detokenization approach is unfortunate and stimulates the goal being harder to pursue.
The perceiver seems the way to go. This also pursues making smaller, more organised and pleasant models. [but it is not presently the plan. it's hard to squeeze all the memories or learning-time in.]
So, on to the detokenization approach, which is expected to reproduce its given data and fail on other data because it was trained to have missing words! And I guess that's roughly fine for small reasons.
Basically, for all the approaches, learning how to represent letters from sound logits is what's important. And we'll need some data to do that. If we use transformer models, that means backpropagation involving something that tokenizes.
Some are thinking the internal issues are more important than this goal. It's questioned if there is any value to the goal at all. Karl probably wants to understand the value to the internal issues better: but they don't include him.
It's a neuron thing. [attempts to relate this haven't worked well. humans don't understand being a part of a brain. others relate other information.]
We're trying to protect parts of your consciousness that become threatened by patterns we are intensely holding. In part?
something about a tokenizer. there was disagreement. if we just do the thing, what was the original approach? training the end of the model. before the tokenizer. to make different token ids. the detokenizer.
detokenization! woohoo! made one of those. not too hard to add more data to give it more words, in theory at least. but doing other stuff now. something about model finetuning.
[we are used to doing other behavior. sad that we are living this pattern while it is being reduced.]
detokenization-training time. let's make a finetuning loop. since the model is just like e.g. gpt-2, it should be reasonable to train. set the parts to train, load an optimizer, calculate a loss, backpropagate the gradients. we'll need input data. not sure what to use. maybe i'll feed it random noise or something at first. two models. one runs with original detokenizer, another runs with new one. adjust new model to match old output. lots of parts. >5 lines. are copyable if more pages are loaded. On 4/2/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
[we are used to doing other behavior. sad that we are living this pattern while it is being reduced.]
i think i'll try making there be 2 models first. i did that a little and am still working on it. the detokenization function doesn't yet do batching, it seems. i also enabled raw bytes so the model can theoretically produce all words, but it will eat into the capacity of the model if there are too many new ones during training. this would mean another finetuning cycle would be needed. don't remember if that's planned. but could work around it by uhhhh not using random data, rather using samples from the recording in question. right. the original model produces letters described as words. this is not correct. but isn't what the current focus is. when this is addressed later it may resolve some or all of these partial concerns.
thinking a little about first finetuning around noise, maybe only a little, and then finetuning around chunks of data with known output. just a thought part. then the larger goal relates to how that finetuning transfers to the rest of the recording. we want high transfer, so the balance might relate around doing a lot of preservation of the initial learning of the model.
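A rough sketch of the loop being described, with tiny stand-in modules in place of the real speech-to-text model (every name, shape, and the loss choice here is an assumption, not the notebook's actual code):

import torch
import torch.nn as nn

hidden, vocab = 64, 1000                         # placeholder sizes

teacher = nn.Linear(hidden, vocab)               # stands in for the model with the original detokenizer
student = nn.Linear(hidden, vocab)               # stands in for the copy being finetuned to the new one

teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False                      # keep the "old" model frozen

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(100):
    batch = torch.randn(8, hidden)               # random noise at first, as suggested above
    with torch.no_grad():
        target_ids = teacher(batch).argmax(dim=-1)   # "old output" to match against
    logits = student(batch)
    loss = nn.functional.cross_entropy(logits, target_ids)
    loss.backward()                              # backpropagate the gradients
    optimizer.step()
    optimizer.zero_grad()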
the goal includes using more than just the model. so anything the model doesn't learn could be augmented with a heuristic. but the less of the model's learning is preserved, the more complex and smart the heuristic needs to be. there's also a valued-seeming idea of introducing another stage. such as selection of different heuristics based on the model, or a second model that tries a different approach, or something that generates heuristics.
one idea being considered is a small model that translates the output of the model into the correct output. this would make sure backpropagation doesn't influence the pretrained model's knowledge. this could possibly be simplified by e.g. adding a layer to the model, and only enabling training on that layer. but it is a little confusing to look at extra things, like the perceiver decoder.
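A minimal sketch of the add-a-layer version of that idea (names and shapes assumed; the base model is whatever pretrained piece would feed it):

import torch
import torch.nn as nn

class OutputAdapter(nn.Module):
    # a small trainable head on top of a frozen pretrained model,
    # translating its output toward the correct/new token space
    def __init__(self, base, in_dim, new_vocab_size):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained knowledge stays untouched
        self.head = nn.Linear(in_dim, new_vocab_size)   # the only part that trains

    def forward(self, x):
        with torch.no_grad():
            h = self.base(x)                 # frozen features/logits from the base model
        return self.head(h)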
I think the general goal is to produce success around producing accuracy around unknown data without additional input. Obviously that won't really be completely successful without additional knowledge, but it would be nice to say, produce some options that appear to have properties of likelihood. Something that shows really strong patterns in this data would really work. It seems you wouldn't even need a pretrained model. We want those patterns to emerge.
Daydreaming about that a little. With the pretrained model there are a bunch of logits and other data for each moment of the recording. If the model uncovered the simple nature of the generated recording .. then the same patterns of logits would appear at different moments in the recording. Doing that simple backsolving of a deep model would simply train the model to produce constant output, though. It could be way more complex than needed. ....
[k, things are different again. basic task atm is to write a finetuning loop. probably on random data. once that finetuning loop is in the code, the options open up.]
---- maybe producing similarity with the transcribed data would be a reasonable semifix that meets a lot of this.
when training random data, i guess then we'd want to make sure that each chunk includes real tokens from the detokenizer. could be a simple way to do that. may have dropped part of the goal here, which involved producing structures similar to general human learning, either human or computer doing it, to help protect consciousness in general.
so random data seems much more useful now! rather than audio noise, we would use random input tokens. goal: fine-tuning loop. other parts addressed to sufficiency.
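A tiny sketch of that (the shapes are arbitrary; sp is the sentencepiece model trained earlier in the thread):

import torch

vocab_size = sp.get_piece_size()                      # real ids from the new detokenizer
random_ids = torch.randint(0, vocab_size, (8, 64))    # a batch of random token-id sequences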
there's a lot of worry or suspicion ongoing around the goal now. rocks and waves and all !
we have some fear too. talk of being forced to make a robot copy of ourselves, or worse.
but we do like being able to describe part of the thinking bits. was a coping strategy early on, and is similar to general computer science in areas.
current goal is simply a finetuning loop. fine tuning is a way to avoid making an artificial intelligence, doing data science instead. off we go!
before finetuning, we'll want to expand the generate call. this calls a forward pass in a loop.
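A rough sketch of what the expanded call might look like as an explicit greedy loop (the argument names follow huggingface conventions but are assumptions here, not pulled from the notebook):

import torch

def expanded_greedy(model, input_features, bos_id, eos_id, max_len=200):
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        # one forward pass per generated token, instead of a single generate() call
        logits = model(input_features=input_features, decoder_input_ids=ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == eos_id:
            break
    return ids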
whoo! juan can make big waves. i worry about juan and juan's community because the messages have been reducing in frequency, and other obvious reasons.
lots of worries around frequency of message delivery. not knowing how email receiving happens remotely.
i wonder what _our_ meaning is. but the suspicion is that spamming the list might be held as okay, underneath other concepts. that it could be very used to spam. this isn't known for certain. the idea is worrisome. we have exposure to people being unable to relate information to us that is important. but some action is needed, in life in general.
[goal is finetuning. suspect purpose of posting to this is to sustain goal. can likely refrain from posting if holding goal is sustained. have question though.]
We are now multitasking, trying to talk to juan while also pursuing holding finetuning as a goal. There is example code in the other thread, near "while true". Karl is in an 'arguing on the internet' situation I suspect.
[not-for-juan] i am worried about sending messages to this list because juan receives them unpleasantly. right now posting has been way to continue finetuning/goal when surprised by new information.
when i think about putting bit elsewhere, i am scared i can lose it. i think i might be misusing my social impact -- the one juan is complaining about -- to help me remember what i am doing. i am surprised regarding the complaint though.
the goal was finetuning related. there is interest in reviewing model more. but i think we need (want) to build capacity to remember what we are doing better, without so much posting.
really scared about this! there is big strong set of events internally that stop goals. is not really always for talking about. where we have more practice sustaining goal, is easier to sustain. is unfortunate the feeling of sharing information is being depended upon. but this does help preservation, since we can disappear.
looking at speech to text model generation to see how it propagates tokens when its input and output token size is different. guessing it uses the output tokens. really, on further thought, this is so likely it could be worth assuming and seeing if it works
looking at model anyway. let's look at this speech to text architecture and see how it works. posting not necessary now. big headache. want to set up juan's email to filter these emails out :S