[ot][spam][crazy]
I made yet another partial AI seed, and I was typing up both an itemized simplified description and a detailed one involving personal theory and code, to help preservation, but a finger spasm lost the whole typing again :/ The code is a mess right now and I'm having psychotic-like spasms associated with it, similar to normal self-modifying-AI inhibition. Maybe I'll spam parts of it in this thread. I liked the organized write-up, though. The whole goal was to do it really simplistically: stay really similar to things that already exist, make very few changes, move the theories into spaces that were simple and familiar, etc.

Long story short, you can make a transformer model generate an improved transformer model of the same or larger size if you do something like the following (a small sketch of the data preparation follows this list):

- shrink the model so it handles only a tiny bit of output at once (I used the largest 1-D slice length of the largest parameter)
- make the input context huge so you can feed it a whole model
- train on only 1 data pair at a time
- make a separate model (or separate layers) for each output model parameter so they can specialize
- feed the output back into the input as needed so it can save its work and move on

The biggest theoretical component was the idea that an arbitrarily small agent can use a random-access infinite scratchpad and a supply of high-level functions to accomplish an arbitrarily complex task (a Turing machine). The input context is the scratchpad. [Interesting theory that you could maybe turn it into procedural code by reinterpreting it that way.]
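To make the list concrete, here is a minimal sketch in plain PyTorch of flattening a whole model into the float sequence that becomes the scratchpad context, and of picking the per-pass chunk size. The helper names and the interpretation of "largest 1-D slice length" are my assumptions, not code from the seed:

import torch

def model_to_sequence(model: torch.nn.Module) -> torch.Tensor:
    # Flatten every parameter into one long 1-D float sequence so a whole
    # model fits into the metamodel's input context (the scratchpad).
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def largest_slice_length(model: torch.nn.Module) -> int:
    # One reading of "the largest 1-D slice length of the largest parameter":
    # the longest single dimension across parameters, e.g. one row or column
    # of the biggest weight matrix, used as the amount handled per pass.
    return max(max(p.shape) for p in model.parameters() if p.dim() > 0)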
I used HuggingFace's LlamaModel, which is just the Llama architecture. I ignored the model's embedding map and passed my own embeddings, which I generated with a trainable linear module from the input model weights and data. Similarly, I used a trainable linear layer on the output to generate only 1 float per pass, and used it in a causal manner (you can train on entire sequences, and then infer 1 float at a time). I've trimmed the below code for conciseness, so it may have an inconsistency if I made a trimming mistake.

import os
import torch
import transformers

class make_one_transformer(torch.nn.Module):
    def __init__(self, name, input_size, output_size=1, complexity=None, load=True):
        super().__init__()
        self.name = name
        self.input_size = input_size
        self.output_size = output_size
        if complexity is None:
            complexity = max(output_size, input_size * 16)
        # ratios from the default config
        layers = max(complexity // 1024, 1)
        hidden_size = complexity // 8 // (2 * layers) * 2 * layers
        intermediate_size = int(complexity // 2.9)
        self.config = transformers.LlamaConfig(
            num_attention_heads=layers,
            num_hidden_layers=layers,
            num_key_value_heads=layers,
            vocab_size=output_size,
            max_position_embeddings=input_size,
            hidden_size=hidden_size,
            intermediate_size=intermediate_size,
        )
        self.model = transformers.LlamaModel(self.config)
        # trainable linear layer mapping each input float to an embedding,
        # replacing the vocabulary embedding map
        self.embeddings = torch.nn.Linear(in_features=1, out_features=self.config.hidden_size)
        # trainable linear layer producing output_size floats per position
        self.output_head = torch.nn.Linear(self.config.hidden_size, self.output_size, bias=False)
        if load and os.path.exists(f'{name}.pt'):
            state_dict = torch.load(f'{name}.pt')
            self.iteration = state_dict.pop('iteration')
            self.load_state_dict(state_dict)
        else:
            self.iteration = 0

    def forward(self, input):
        # possible linear layer to map input to hidden size
        inputs_embeds = self.embeddings(input[..., None])
        output = self.model(inputs_embeds=inputs_embeds).last_hidden_state
        return self.output_head(output)

    def generate(self, input, length):
        # not totally sure what past_key_values needs, but it looks like you
        # could pass it straight from the outputs and debug
        for idx in range(length):
            inputs_embeds = self.embeddings(input[..., None])
            logits = self.model(inputs_embeds=inputs_embeds).last_hidden_state
            # since we do have an output size, we'll want an lm_head
            output = self.output_head(logits)
            # append the newest predicted float so the next pass can attend to it
            input = torch.cat([input, output[..., -1, :]], dim=-1)
        return input[..., -length:]

# this model no lm_head !
# the above joke retained for humor was made before output_head was added
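For context, here is a hypothetical wiring of the pieces, assuming the class above is defined: flatten a toy source model into the input context, make one sister model per target parameter, and take a single training step on one (input, target) pair. The toy models, the slice-wise target, and the MSE loss are my assumptions; the post doesn't show its actual training loop.

import torch

source = torch.nn.Linear(4, 3)  # stand-in for the "input model"
target = torch.nn.Linear(4, 3)  # stand-in for the "improved model"
# the whole flattened source model becomes the input context (the scratchpad)
context = torch.cat([p.detach().reshape(-1) for p in source.parameters()])[None]

sisters = {}
for name, param in target.named_parameters():
    # each sister model specializes on one output parameter,
    # emitting one 1-D slice of it per pass
    slice_len = param.shape[-1] if param.dim() > 1 else param.numel()
    model = make_one_transformer(f'seed_{name}', input_size=context.shape[-1], load=False)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # train on only 1 data pair at once: predict the first slice of this parameter
    target_slice = param.detach().reshape(-1)[:slice_len]
    prediction = model(context)[0, -slice_len:, 0]  # one float per position
    loss = torch.nn.functional.mse_loss(prediction, target_slice)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    sisters[name] = model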
- I'm surprised I kept the causal mask here. I did make code for removing it. If you remove the causal mask, the model generates a useful output for every input position, as if it were run separately for each item in its context, which seems much more useful for outputting large data.
- I heard it's of interest to extend the context to billions or trillions of items (of course, if you make a general metamodel at a small size, it could generalize to larger sizes more flexibly on its own). To do this in limited RAM you would need a tighter attention pass. It sounds fun to see whether you can do the attention in multiple passes through the whole model by changing the attention kernel to only process the highest-impact tokens and cache their sum for reuse (a rough sketch follows this list). The concept might simplify to just including flash attention.
- I mostly don't remember the theories for now, other than the infinite-scratchpad one that I actually wrote in this thread. So this morning I've been imagining removing the sister models and instead using one model that outputs an entire other model, by removing the causal mask. I imagine I didn't do this when I made the thread because I wanted more avenues to ensure success, and I wanted it to stay more similar to existing things.
- I'm guessing that when you do this, the important things to do might include: talking effectively and inclusively about what to do; demonstrating the ability to solve problems to people interested in publicly and inclusively solving them; working on mental and emotional health issues, as those drive our decisions; and protecting things from harm without stimulating harm.
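Here is a rough, hypothetical sketch in plain PyTorch of the "only process the highest-impact tokens, in multiple passes over the keys" idea. The function name and the chunk/keep knobs are made up, it keeps the strongest keys and values rather than caching partial sums, and it only approximates full attention; it is not code from the seed:

import torch

def topk_chunked_attention(q, k, v, chunk=1024, keep=256):
    # q: [n_q, d]; k, v: [n_kv, d]. Scan the keys in chunks and retain only
    # the `keep` highest-scoring keys per query, so memory stays bounded even
    # for extremely long contexts.
    n_q, d = q.shape
    scale = d ** -0.5
    best_scores = q.new_full((n_q, 0), float('-inf'))
    best_values = v.new_zeros((n_q, 0, d))
    for start in range(0, k.shape[0], chunk):
        kc, vc = k[start:start + chunk], v[start:start + chunk]
        scores = (q @ kc.T) * scale                   # [n_q, c]
        values = vc.unsqueeze(0).expand(n_q, -1, -1)  # [n_q, c, d]
        best_scores = torch.cat([best_scores, scores], dim=1)
        best_values = torch.cat([best_values, values], dim=1)
        if best_scores.shape[1] > keep:
            best_scores, idx = best_scores.topk(keep, dim=1)
            best_values = best_values.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    weights = best_scores.softmax(dim=-1)             # softmax over surviving keys only
    return (weights.unsqueeze(-1) * best_values).sum(dim=1)

# e.g. topk_chunked_attention(torch.randn(8, 64), torch.randn(100_000, 64), torch.randn(100_000, 64))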