[ot][spam][crazy] log: append-only random-access data
often i try to implement an append-only random-access data structure. it's a fun puzzle for my confused mind. i never finish it, but today i thought about it a little, and i think over the years i might be able to put together a simplified form and maybe make it exist some day. here is what i wrote just now. i started spinning out when I renamed Simple.Write to Simple.RangedData. i was in the middle of implementing a leaf iterator, so as to make a simple way to count the number of data leaves and associate each with a depth. the approach, as written in the file, is to simply relink leaves with excessive depth each flush, so as to maintain O(log n) random seek time. i vaguely recall that approach has a major issue, but i might be remembering wrongly, and it's still quite exciting to be able to spend some minutes trying.
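a minimal sketch of that flush-time relink, just to pin the intent down; the names here (leaves as (depth, offset, data) triples, new_index as a list) are hypothetical, not from the file:

def relink_deep_leaves(leaves, new_index):
    # each flush, reference any leaf deeper than the O(log n) bound
    # directly from the new index, so later seeks stay logarithmic
    leaves = list(leaves)                 # (depth, offset, data) triples
    max_depth = len(leaves).bit_length()  # allowed depth bound
    for depth, offset, data in leaves:
        if depth > max_depth:
            new_index.append((offset, data))
    return new_index

assert relink_deep_leaves([(5, 0, b'x'), (1, 8, b'y')], []) == [(0, b'x')]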
On 7/7/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
often i try to implement an append-only random-access data structure.
clarification: the intention of the structure under design is to be stored on append-only media, but support random writes as well as random reads. similar to tape-based file systems, which unfortunately tend to be very tape-specific.
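a minimal sketch of that storage model, assuming nothing from the attached code: every random write is appended as an (offset, data) record, and reads replay the records, so the medium only ever grows:

class AppendOnlyStore:
    # toy model: the backing log only ever grows, like tape,
    # but offset-addressed writes and reads still work
    def __init__(self):
        self.log = []  # list of (offset, bytes) records, append-only

    def write(self, offset, data):
        self.log.append((offset, bytes(data)))

    def read(self, offset, length):
        out = bytearray(length)  # unwritten bytes read as zeros
        for rec_offset, rec_data in self.log:  # oldest to newest
            lo = max(offset, rec_offset)
            hi = min(offset + length, rec_offset + len(rec_data))
            if lo < hi:
                out[lo - offset:hi - offset] = rec_data[lo - rec_offset:hi - rec_offset]
        return bytes(out)

store = AppendOnlyStore()
store.write(0, b'hello world')
store.write(6, b'tapes')
assert store.read(0, 11) == b'hello tapes'

the linear replay in read() is exactly the part a tree index is meant to replace with an O(log n) lookup.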
here's what i have now. its intent is to test [the correctness] of iterating leaves when the structure is just appending, without using a tree. there is presently a logic error demonstrated when it is run. there may be many, many logic errors in it. i named the confusing class "Chunk". seems ok.
i am receiving a mail bomb. i don't know how many mail bombs i have received, or how i used to handle them when my mind was more together. i imagine it is normal to filter them out.
latest topical post below. code reattached with 1 change that does not fix the run error.

On 7/7/22, Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
here's what i have now
its intent is to test [the correctness] of iterating leaves when the structure is just appending, without using a tree.
there is presently a logic error demonstrated when it is run. there may be many, many logic errors in it.
i named the confusing class "Chunk". seems ok.
here is a version with a different crash. the previous one was because the leaf iterator did not handle sparse data. so i just filled the structure with 0s before starting.
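the alternative to zero-filling would be a gap-aware iterator. a sketch, assuming hypothetical (offset, data) leaves sorted by offset:

def leaves_with_gaps(leaves, total_length):
    # leaves: sorted (offset, data) pairs; yields zero-filled
    # leaves for any uncovered gaps instead of choking on them
    cursor = 0
    for offset, data in leaves:
        if offset > cursor:
            yield (cursor, bytes(offset - cursor))  # sparse gap as zeros
        yield (offset, data)
        cursor = offset + len(data)
    if cursor < total_length:
        yield (cursor, bytes(total_length - cursor))

assert list(leaves_with_gaps([(4, b'ab')], 8)) == [(0, b'\x00' * 4), (4, b'ab'), (6, b'\x00' * 2)]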
more logic errors engaged. the first pass works now. i'm thinking it does too many passes to actually make a graph, which is fine for now, since flushing after the first is failing.
i added debugging statements and fixed 1 bug, i think. now it seems it continues yielding leaves after it reaches the end.
# find leaf count and leaf depths
leaves = [*self.leaves()]
max_depth = len(leaves).bit_length()
current_entry = None
for leaf_depth, leaf_offset, leaf_data in leaves:
    # leaves that have depth that is too deep can be added to the index so as to find them quickly
    # or a parent of them could be added
    # when we find a leaf that is too deep, we can start an index entry
    # we can then continue moving on, and move the index entry to its parent to include adjacent leaves.
    # if we reach prev_flush as the only shared parent, then we write out the entry using the last too-deep item as the endpoint
    # instead, if we find another too-deep item, we make it the last, and continue on
    pass  # (loop body was still being written)
for leaf_path, leaf_offset, leaf_data in leaves:
    # copy over the previous index
    # with:
    # 1. for areas missing, reference the previous flush for data
    # 2. for sequences of more than one entry where max depth would not be reached, combine them into a reference to the previous flush
    #    this helps reduce the index size
    # 3. for individual index entries where the range covers a deep sub entry, copy in the deep sub entry rather than the index
    #    this helps provide for combining shallow entries in the next flush
    # 4. some concerns remain. they are addressable via similar approaches, such as copying in nested index parts to make the equivalent of a new internode pass
    pass  # (loop body was still being written)
i believe this was the last one that ran without failure. it turned the writes into a linked list: https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102232.html
stable version: https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102232.html
i find it cool how i found a solution here by placing asserts on practically every line. i'm thinking of how that would help the stability of things in general, and with some work maybe other logical projects.
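roughly the flavor of that style, as a made-up example rather than an excerpt from the attachment: every step states the invariant it relies on, so a logic error surfaces at the line that caused it.

def split_write(offset, data, chunk_size):
    # split a write into chunk-aligned pieces, asserting invariants densely
    assert offset >= 0
    assert chunk_size > 0
    chunks = []
    while data:
        take = min(chunk_size - offset % chunk_size, len(data))
        assert 1 <= take <= chunk_size
        chunks.append((offset, data[:take]))
        offset += take
        data = data[take:]
        assert offset % chunk_size == 0 or not data
    assert all(len(c) <= chunk_size for _, c in chunks)
    return chunks

assert split_write(3, b'abcdef', 4) == [(3, b'a'), (4, b'bcde'), (8, b'f')]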
On Sat, Jul 9, 2022, 8:56 AM Undiscussed Horrific Abuse, One Victim of Many <gmkarl@gmail.com> wrote:
stable version: https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102232.html
works now. I think there's a little redundancy in the use of the Chunk structure; not certain. it would make sense to add bisection to access spots in the sorted lists rather than linear iteration. the number of children per node is staying about equal to the tree depth, across 1024 flushes of writes 1-32 bytes long to a 4096 byte region.
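the bisection change could look like this, with hypothetical parallel lists of child start offsets and child nodes:

import bisect

def find_child(starts, nodes, offset):
    # starts: sorted child start offsets; nodes: parallel list of children.
    # binary-search for the last child starting at or before offset,
    # replacing a linear scan over the sorted list
    idx = bisect.bisect_right(starts, offset) - 1
    assert idx >= 0, 'offset precedes every child'
    return nodes[idx]

assert find_child([0, 16, 64], ['a', 'b', 'c'], 20) == 'b'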
last stable version at https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102291.html
although there can be further optimization, I think it makes sense to move a little toward using it. for me, the next step might be to change flushing so the index can be accumulated during writes or mini flushes, so as to be able to limit the total size of an index per flush if needed, without issue. it makes sense to clean/reduce/remove the notes and assertions to ease copying the algorithm to other languages etc.
last stable version at https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102291.html
next step might be to change flushing so the index can be accumulated during writes or mini flushes, so as to be able to limit the total size of an index per flush if needed, without issue. probably important to optimize away the approach of reiterating all leaves every flush; it would greatly limit write speed on older stores, and be much harder to comprehend when it is old.
last stable version at
https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102291.html
next step might be to change flushing so the index can be accumulated during writes or mini flushes, so as to be able to limit the total size of an index per flush if needed, without issue.
maybe this means retaining a working index as class data, and changing the approach for updating it so that it doesn't rely solely on expanding the last entry but rather updates the whole thing.
probably important to optimize away the approach of reiterating all leaves every flush; it would greatly limit write speed on older stores, and be much harder to comprehend when it is old.
basically this could mean adding state and/or methods to internodes so things like their path and depth don't need to be fully recalculated
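a sketch of that added internode state, with hypothetical fields; depth is set once on insertion rather than recalculated by walking the tree:

class Internode:
    # cache depth (and by extension path) on each node, maintained
    # incrementally at insert time instead of re-derived per flush
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1

    def add_child(self):
        child = Internode(parent=self)
        self.children.append(child)
        return child

root = Internode()
leaf = root.add_child().add_child()
assert leaf.depth == 2  # no recalculation needed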
last stable version at
https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102291.html
next step might be to change flushing so the index can be accumulated during writes or mini flushes, so as to be able to limit the total size of an index per flush if needed, without issue. maybe this means retaining a working index as class data, and changing the approach for updating it so that it doesn't rely solely on expanding the last entry but rather updates the whole thing.
I think this simplifies to just generating an index that includes the previous flush, and then removing parts replaced by new writes. So it would make sense to place writes into the index.
- generate the _next_ index on flush, referencing new writes as entries.
probably important to optimize away the approach of reiterating all leaves every flush; it would greatly limit write speed on older stores, and be much harder to comprehend when it is old.
basically this could mean adding state and/or methods to internodes so things like their path and depth don't need to be fully recalculated
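a sketch of the "generate the _next_ index on flush" idea, with hypothetical (start, end, source) index entries: the next index is the previous flush's entries with newly written spans punched out, plus the new writes as entries of their own.

def subtract(span, holes):
    # remove each hole interval from span, returning the leftover pieces
    pieces = [span]
    for h0, h1 in holes:
        next_pieces = []
        for p0, p1 in pieces:
            if h1 <= p0 or h0 >= p1:      # no overlap
                next_pieces.append((p0, p1))
                continue
            if p0 < h0:                   # left remainder
                next_pieces.append((p0, h0))
            if h1 < p1:                   # right remainder
                next_pieces.append((h1, p1))
        pieces = next_pieces
    return pieces

def next_index(prev_index, writes):
    # prev_index: [(start, end, source)] entries from the last flush
    # writes: [(offset, data)] accumulated since that flush
    holes = [(off, off + len(data)) for off, data in writes]
    entries = [(off, off + len(data), data) for off, data in writes]
    for start, end, source in prev_index:
        for p0, p1 in subtract((start, end), holes):
            entries.append((p0, p1, source))  # still served by the old flush
    return sorted(entries, key=lambda e: e[:2])

assert next_index([(0, 8, 'flush1')], [(2, b'xy')]) == [
    (0, 2, 'flush1'), (2, 4, b'xy'), (4, 8, 'flush1')]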
still looks important
stable first version: https://lists.cpunks.org/pipermail/cypherpunks/2022-July/102291.html
the attached appears to have comparable performance to the first version. i tried to implement all the same approaches (did not check for certain), except without all the O(n) simplifications, and with the index updated every write to help decide when to flush. I noticed the branch merging code is never hit; i should try to figure out what data would make it get used, or whether it is redundant or has a further mistake.
the next step for me is likely journaling out what conditions would cover the untested code, or what logic error is involved in it being there.
- I added expanding deep branches into every add, in an accumulating way equivalent to recursion. this keeps the tree depth within new bounds when the leaf count drops, and now some branches are merging, covering that code.
- I made the test a little more pathological by shortening the random writes so they sparsely cover old data more, which widens the tree and slows execution. thinking on addressing this.

I started copying the code to C a tiny bit, just a start, not attached.

thinking on handling pathological writes just a little:
- maybe ensuring there are no internodes with single children, so more leaves are clustered? (see the sketch after this list)
- maybe copying leaves up rather than internodes, as an investment in future use when the flush is deeper?
- considering what property relates to the slowdown: the number of nodes traversed? what about in languages with different properties?
- considering bounds on things: in a worst-case scenario with an ideal tree, how many new nodes should be attached to a new root, at most?

might make sense for me to just make some code that uses it, given some of the ambiguity here.
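for the single-children idea above, a sketch of the splice on a hypothetical Node with a children list:

class Node:
    def __init__(self, children=None):
        self.children = children or []

def collapse(node):
    # splice out internodes with exactly one child, so leaves cluster
    # under fewer, wider internodes and seek paths shorten
    while len(node.children) == 1:
        node = node.children[0]
    node.children = [collapse(c) for c in node.children]
    return node

leafish = Node()
deep = Node([Node([Node([leafish, Node()])])])
flat = collapse(deep)
assert len(flat.children) == 2  # two single-child layers spliced away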
contains bugfixes regarding missed writes and branches when merging. i started looking into append-only behavior (writing only to the last offset), and performance seems much worse than for other implementations (every root is increasingly wide), maybe because i am rebalancing the tree by simply attaching branches to the root. it's a little confusing to me nowadays how, when the tree is balanced while using append-only storage, this means adding internodes to the same storage location as the root. forming an equivalency between an ideal tree and the "update" tree, where all the nodes newly added as leaves are added go in the same "update" root node, could be helpful.
This is what i'm thinking of using as a base so as to reuse work when changing the structure to use index regions between writes. It makes large and shallow indexes atm.
Here i have added the snippet from my other thread, without making use of it. This makes the file lengthy, and hence possibly hard to continue working on. Its length should drop once I can integrate the parts, after integrating the algorithm.