[spam][crazy][ot] log: #3 write-anywhere to an append-only-log
so i'm thinking the big remaining bit to make an algorithm here that i like is to implement moving parts into balanced/full trees, away from recent writes. maybe i'll just accumulate little bits around that. can use chunk structure: class Chunk: start, end, data, height, leaf_count, age def __len__( def is_leaf( def is_full( new basic concept for add() function: retain areas separate from writes, that are shrunk or grown to maximum depth using full trees only goal: write that part of an add() function
Entry structure and Flush structure derive from Chunk. Entry wraps either one, providing a view as if bounds are limited.
def add(new_write): # hmm ... thinking of: the nearest write to the left and right of new_write # seems we'll want a sorted list of write indices, like in version #1
# all subtrees are kept as full trees. def add(new_write): # assume there is a sorted list of writes # merge this in. retain what the lengths of non-write data are to the left and right # if the writes trim non-write entries, then: # on the left, walk left-to-right, and on the right, walk right-to-left, covering the trimmed lengths # when between two writes, consider only the length that goes through the largest full tree between them: so as to leave smaller trees near the more distant write, to merge with a future flush, and not have unnecessary depth # there might be a little more to think about and adjust here, regarding producing these subtrees well. not sure. # when between a write and the edge of the data, consider the entire length # expand out full trees from the trimmed lengths, and merge them into the more distant full trees when they are equal or larger to those more distant full subtrees
there might be some merit to building the index on the fly rather than incrementally. basically, you want the biggest full subtrees between the writes, where these full subtrees are taken from the content of the last flush. the reason to have a full subtree is to limit the depth and sibling node count together. however, interestingly since some leaves end up on internodes, one could construct a subtree with the same depth that has more leaves than a full subtree would. in practice, the full subtree approach limits the sibling node count, because the extra subtrees that are in excess of the full count can be merged together with future writes, to make another equally-sized full subtree. :S what does that mean in a depth-focused approach
so basically an important property is not whether the tree is full or not, it is rather - whether it is refrained from exceeding the max depth - where parts of it are pulled out in order to produce that limit the parts of it that are pulled out should go near future writes, so as to combine with them and make a balanced tree depth > fullness
i guess it would be good to retain is_full() in parallel with height <= max_depth since the concept is partial . these are just metrics that can be applied to portions of the graph, in order to judge what to collect where. the metrics are basically the same.
current plan is to consider flushes as lists of regions of subtrees, terminated by either new writes or end-of-stream. then concepts around how to best consolidate those can develop around member functions of the lists.
Flush: - an updatable tree containing writes and indexes, designed to be useful for syncing with a storage medium. contains a list of subindexes SubIndex - a list of bounded regions (flush entries) of previous flushes, each held as a useful tree. contains links to its left and write to new writes contained in the flush.
class FlushRegion: def __init__(self, *entries, left_write = None, right_write = None): for left, right in zip(entries[:-1], entries[1:]): assert left.end <= right.start if left_write is not None and len(entries): assert left_write.end <= entries[0].start if right_write is not None: assert left_write.end <= right_write.start if right_write is not None and len(entries): assert right_write.start >= entries[-1].end self.entries = entries self.left_write = left_write self.right_write = right_write # i'm thinking of also renaming Flush.Entry to ChunkRegion to help improve habits around generalisation i think it makes sense for me to go back to thinking_of_2, and change the implementation a little to use the above classes. this helps me build ease around integrating the disparate work.
participants (1)
-
Undiscussed Horrific Abuse, One Victim of Many