[ot][spam][crazy][random][crazy]
the other thread engaged a little signal decomposition and deconstruction. i got to add a small algorithm for fourier transforms at non-recorded sample rates :D . now i'm thinking a tiny bit about the flat-tree thing. there's a thread here i was using, i think multiple threads. i also have work scattered around.
it's hard! i'm so prepared for the other task! some of my circadian bits aren't prepared atm for this. but i think i can do it. i did the other thing for 2-3 days !!
here are recent thoughts I posted to this list:

- slicing a chunk: easier if an interface for the store is specified, to retrieve written data
- optimizing a sequence that includes sliced chunks by reorganising their metadata. i think i included a number of optimizations in one of the attempts that might be copyable out.
- selecting which nodes to merge when writing a new chunk so as to make balanced trees. i really simplified code for this in the flat_tree repo.
1. selecting which nodes to merge (xloem/flat_tree append structure, can be improved to work from middle)
2. slicing chunks (means the library must have access to old internodes to pull chunks out of them)
3. optimizing out redundant information when reusing old nodes
I'm pretty sure I memorized a next step. The next step I memorized is to add an interface for data access to xloem/flat_tree . I have some confusions associated with xloem/flat_tree. It makes sense to take it a little slowly.
misunderstood]$ cd ~/src/flat_tree
[user@ flat_tree]$ ls
flat_tree  pyproject.toml  random_write_old.py  test.py
[user@ flat_tree]$ git pull
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 4 (delta 2), reused 4 (delta 2), pack-reused 0
Unpacking objects: 100% (4/4), 491 bytes | 32.00 KiB/s, done.
From https://github.com/xloem/flat_tree
   e3177e2..479a424  main       -> origin/main
Updating e3177e2..479a424
error: Your local changes to the following files would be overwritten by merge:
        flat_tree/mix_indices_oo.py
Please commit your changes or stash them before you merge.
Aborting
[user@ flat_tree]$
looks like i worked on it in two separate places. that slows things down. but i might have days to make progress on this! a number of files have changed locally. the fact that they're uncommitted implies they may not be stable. it can help a little to remember that i have strategies for engaging any possible design problem here. although i might note that some can be much harder than others; some are also far easier than they feel.
Here's the diff of the file git is specifically complaining about:

diff --git a/flat_tree/mix_indices.py b/flat_tree/mix_indices.py
index d49ec94..6742218 100644
--- a/flat_tree/mix_indices.py
+++ b/flat_tree/mix_indices.py
@@ -19,35 +19,30 @@ class append_indices(list):
         assert spliced_out_start == self.size # until testing truncation appends
         assert spliced_out_stop == self.size
-        running_size = 0
-        running_leaf_count = 0
+        balanced_size = 0
+        balanced_leaf_count = 0
         #leaf_count_of_partial_index_at_end_tmp = 0
         #proposed_leaf_count = self.leaf_count - leaf_count_of_partial_index_at_end_tmp
         #new_node_leaf_count = self.leaf_count # + 1
-
-        new_leaf_count = self.leaf_count
-        new_size = self.size
         for idx, (branch_leaf_count, branch_offset, branch_size, branch_id) in enumerate(self):
-            if branch_leaf_count * self.degree <= new_leaf_count: #proposed_leaf_count
+            if branch_leaf_count * self.degree <= self.leaf_count - balanced_leaf_count: #proposed_leaf_count
                 break
             #if new_node_offset + branch_size > spliced_out_start:
             #    break
-            running_size += branch_size
-            running_leaf_count += branch_leaf_count
+            balanced_size += branch_size
+            balanced_leaf_count += branch_leaf_count
             #proposed_leaf_count -= branch_leaf_count
             #new_node_leaf_count -= branch_leaf_count
             #new_total_leaf_count += branch_leaf_count
-            new_leaf_count -= branch_leaf_count
-            new_size -= branch_size
         else:
             idx = len(self)
-        assert new_size == sum((size for leaf_count, offset, size, value in self[idx:]))
+        assert self.size - balanced_size == sum((size for leaf_count, offset, size, value in self[idx:]))
         self[idx:] = (
-            #(leaf_count_of_partial_index_at_end_tmp, running_size, spliced_out_start - running_size, last_publish),
-            (new_leaf_count, running_size, new_size, last_publish),
+            #(leaf_count_of_partial_index_at_end_tmp, balanced_size, spliced_out_start - balanced_size, last_publish),
+            (self.leaf_count - balanced_leaf_count, balanced_size, self.size - balanced_size, last_publish),
             (-1, 0, spliced_in_size, spliced_in_data)
         )
On 11/12/22, Undescribed Horrific Abuse, One Victim & Survivor of Many <gmkarl@gmail.com> wrote:
Here's the diff of the file git is specifically complaining about:
diff --git a/flat_tree/mix_indices.py b/flat_tree/mix_indices.py
index d49ec94..6742218 100644
--- a/flat_tree/mix_indices.py
+++ b/flat_tree/mix_indices.py
@@ -19,35 +19,30 @@ class append_indices(list):
         assert spliced_out_start == self.size # until testing truncation appends
         assert spliced_out_stop == self.size
-        running_size = 0
-        running_leaf_count = 0
+        balanced_size = 0
+        balanced_leaf_count = 0
I'm guessing this change is just naming clarity.
         #leaf_count_of_partial_index_at_end_tmp = 0
         #proposed_leaf_count = self.leaf_count - leaf_count_of_partial_index_at_end_tmp
         #new_node_leaf_count = self.leaf_count # + 1
-
-        new_leaf_count = self.leaf_count
-        new_size = self.size
Reasoning for this likely comes out later on in the file. Looks harmless so far.
         for idx, (branch_leaf_count, branch_offset, branch_size, branch_id) in enumerate(self):
-            if branch_leaf_count * self.degree <= new_leaf_count: #proposed_leaf_count
+            if branch_leaf_count * self.degree <= self.leaf_count - balanced_leaf_count: #proposed_leaf_count
                 break
Okay, I chose to remove the local variables and reference the members directly. I guess that's harmless.
             #if new_node_offset + branch_size > spliced_out_start:
             #    break
-            running_size += branch_size
-            running_leaf_count += branch_leaf_count
+            balanced_size += branch_size
+            balanced_leaf_count += branch_leaf_count
Here's a propagation of the naming clarity change. Naming clarity changes are likely good things. I probably changed this to help me remember what was going on when I was confused.
             #proposed_leaf_count -= branch_leaf_count
             #new_node_leaf_count -= branch_leaf_count
             #new_total_leaf_count += branch_leaf_count
-            new_leaf_count -= branch_leaf_count
-            new_size -= branch_size
Okay, these are removed because the value is calculated from member variables, above.
         else:
             idx = len(self)
-        assert new_size == sum((size for leaf_count, offset, size, value in self[idx:]))
+        assert self.size - balanced_size == sum((size for leaf_count, offset, size, value in self[idx:]))
Here, again, making a calculation based on member variables, rather than caching it with local variables. Looks equivalent. If I have to merge something here, I'll probably want to memorise this change I made, so as to understand how to unify it with something else, or whether to choose to reject it.
         self[idx:] = (
-            #(leaf_count_of_partial_index_at_end_tmp, running_size, spliced_out_start - running_size, last_publish),
-            (new_leaf_count, running_size, new_size, last_publish),
+            #(leaf_count_of_partial_index_at_end_tmp, balanced_size, spliced_out_start - balanced_size, last_publish),
+            (self.leaf_count - balanced_leaf_count, balanced_size, self.size - balanced_size, last_publish),
             (-1, 0, spliced_in_size, spliced_in_data)
         )
And finally, at the end, um ..... looks like I just made the same change to calculate based on member variables rather than tracking a local variable. I guess it's harmless enough. Seems like it might help one think about bounds of thread safety a little.
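Distilling that for myself, here's the merge-selection rule the diff seems to implement, as a standalone sketch. The entry order (leaf_count, offset, size, id) follows mix_indices.py; the function name split_point and the parameter names are mine, not the repo's:

def split_point(index, total_leaf_count, degree):
    # walk branches from the largest/oldest end; keep each branch while it
    # still holds more than 1/degree of the leaves remaining after the
    # branches already kept -- this mirrors the diff's condition
    # branch_leaf_count * self.degree <= self.leaf_count - balanced_leaf_count
    remaining = total_leaf_count
    for idx, (leaf_count, offset, size, node_id) in enumerate(index):
        if leaf_count * degree <= remaining:
            return idx   # merge index[idx:] plus the incoming leaf into one node
        remaining -= leaf_count
    return len(index)    # every branch is heavy enough; just append

So heavy old branches stay untouched and the light tail gets repeatedly folded together, which is what keeps the append structure balanced.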
Well, this is unfortunate:

$ python3 test.py
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13), (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
Traceback (most recent call last):
  File "/shared/src/flat_tree/test.py", line 43, in <module>
    cmp = b''.join(iterate(index, 0, index.byte_count))
  File "/shared/src/flat_tree/test.py", line 30, in iterate
    data = b''.join(iterate(subindex, adjusted_start, min(adjusted_end, subsize - adjusted_start)))
  File "/shared/src/flat_tree/test.py", line 34, in iterate
    assert len(data) == subleafcount
AssertionError
I'm really in the middle of a lot of stuff here!

$ vim test.py
E325: ATTENTION
Found a swap file by the name ".test.py.swp"
          owned by: user   dated: Sun Sep 04 04:02:22 2022
         file name: /shared/src/flat_tree/test.py
          modified: no
         user name: user   host name: DESKTOP-E3P7TC0
        process ID: 18728
While opening file "test.py"
             dated: Sun Sep 04 04:02:17 2022

(1) Another program may be editing the same file.  If this is the case,
    be careful not to end up with two different instances of the same
    file when making changes.  Quit, or continue with caution.
(2) An edit session for this file crashed.
    If this is the case, use ":recover" or "vim -r test.py"
    to recover the changes (see ":help recovery").
    If you did this already, delete the swap file ".test.py.swp"
    to avoid this message.

Swap file ".test.py.swp" already exists!
[O]pen Read-Only, (E)dit anyway, (R)ecover, (D)elete it, (Q)uit, (A)bort:
$ git status test.py
On branch main
Your branch is behind 'origin/main' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   test.py

no changes added to commit (use "git add" and/or "git commit -a")
[user@ flat_tree]$ cp test.py test.py.bak
[user@ flat_tree]$ vim -r test.py
Using swap file ".test.py.swp"
Original file "/shared/src/flat_tree/test.py"
Recovery completed. Buffer contents equals file contents.
You may want to delete the .swp file now.
:wq
[user@ flat_tree]$ diff -u test.py test.py.bak   # no changes detected from swapfile
[user@ flat_tree]$ rm test.py.bak .test.py.swp
so, in flat tree, i left some parts open for the different states of mind i have that managed to get that far, so as to work on a reusable (and public) part. right now test.py is importing flat_tree from flat_tree:

#import append_indices
#import mix_indices as append_indices
from flat_tree import flat_tree#_mix_indices_oo as flat_tree

but it used to import append_indices, mix_indices, etc. it looks like i migrated to referencing the flat_tree module for the different files ... it looks like i'm just using a flat_tree class from flat_tree, which itself must wrap the backendy things i tried to start.

index = flat_tree(degree=3) # this is the only mention of the import

so, on to flat_tree.py . I wonder if it has pending state too.
[user@ flat_tree]$ vim flat_tree/__init__.py

then, down at the bottom ...

#flat_tree = flat_tree_mix_indices_oo
flat_tree = flat_tree_append_indices

So the test isn't actually using mix_indices. It's instead using append_indices, the older form that is supposed to be more stable. I must have broken it, then resonated discouragement with inhibition etc and failed to continue, had amnesia, etc etc.
That would be me if I ever had an advanced programming job before I went crazy. ("I'll debug it !!!!"). I'm such a debugger. I was excited when I discovered memory errors and compiler bugs. My strategy of always trying to come up with a new best way of doing everything meant I ran into bugs in everything I used, since I wasn't doing anything the way anybody else did. Open source software is a godsend in such a situation, and I am not going back to debugging microsoft word in assembly. Nor would I make the slightest headway at that in my modern state of mind.
left out: I run into _way_ more bugs now than I did back then. It's like absolutely everything is a bug. Things break faster than I can use them. Something has changed. I suspect multiple things have changed.

Here's the test failure again that I'm planning to slowly plod into:

$ python3 test.py
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13), (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
Traceback (most recent call last):
  File "/shared/src/flat_tree/test.py", line 43, in <module>
    cmp = b''.join(iterate(index, 0, index.byte_count))
  File "/shared/src/flat_tree/test.py", line 30, in iterate
    data = b''.join(iterate(subindex, adjusted_start, min(adjusted_end, subsize - adjusted_start)))
  File "/shared/src/flat_tree/test.py", line 34, in iterate
    assert len(data) == subleafcount
AssertionError
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
This is its test data. It wrote these random bytes into the tree, and verifies that they come back right.
[(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13), (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
This looks like the tree index after the final write. The tuples are edges linking to older indices (subtrees). The final component is the most recent data, and some metadata regarding it.
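To keep myself oriented, here's how I read those tuples, under the (leaf_count, id, offset, size) unpacking that test.py uses further down; leaf_count == -1 marking inline leaf bytes is my inference from the last entry:

index = [(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13),
         (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
for leaf_count, ref, offset, size in index:
    if leaf_count == -1:
        print(f'inline leaf {ref!r} covering bytes [{offset}, {offset + size})')
    else:
        print(f'subtree id {ref}: {leaf_count} leaves covering bytes [{offset}, {offset + size})')
assert sum(size for _, _, _, size in index) == 112  # entries tile 0..112 with no gaps

The offsets (0, 45, 84, 97, 98, 106) each pick up exactly where the previous entry's size left off, which is reassuring.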
Traceback (most recent call last):
  File "/shared/src/flat_tree/test.py", line 43, in <module>

This is the test file.
cmp = b''.join(iterate(index, 0, index.byte_count))
Okay, inside the outer test file, it is iterating over an index, the range being the entire data, and joining the bytes together, on line 43. So it hasn't compared the data yet: it's inside the function that returns it.
File "/shared/src/flat_tree/test.py", line 30, in iterate data = b''.join(iterate(subindex, adjusted_start, min(adjusted_end, subsize - adjusted_start)))
Looks like the iterate function is local to the test file, and it looks like it calls itself recursively. I'm vaguely remembering that the library presently just generates indices and doesn't provide much of an interface to them. So here it is doing a recursive call. Given the call up the stack is in the main file body, this is likely iterating at one level below the root index. It would be appropriate to know the values of the parameters passed to the function here.
File "/shared/src/flat_tree/test.py", line 34, in iterate assert len(data) == subleafcount AssertionError
Now it's _still_ inside the iterate function in the main file, and an assertion is failing. And now I have helpful information: this assertion must be failing regarding data in the second row of the tree. This helps me so much as before I had no context for the failure. len(data) == subleafcount implies that data is a list of all the trees below the tree. So it likely already performed a depth-first enumeration of every child of this node, and is preparing to return out to the parent call. Time to visit it in pdb.
[user@ flat_tree]$ python3 -m pdb test.py
> /shared/src/flat_tree/test.py(3)<module>()
-> from flat_tree import flat_tree#_mix_indices_oo as flat_tree
(Pdb) cont
...
  File "/shared/src/flat_tree/test.py", line 34, in iterate
    assert len(data) == subleafcount
AssertionError
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /shared/src/flat_tree/test.py(34)iterate()
-> assert len(data) == subleafcount
(Pdb) p len(data), subleafcount
(1, 3)
(Pdb) p data
[b'\xc3\x04\xbd\xfc\xe8']
(Pdb) list 20
 15         index.append(id, len(chunk), chunk)
 16
 17     def iterate(index, start_offset, end_offset):
 18         subendoffset = 0
 19         for subleafcount, subid, suboffset, subsize in index:
 20             substartoffset = subendoffset
 21             subendoffset += subsize
(Pdb) p index, start_offset, end_offset, subleafcount, subid, suboffset, subsize
([(9, 9, 0, 45), (3, 12, 45, 10), (3, 15, 55, 10), (1, 16, 65, 7), (1, 17, 72, 4), (-1, b'\x13ZTq\x9e\xf3\x84\xdd', 76, 8)], 45, -6, 3, 12, 45, 10)

let's separate those out:

index         [(9, 9, 0, 45), (3, 12, 45, 10), (3, 15, 55, 10), (1, 16, 65, 7), (1, 17, 72, 4), (-1, b'\x13ZTq\x9e\xf3\x84\xdd', 76, 8)]
start_offset  45
end_offset    -6
subleafcount  3
subid         12
suboffset     45
subsize       10

that's weird that end_offset is negative, when accessing an append-only structure
it's so exciting to work on this isn't it!!! yes!!!! isn't it!!! i think maybe excitement might help me continue
(Pdb) up
> /shared/src/flat_tree/test.py(30)iterate()
-> data = b''.join(iterate(subindex, adjusted_start, min(adjusted_end, subsize - adjusted_start)))
(Pdb) list
 25                 yield data
 26             else:
 27                 subindex = stored_indices[subid]
 28                 adjusted_start = start_offset - substartoffset + suboffset
 29                 adjusted_end = end_offset - substartoffset + suboffset
 30  ->             data = b''.join(iterate(subindex, adjusted_start, min(adjusted_end, subsize - adjusted_start)))
 31                 data = list(iterate(subindex, adjusted_start, min(adjusted_end, subsize + suboffset)))

Immediately, it looks like I somehow uncommented two recursive calls that imply conflicting things. It's clear only one of those assignments to data should be uncommented. Given the length of data is checked, the b''.join() call would be a wrong one. It's notable that the last parameter in the two calls is different, unfortunately. So, which one is correct? Probably the other, given there's a negative offset in this use? I commented out the recursive call it crashed in, and the tests now pass.
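To convince myself the recursion shape is right, I wrote a loose, self-contained model of the range read. The entry layout matches the test's unpacking; the clamping and the coordinate convention for sub-indices are my guesses here, not the repo's code:

# toy model: an index is a list of (leaf_count, id_or_bytes, offset, size);
# leaf_count == -1 means inline leaf data, otherwise the id names a
# sub-index in stored_indices whose own coordinates hold the entry's
# bytes at [offset, offset + size)
stored_indices = {0: [(-1, b'hello ', 0, 6)]}

def iterate(index, start, end):
    pos = 0
    for leaf_count, ref, offset, size in index:
        entry_start, entry_end = pos, pos + size
        pos = entry_end
        if entry_end <= start or entry_start >= end:
            continue  # entry wholly outside the requested range
        lo = max(start, entry_start) - entry_start
        hi = min(end, entry_end) - entry_start
        if leaf_count == -1:
            yield ref[lo:hi]
        else:
            yield from iterate(stored_indices[ref], offset + lo, offset + hi)

root = [(1, 0, 0, 6), (-1, b'world', 6, 5)]
assert b''.join(iterate(root, 0, 11)) == b'hello world'
assert b''.join(iterate(root, 3, 8)) == b'lo wo'

Clamping both ends of the range before recursing is what avoids the negative end_offset that showed up in pdb.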
[user@ flat_tree]$ python3 test.py
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13), (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[user@ flat_tree]$ echo $?
0
[user@ flat_tree]$ git status
On branch main
Your branch is behind 'origin/main' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   flat_tree/__init__.py
        modified:   flat_tree/append_indices.py
        modified:   flat_tree/mix_indices.py
        modified:   flat_tree/mix_indices_oo.py
        modified:   test.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        flat_tree/__pycache__/

no changes added to commit (use "git add" and/or "git commit -a")
[user@ flat_tree]$ _
I'm looking at the changes to __init__.py . ...

[user@ flat_tree]$ git add flat_tree/__init__.py flat_tree/append_indices.py test.py
[user@ flat_tree]$ git commit -m 'uncommitted changes related to append_indices. the changes to the subfolder files are dated Sep 4.'
[main e7fa3d8] uncommitted changes related to append_indices. the changes to the subfolder files are dated Sep 4.
 3 files changed, 32 insertions(+), 13 deletions(-)

A merge still won't work because I haven't committed mix_indices, but I recall its changes are reasonable.
[user@ flat_tree]$ git add flat_tree/mix_indices.py
[user@ flat_tree]$ git commit -m 'uncommitted changes to mix_indices dated Sep 4 03:50'
[main 1945353] uncommitted changes to mix_indices dated Sep 4 03:50
 1 file changed, 8 insertions(+), 13 deletions(-)
[user@ flat_tree]$ git pull
error: Your local changes to the following files would be overwritten by merge:
        flat_tree/mix_indices_oo.py
Please commit your changes or stash them before you merge.
Aborting
[user@ flat_tree]$ _

I guess I remembered or saw something wrong. I thought there was only one file there was a problem with, and it was mix_indices.py . I must have seen it wrong. Either way, now I can address this file. mix_indices_oo.py is more in flux. It contains a lot of half-starts that I then got [painfully strong avoidance influence] around. I would shift them down and try something else. I was kind of hoping to sidestep, but there is no need to!
yeah, looks like both files had the same changes happen to them

1905

Merge branch 'main' of https://github.com/xloem/flat_tree
Auto-merging flat_tree/mix_indices_oo.py
Merge made by the 'ort' strategy.
 flat_tree/mix_indices_oo.py | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)
[user@ flat_tree]$ python3 test.py
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[(9, 9, 0, 45), (9, 18, 45, 39), (3, 21, 84, 13), (1, 22, 97, 1), (1, 23, 98, 8), (-1, b'j\xf1p>\xf3\xba', 106, 6)]
b'\xacL\t\xc2\x06\xe7\xe34/]\n^HBx\xc3\xa9g\xb3g\x11\xeb\x87\xbd\xc1\xf7z\x7f\x86X\x95C\xe4v(3\x81\x86&Hb|\xc1M\x9e\xc3\x04\xbd\xfc\xe8\xd4eO\xd6\xba\x1a\xe6WX\xdexO\x8f\x92ZUD\xf2P\x16c\x9c\xc1\x83s\xf7\x13ZTq\x9e\xf3\x84\xdd\x02\x8e\xf1\x0f\xce\x17\x16+\xd3\xb5\n)\xd3\x9cz\xf4\xfb\xd3*7\x1b\xdej\xf1p>\xf3\xba'
[user@ flat_tree]$ echo $?
0
[user@ flat_tree]$ git push
Enumerating objects: 32, done.
Counting objects: 100% (28/28), done.
Delta compression using up to 4 threads
Compressing objects: 100% (18/18), done.
Writing objects: 100% (18/18), 2.46 KiB | 2.46 MiB/s, done.
Total 18 (delta 9), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (9/9), completed with 5 local objects.
To https://github.com/xloem/flat_tree
   479a424..7e9ff54  main -> main

I should now test my "log" repository, which depends on this one.
[user@ log]$ echo 'Hello, universe!' | python3 capture_stdin.py
warning: this script hopefully works but drops chunks due to waiting on network and not buffering input
Capturing ...
{"ditem": ["xh3tnvuWDvIW-oCppJDT0I-g4mGPvxbefQQk91Vp1eQ"], "min_block": [1056137, "1Y6t7APOqSpYeZM8XUL78webqgvrnVzr3gFKhlKscygSmdtyPVBT7VulnSzzhkYk"], "api_block": 1056537}
Traceback (most recent call last):
  File "/shared/src/log/capture_stdin.py", line 65, in <module>
    api_block = data_array[-1]['block'],
IndexError: list index out of range

I guess I'll debug this error!
1909

I ran it interactively, and the crash happens after the input stream is closed.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/usr/local/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/usr/local/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/shared/src/log/capture_stdin.py", line 65, in <module>
    api_block = data_array[-1]['block'],
IndexError: list index out of range
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /shared/src/log/capture_stdin.py(65)<module>()
-> api_block = data_array[-1]['block'],
(Pdb) p data_array
[]
For this capture_stdin.py, I actually made a python project with a plan to add to it, to make things better. I think it handles dropped input buffers better or something. Not sure immediately where it is. Focusing on flat_tree.

1910

(Pdb) list 50
 45
 46     current_block = peer.current_block()
 47     last_time = time.time()
 48
 49     while True:
 50         raw = capture.read(100000*16)#100000)
 51         data_array = []
 52         for offset in range(0,len(raw),100000):
 53             data_array.append(send(raw[offset:offset+100000]))
 54         if time.time() > last_time + 60:
 55             current_block = peer.current_block()
(Pdb) p len(raw)
0

Seems like it should terminate when len(raw) is 0.
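A sketch of the fix I have in mind, as a self-contained loop; capture_loop is my name, and capture/send stand in for the script's stream and upload call from the listing above:

import io

def capture_loop(capture, send, chunk=100000, batch=16):
    while True:
        raw = capture.read(chunk * batch)
        if not raw:
            break  # stdin closed: stop instead of processing an empty read
        data_array = []
        for offset in range(0, len(raw), chunk):
            data_array.append(send(raw[offset:offset + chunk]))

capture_loop(io.BytesIO(b'Hello, universe!\n'), print)

With the early break, the code after the loop never sees a freshly emptied data_array.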
[user@ log]$ echo 'Hello, Universe!' | python3 capture_stdin.py
warning: this script hopefully works but drops chunks due to waiting on network and not buffering input
Capturing ...
{"ditem": ["8sONkszk1eyZI946JUgc0-4Na1uTcoormJvXhGZrBNs"], "min_block": [1056138, "heexX--aX8w9eRODqdvw0He7qPYLQC4VgKJ3UQ4w8KCB8q-oy1J106ao6lDhlSty"], "api_block": 1056538}
[user@ log]$ echo $?
0

To test the flat_tree behavior, I'll want more than one node, so I'll send the data through pv to ratelimit it.
1914 1915

looks like it is blocking and reading 1.6 MB chunks of data at once. I can hack it to read only 4 bytes.

[user@ log]$ echo 'Hello, Universe!' | python3 capture_stdin.py
warning: this script hopefully works but drops chunks due to waiting on network and not buffering input
Capturing ...
{"ditem": ["p-hxoXSPCy1DM8zJWg4GUtctZxYZDheLPhEFqMsOInY"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}
{"ditem": ["T7UNzpyArtyZU40dS9Z4zJ0gY1cTmyWLtDvrmWIls00"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}
{"ditem": ["d_kmualjoqspfSQxyZPXaNRMn-jDDlhQNZuEjOGA0Jw"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}
{"ditem": ["9Y4GsFXDP9nY2DwqtHGH0i4oLLjZQIoHLcNek0qKuf8"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}
{"ditem": ["2vVoAxb298bO9j8kJ4PpksaZAYKFnBRSzn_UOVZekMQ"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}

The last output data is the root node. I've started the download and it's caching blocks to try to find the data. Presently it has to be mined for it to find it.

[user@ log]$ echo '{"ditem": ["2vVoAxb298bO9j8kJ4PpksaZAYKFnBRSzn_UOVZekMQ"], "min_block": [1056139, "UOK0KqI17Lsg37QwPUBmSLodqcirmZ1KuaanCeTksrDLmtiS4xWPLRy6ErGOP7C2"], "api_block": 1056539}' > hellouniverse.json
[user@ log]$ python3 download.py hellouniverse.json
Caching 1056140: 100%|████████████████████████████████████████| 86/86 [00:11<00:00, 7.66tx/s]

But I forgot to update flat_tree, so this isn't actually testing it.

1918

[user@ log]$ pip3 install --upgrade git+https://github.com/xloem/flat_tree@main
[user@ log]$ echo '{"ditem": ["9fHmJ9utcCSAYPVPGUau98J7fl1i-lI9E8TvvQsx6CA"], "min_block": [1056141, "PvRhlqyLRwsUjkFLL19PYc872olmvDXbgfizaWLzMasIce4tnsg3JILUOuVHFU7U"], "api_block": 1056541}' > hellouniverse.json

It ran with the new flat_tree! To give it a little time to consider being mined, I'll ensure the script actually _uses_ flat_tree and doesn't craft the structure by hand.

from flat_tree import flat_tree
indices = flat_tree(3) #append_indices(3)

It looks like it's working.
I'm trying to download it. It still has [legacy?] code that checks the pending txs. Arweave doesn't provide access to data that's in pending txs, as far as I know. I think you need to listen on ports to get the data ... it might even need to come from users who upload it directly, I'm not sure. At least Arweave exists, even if it's full of compromises. My usual strategy is to repeat the download, waiting for blocks to mine.

It's downloading very slowly, which implies (i guess it's still using a gateway?) the thing it is contacting is responding slowly for some request or another. I kind of use HTTP things that aren't fully documented to read only parts of the data. The situation of how those work may have changed. But it seems more likely the data is just still propagating. Not sure. I'll step into it and see where the holdup is.

1924

I hit ctrl-C when it was paused, and it was actually in CPU execution rather than network blocking. it's parsing a _huge_ bundle. It's on bundled item # 253681 . That is a lot of items. Due to a bug in pdb it's not showing me the total item count.

1926

it looks like this large bundle is GK_ffFFjA39-FuvzXoJ47CQvon6uWYRYix55LGcj0tk that it is slowly parsing. for each entry, it seeks to the area in the bundle and parses the header to read the tags. this generally means an http request for partial data, or to download the data chunk that contains the header.

1927

It's at 2150646 . That's an order of magnitude larger than before. I'm wondering if it could have overflowed something.

(Pdb) p stream.tell()
91133
(Pdb) p idx
2150646

That doesn't look right. Here's the code it's executing:

 82             length_id_pairs = (
 83                 (int.from_bytes(stream.read(32), 'little'), b64enc(stream.read(32)))
 84  ->             for idx in range(entryct)
 85             )

I stepped into the stream.read function:

 44         def readinto(self, b):
 45  ->         if self.offset >= self.end:
 46                 return 0
 47
 48             if (
 49                 self.chunk is None or
 50                 self.chunk.start_offset > self.offset or

(Pdb) p self.offset
91133
(Pdb) p self.end
91133

Oops.
So, the downloader is failing, and without the downloader working it's hard to test that flat_tree is good! The part of the downloader that is failing presently seems unrelated to flat_tree. I'm stepping up in the callstack. Because by happenstance I made length_id_pairs a generator expression rather than a list, it's actually executing code farther down the function, where the variable is used. It's evaluating lazily. When I get far enough up, I can evaluate entryct.

(Pdb) p entryct
16503103374363018573310452198762115323676072158390965418376578654689494076191

It looks like there is an error parsing the bundle that is not being caught. It's possible the bundle format changed! Or it could just be corrupt data.
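A tiny illustration of that gotcha, apart from the downloader: a generator expression does its work when consumed, so the error surfaces at the consuming line, in a frame far from where the expression was written:

nums = ['1', '2', 'x', '4']
parsed = (int(s) for s in nums)   # nothing is parsed at this line

try:
    total = sum(parsed)           # the ValueError surfaces here instead
except ValueError as exc:
    print('raised at consumption time:', exc)

That's exactly why pdb showed the stream reads executing down at the point where length_id_pairs was first used.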
Ok, the bug is indeed on my end. At https://viewblock.io/arweave/tx/GK_ffFFjA39-FuvzXoJ47CQvon6uWYRYix55LGcj0tk one can see that this transaction is not a bundle. It's gzipped data from KYVE, a data streaming project that I haven't looked into extensively.

1935

So, for some reason my code thinks this transaction is a bundle. I've restarted pdb and stepped into the code that lists the bundles.

(Pdb) n
> /shared/src/log/download.py(141)block_bundles()
-> self._cache_block(block)
Guess I'd better check that function out.

(Pdb) l
136                 if block == current_height + 1:
137                     pending = self.peer.tx_pending()
138                     return self._txs2bundles(pending, f'{len(pending)} pending txs', unconfirmed = True)
139                 else:
140                     raise KeyError(f'block {block} does not exist yet')
141  ->         self._cache_block(block)
142             bundles = self.bundle_cache[block]
143             return bundles
144         def block_height(self, block):
145             if type(block) is list:
146                 for block in block:

Looks like the bundles are indeed enumerated in the _cache_block function.

(Pdb) list
119                 tags = self.peer.tx_tags(txid)
120             if any((ar.utils.b64dec_if_not_bytes(tag['name']).startswith(b'Bundle') for tag in tags)):
121                 bundles.append(txid)
122         return bundles
123     def _cache_block(self, block):
124  ->     block = self.fetch_block(block)
125         bundles = self._txs2bundles(block.txs, f'Caching {block.height}')
126         self.bundle_cache[block.height] = bundles
127         self.height_cache[block.indep_hash] = block.height
128         return block.height, bundles
129     def block_bundles(self, block):

Now I'm in _txs2bundles and I found where it selects bundles and placed a breakpoint.

(Pdb) list
116             if unconfirmed:
117                 tags = Transaction.frombytes(self.peer.unconfirmed_tx2(txid)).tags
118             else:
119                 tags = self.peer.tx_tags(txid)
120             if any((ar.utils.b64dec_if_not_bytes(tag['name']).startswith(b'Bundle') for tag in tags)):
121 B->             bundles.append(txid)
122         return bundles
123     def _cache_block(self, block):
124         block = self.fetch_block(block)
125         bundles = self._txs2bundles(block.txs, f'Caching {block.height}')
126         self.bundle_cache[block.height] = bundles

(Pdb) p txid
'gF9DQa99YKsNKeDHQLOT2inuQuBGvLUtTsGD9NMeVM8'

That's the first bundle it thinks it found in this block. The filter for what is a bundle is visible here too. It looks for tags with names that start with the string "Bundle". Maybe the KYVE transaction matches that.

BundleSize     100
FromKey        3638932
ToKey          3639031
BundleSummary  0x4463f4562edc86bc82521cae492ea54a6793efa3d41f654e3a60348e4abaf6ab

Yeah. This tx isn't a bundle. I can check the ANS-104 and ANS-102 specs to see what is a bundle. I could also add checking code to my implementation so it is more user-friendly in the future :S this would be a good idea, but since this isn't actually my current task I might not this time :S which is maybe not the best idea, unsure.

Arweave bundle specs:
https://github.com/ArweaveTeam/arweave-standards/blob/master/ans/ANS-102.md
https://github.com/ArweaveTeam/arweave-standards/blob/master/ans/ANS-104.md

Their tags are Bundle-Format and Bundle-Version. I guess I'll just be more strict like that.
[user@ log]$ git diff download.py
diff --git a/download.py b/download.py
index 448addb..23721a9 100644
--- a/download.py
+++ b/download.py
@@ -117,7 +117,7 @@ class Stream:
                 tags = Transaction.frombytes(self.peer.unconfirmed_tx2(txid)).tags
             else:
                 tags = self.peer.tx_tags(txid)
-            if any((ar.utils.b64dec_if_not_bytes(tag['name']).startswith(b'Bundle') for tag in tags)):
+            if any((ar.utils.b64dec_if_not_bytes(tag['name']) in (b'Bundle-Format', b'Bundle-Version') for tag in tags)):
                 bundles.append(txid)
             return bundles
         def _cache_block(self, block):
[user@ log]$ python3 download.py hellouniverse.json
Caching 1056142: 100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 10.92tx/s]
Caching 1056143: 100%|████████████████████████████████████████| 40/40 [00:06<00:00, 6.22tx/s]
Caching 1056144: 100%|████████████████████████████████████████| 24/24 [00:02<00:00, 8.81tx/s]
Caching 1056145: 100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 7.01tx/s]
Caching 1056146: 100%|████████████████████████████████████████| 72/72 [00:08<00:00, 8.66tx/s]
yielding capture @ 0
channel data: capture: 4
yielding capture @ 4
channel data: capture: 4
yielding capture @ 8
channel data: capture: 4
yielding capture @ 12
channel data: capture: 4
yielding capture @ 16
channel data: capture: 1
Hello, Universe!

It works :D
flat_tree stands in a mud pit, covered in mud from a mud rain! rain down that mud on flat_tree! we observe the properties of mud splats, taking note on velocities of mud particles.
an interface for writing/reading needs a way to:
- store data, and get a locator back for it
- provide a locator, and retrieve data

I started this:

class IStore:
    def fetch(self, locator):
        raise NotImplementedError("fetch")
    def store(self, data):
        raise NotImplementedError("store")
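To feel out the interface, a toy in-memory implementation; the content-addressed sha256 locator is just a placeholder of mine, since the real locator would presumably be something like an Arweave transaction id:

import hashlib

class MemoryStore(IStore):
    # toy IStore: locators are sha256 hex digests of the stored bytes
    def __init__(self):
        self._blobs = {}
    def store(self, data):
        locator = hashlib.sha256(data).hexdigest()
        self._blobs[locator] = data
        return locator
    def fetch(self, locator):
        return self._blobs[locator]

store = MemoryStore()
locator = store.store(b'some chunk')
assert store.fetch(locator) == b'some chunk'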
Once upon a time, there was a budding young scientist, only about a few thousand years old, wandering the pretty fields. This budding young scientist was so excited about measuring things! "Oh, I want to measure the wildflowers!" exclaimed the young thousand-year-old worker! They took out a ruler they had bought at a drug store, and began to measure. Some flowers were very tiny! Only a few millimeters wide! A millimeter is like a small fraction of an inch. Other flowers were _really big_. Most of these flowers were growing near houses. Each flower got many numbers! There was the number for the widest distance from the tip of one petal to the tip of another, and the number for the distance from the start of a petal to its end: you had to average all these; and the number of how wide the round bit in the middle of the flower was! And don't forget the average width of the petals! Each flower had a height, as well, and the stems had a width, although the stem width wasn't as interesting. The bud scientist wouldn't store the numbers yet. They were just so excited about measuring at all!
Later, the measurer wanted to figure out about making a _database_ of all the numbers they encountered. They got out a book about SQL and NoSQL and spreadsheets and all sorts of different things. They planned and planned! Wouldn't it be fun to put all the numbers in a database, and calculate the averages over all of them! We could group them into different nearnesses, and try to guess about properties of larger populations, out on the landscape!
They already had a ruler, but went on a trip to the drug store again to get a notepad, so as to write down some of the numbers they found. There were numbers _everywhere!_ On the way to the drug store, they encountered wildflowers by the road, and took time to measure a lot of them.
---- I made a very basic interface for fetch/store . I made it pretty dissociated, so the name and interface choice is probably not quite what I'd expect. Maybe I can think about its use a little, to work on that.
There are two major approaches going on in flat_tree: systems that can only append to the end, and systems that can write to the middle (or even extend backwards). I'm trying to implement the latter. With flat_tree, the purpose is to make it usable, to help me stabilise it. Using something helps it stabilize, for me: I'm less likely to throw it out if it's something I'm using. So it's a little public facing, and an easy-to-use and unlikely-to-change public interface is helpful. Adding this fetch/store thing likely means changing the log project to use it (a rough sketch of that wiring follows below).
- The current code is using json, and each json item has "fields", although I think it just stores the field data as a list of values.
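Here's roughly the wiring I imagine, going off the index.append(id, len(chunk), chunk) call that was visible in test.py earlier; append_chunk and everything around it is hypothetical, not code from either repo:

def append_chunk(index, store, chunk):
    locator = store.store(chunk)              # persist the raw bytes
    index.append(locator, len(chunk), chunk)  # record them in the tree index
    return locator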