
welcome to 8:24pm on sunday evening. it's been hard to move near the code i started. tomorrow morning i have an appointment again. maybe i can paste in a snippet and comment on a concern; this is part of where i left off. last time i elided the in-progress commenting, here it's included:

```python
idx = 0
while idx < len(offset_lengths):
    #for idx in pbar:#range(len(offset_lengths)):
    #offset, length = offset_lengths[idx]
    #tail = min(offset + length, len(self.mmap))
    #aligned_offset = (offset // self.blksize) * self.blksize
    #aligned_tail = min(self.size(), (((tail - 1) // self.blksize) + 1) * self.blksize)
    aligned_offset = aligned_offsets[idx].item()
    if aligned_offset > min_hole:
        next_hole = self._next_sparse(aligned_offset, os.SEEK_HOLE)
    else:
        next_hole = min_hole
    missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]
    # here we are handling all undownloaded indices before the next cached ones
    # there could be many pages between them that don't need to be fetched
    num_missing_idcs = missing_idcs.shape[0]
    if num_missing_idcs > 0:
```

i expect the logic in the above is not quite correct. i've been trying to re-understand `missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]`. i think this line was ported from the non-batched loop in a naive manner. roughly, i need to figure out how to correctly process chunks of idcs that are and are not cached. when they are cached, they are represented by a data region in the sparse mmap file; when they are not cached, they are represented by a hole region. `self._next_sparse(off, region)` is a wrapper around `os.lseek` -- given an offset and a region type, it returns the offset of the next byte with the given region type, such that it returns the passed offset itself if it is already of that region type. this is presently the only interface for determining what is cached and what is not, and it works fine for procedural scans and should work fine here for now.
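for context, `_next_sparse` is roughly this thin `os.lseek` wrapper. this is a sketch -- `SparseCacheSketch` and its `fd` attribute are hypothetical names, my real class differs -- but the lseek semantics are the part that matters:

```python
import os

class SparseCacheSketch:
    # hypothetical stand-in for the cache class; only the lseek wrapper matters
    def __init__(self, fd):
        self.fd = fd  # file descriptor of the sparse cache file

    def _next_sparse(self, off, region):
        # region is os.SEEK_HOLE or os.SEEK_DATA. os.lseek returns the
        # offset of the next byte at or after `off` that lies in a region
        # of that type -- including `off` itself when it already qualifies.
        try:
            return os.lseek(self.fd, off, region)
        except OSError:
            # seeking at or past EOF raises ENXIO; treat it as end of file
            return os.fstat(self.fd).st_size
```

note that on a file with no holes, SEEK_HOLE still reports the implicit hole at EOF, so the wrapper never returns an offset past the file size.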
(the caller of read_many needs to provide the offsets and lengths of the requested data already). [it's also reasonable to add an index to the cache, although it would add a concern of syncing sparsity to manage disk space. a sparse file is a special kind of file where regions of the file do not occupy disk space and cannot contain nonzero data; these regions are called holes, and they can be written to, which uses more disk space when done.] here's the current head of read_many:

```python
def read_many(self, offset_lengths, progress, validate_sorted=True):
    if validate_sorted:
        # offset_lengths is a tensor; compare rows as tuples so sorting works
        rows = [tuple(row) for row in offset_lengths.tolist()]
        assert rows == sorted(rows)
    OP_FETCH = 1
    OP_PLACE = 2
    OP_OUTPUT = 4
    offset_length_tail_idx_ops = torch.zeros([offset_lengths.shape[0]*2, 5])
    OFFSET, LENGTH, TAIL, IDX, OP = range(offset_length_tail_idx_ops.shape[-1])
    op_ct = 0
```

my idea for prototyping was to have the structure offset_length_tail_idx_ops contain, in one parallel block, all the data needed to operate. so first this structure is filled with data in batches using vector operations, and then as much of it as possible is operated on at once. the abstraction bounds can change; it's just what i have atm, and obviously everything's confusing and slow for me, so i'm working off it part by part. the name `offset_length_tail_idx_ops` directly represents the content of the rows, in order. this is a convention i use to code using less of my working memory. i'm happy to add or remove data in this structure, and i do a find/replace to update the name when i do so. the next step i was planning to implement involved filling this structure with `OP_OUTPUT` entries for each of the `offset_lengths`, specifying what values to output back to the user. this is maybe not necessary, but it focuses me on the next logical concern -- ensuring uncached data is properly fetched, whereas cached data is not.
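to make that OP_OUTPUT fill concrete for myself, here's a standalone sketch of what the vectorized version might look like. `fill_output_ops` and the `file_size` parameter are made-up names for the sketch; in the real code this would live inside read_many and clamp against self.size():

```python
import torch

OP_FETCH, OP_PLACE, OP_OUTPUT = 1, 2, 4
OFFSET, LENGTH, TAIL, IDX, OP = range(5)

def fill_output_ops(offset_lengths, file_size):
    # offset_lengths: [N, 2] long tensor of sorted [offset, length] rows
    n = offset_lengths.shape[0]
    # room for up to two ops per request (e.g. a fetch plus an output)
    ops = torch.zeros([n * 2, 5], dtype=torch.long)
    offsets = offset_lengths[:, 0]
    tails = torch.clamp(offsets + offset_lengths[:, 1], max=file_size)
    ops[:n, OFFSET] = offsets
    ops[:n, LENGTH] = tails - offsets   # lengths clamped at EOF
    ops[:n, TAIL] = tails
    ops[:n, IDX] = torch.arange(n)
    ops[:n, OP] = OP_OUTPUT
    return ops, n   # n rows used so far, i.e. op_ct
```

the whole OP_OUTPUT block is written with one slice assignment per column, which is the "fill in batches with vector operations" shape i was describing.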
so maybe not quite needed yet. i've been confused all day around `missing_idcs = (next_hole < tails[idx:]).nonzero()[:,0]`, despite having written this line earlier myself. [after success on finding region bounds]. given the loop is at `idx`, i gotta figure out which `idcs` to consider cached, which to fetch, and how to organize that into a larger loop that repeats. i'm just getting really confused around that simple concern.
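to try to untangle it, here's a plain non-batched sketch of the partition i believe the batched line is trying to compute. `classify_requests` and its `next_sparse` callable are made-up names standing in for `self._next_sparse`; the point the sketch surfaces is that the hole offset has to be re-probed as the request offset advances, whereas my batched line reuses the `next_hole` found at `idx` for every later tail, which wrongly flags requests that sit in a later data region:

```python
import os

def classify_requests(offset_tails, next_sparse):
    """Partition sorted, aligned [offset, tail) requests into cached vs missing.

    next_sparse(off, region) behaves like os.lseek with os.SEEK_HOLE /
    os.SEEK_DATA: it returns the offset of the next byte at or after `off`
    in a region of the given type, returning `off` itself when `off`
    already lies in such a region.
    """
    cached, missing = [], []
    next_hole = next_sparse(0, os.SEEK_HOLE)
    for i, (off, tail) in enumerate(offset_tails):
        if next_hole < off:
            # the cursor is stale for this request; re-probe from its offset
            next_hole = next_sparse(off, os.SEEK_HOLE)
        # fully cached iff the first hole at/after `off` starts at or past `tail`
        if next_hole >= tail:
            cached.append(i)
        else:
            missing.append(i)
    return cached, missing
```

the batched version would then process runs of consecutive `missing` indices together (the "chunks of idcs" concern), advancing with SEEK_DATA to find where cached data resumes.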