how to best organise a large community "important" books library?

James A. Donald jamesd at echeque.com
Fri May 26 20:23:41 PDT 2017


On 2017-05-17 17:08, Zenaan Harkness wrote:
> Has anyone done anything like this, and if so, how did you solve it?
>
> (Medium term, the problem begs for a git-like addressed, git-like P2P
> addressable/verifiable "content management" solution; e.g. if I have
> a random collection of say 10K books, it would be nice to be able to
> say something like:
>    git library init
>
>    # add my books:
>    git add .
>    git commit
>
>    git library sort categories
>    git library add index1 # create an "indexing" branch
>    git commit
>
>    # add some upstream/p2p libraries for indexing/search/federation:
>    git library index add upstream:uni.berkely
>    git library p2p   add i2p.tunnel:myfriend.SHA
>    git library full-index pull myfriend.SHA
>    git library math-index pull myfriend.SHA
>    git library book:SHA   pull myfriend.SHA

This is the problem of clustering in spaces of enormously high 
dimension, which is a well-studied problem in AI.

Take each substring of six words or fewer that does not contain a full 
stop internally.  The substring may contain a start-of-sentence marker 
at the beginning, and/or an end-of-sentence marker at the end.
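
A minimal sketch of that extraction step, in Python.  The marker strings 
"<s>" and "</s>", the lowercasing, and the choice to count a marker 
toward the six-word limit are all assumptions for illustration, not part 
of the description above.

```python
START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def substrings(text, max_words=6):
    """Yield every run of up to max_words tokens that stays inside one sentence."""
    # Splitting on the full stop guarantees no substring contains one internally.
    for sentence in text.split("."):
        words = sentence.lower().split()  # lowercasing is an assumption
        if not words:
            continue
        tokens = [START] + words + [END]
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_words, len(tokens)) + 1):
                piece = tokens[i:j]
                # A marker on its own carries no information, so skip it.
                if piece == [START] or piece == [END]:
                    continue
                yield " ".join(piece)


if __name__ == "__main__":
    for s in substrings("A b. C"):
        print(s)
```

Substrings never cross a sentence boundary, so "b c" is not produced 
from the example text, while "<s> a b </s>" is.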

Each substring constitutes a vector in a space of enormously high 
dimension, the space of all possible strings of moderate length.

Each such vector is pseudo-randomly but deterministically mapped into a 
space of moderately high dimension, a mere hundred or so dimensions, by 
giving each coordinate the value plus one, minus one, or zero.

Two identical substrings in two separate documents will be mapped to the 
same vector in the space of moderate dimension.
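
One way to get a mapping that looks random but is the same on every 
machine is to derive the coordinates from a cryptographic hash of the 
substring.  The sketch below assumes SHA-256, exactly 100 dimensions, and 
equal probability for the three values; none of these choices comes from 
the description above.

```python
import hashlib

DIM = 100  # "a hundred or so" dimensions; the exact figure is an assumption


def substring_vector(s, dim=DIM):
    """Map a substring to a fixed vector with coordinates in {-1, 0, +1}."""
    vec = []
    counter = 0
    while len(vec) < dim:
        # Hash the substring with a counter to get as many bytes as needed.
        digest = hashlib.sha256(f"{counter}:{s}".encode("utf-8")).digest()
        for byte in digest:
            if byte == 255:
                continue  # drop one value so the other 255 split evenly three ways
            vec.append(byte % 3 - 1)
            if len(vec) == dim:
                break
        counter += 1
    return vec


if __name__ == "__main__":
    print(substring_vector("<s> the cat sat")[:10])
```

Because the vector depends only on the substring, two parties who never 
communicate still compute the same vector for the same substring.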

Each document is then given a position in the space of moderately high 
dimension by summing all the vectors, and then normalizing the resulting 
vector to length one.

Thus each document is assigned a position on the hypersphere in the 
space of moderately high dimension.
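
The summing and normalizing step can be sketched as follows.  The 
function name document_position is hypothetical, and it takes the 
already-mapped substring vectors as plain lists of numbers.

```python
import math


def document_position(vectors):
    """Sum the substring vectors of one document and scale the sum to length one."""
    total = None
    for v in vectors:
        if total is None:
            total = list(v)
        else:
            total = [a + b for a, b in zip(total, v)]
    if total is None:
        raise ValueError("document has no substrings")
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0:
        return total  # degenerate case: the vectors cancelled out exactly
    return [x / norm for x in total]


if __name__ == "__main__":
    # Toy three-dimensional vectors stand in for the hundred-dimensional ones.
    print(document_position([[1, 0, -1], [1, 1, 0]]))
```

Since every position has length one, the dot product of two positions 
is the cosine of the angle between the two documents.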

If two different documents contain some identical substrings, they will 
tend to be closer together.

We then assign each document to its closest group of documents, then to 
the closest subgroup within that group, and then to the closest 
sub-subgroup.
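
A sketch of the assignment step, assuming the groups already exist as a 
tree in which each group has a unit-length centre and a dictionary of 
subgroups.  How that tree is built is not covered here, and the group 
names and centres below are made up for illustration.

```python
def dot(a, b):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def assign(position, groups, path=()):
    """Descend the tree, picking the closest centre at each level.

    groups maps a name to a (centre, subgroups) pair, where subgroups has
    the same shape and is empty at the bottom of the tree.
    """
    if not groups:
        return path
    best = max(groups, key=lambda name: dot(position, groups[name][0]))
    _centre, subgroups = groups[best]
    return assign(position, subgroups, path + (best,))


if __name__ == "__main__":
    # Hypothetical two-dimensional tree; real centres would have ~100 coordinates.
    tree = {
        "math": ([1.0, 0.0], {
            "algebra": ([0.8, 0.6], {}),
            "geometry": ([0.8, -0.6], {}),
        }),
        "history": ([0.0, 1.0], {}),
    }
    print(assign([0.8, 0.6], tree))
```

The descent is greedy: a document that sits near the boundary between 
two top-level groups is never compared against the subgroups of the 
group it did not choose.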



