On Sat, May 27, 2017 at 01:23:41PM +1000, James A. Donald wrote:
On 2017-05-17 17:08, Zenaan Harkness wrote:
Has anyone done anything like this, and if so, how did you solve it?
(Medium term, the problem begs for a git-like, content-addressed, P2P-addressable/verifiable "content management" solution; e.g. if I have a random collection of say 10K books, it would be nice to be able to say something like:

git library init
# add my books:
git add .
git commit
git library sort categories
git library add index1   # create an "indexing" branch
git commit
# add some upstream/p2p libraries for indexing/search/federation:
git library index add upstream:uni.berkeley
git library p2p add i2p.tunnel:myfriend.SHA
git library full-index pull myfriend.SHA
git library math-index pull myfriend.SHA
git library book:SHA pull myfriend.SHA
This is the problem of clustering into groups in a space of enormously high dimension, which is a well-studied problem in AI.
Take each substring of six words or fewer that does not contain a full stop internally. The substring may contain a start-of-sentence marker at the beginning and/or an end-of-sentence marker at the end.
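A minimal Python sketch of that step, assuming whitespace tokenisation, a full stop as the only sentence boundary, and <s>/</s> as arbitrary marker names (in this sketch the markers count toward the token limit):

def substrings(text, max_words=6):
    # Enumerate substrings of up to max_words tokens that never cross a
    # full stop; a substring may begin with the <s> marker and/or end
    # with the </s> marker.
    out = []
    for sentence in text.split('.'):
        words = ['<s>'] + sentence.split() + ['</s>']
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_words, len(words)) + 1):
                out.append(' '.join(words[i:j]))
    return out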
Each substring constitutes a vector in a space of enormously high dimension, the space of all possible strings of moderate length.
Each such vector is randomly but deterministically mapped to a mere hundred or so dimensions, to a space of moderately high dimension, by randomly giving each coordinate the value of plus one, minus one, or zero.
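The post does not say how the random values are derived deterministically; one common way is to hash each (coordinate, substring) pair, as in this rough Python sketch (the dimension of 100 and the equal odds of -1, 0 and +1 are assumptions):

import hashlib

DIM = 100  # "a mere hundred or so dimensions"

def substring_vector(s, dim=DIM):
    # Map a substring to a deterministic pseudo-random vector whose
    # coordinates are each -1, 0 or +1.  Hashing the (coordinate index,
    # substring) pair makes the mapping look random but stay identical
    # across documents.
    vec = []
    for i in range(dim):
        h = hashlib.sha256(f'{i}:{s}'.encode()).digest()
        vec.append(h[0] % 3 - 1)   # 0, 1, 2 -> -1, 0, +1
    return vec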
Two identical substrings in two separate documents will be mapped to the same vector in the space of moderate dimension.
Each document is then given a position in the space of moderately high dimension by summing all the vectors, and then normalizing the resulting vector to length one.
Thus each document is assigned a position on the hypersphere in the space of moderately high dimension.
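Continuing the sketch, reusing substrings() and substring_vector() from above:

import math

def document_vector(text):
    # Sum the vectors of every substring in the document, then normalise
    # to unit length, placing the document on the hypersphere.
    total = [0.0] * DIM
    for s in substrings(text):
        v = substring_vector(s)
        for i in range(DIM):
            total[i] += v[i]
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]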
If two different documents contain some identical substrings, their positions will tend to be closer together.
We then assign a document to its closest group of documents, then to the closest subgroup within that group, then to the closest sub-subgroup, and so on.
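A rough sketch of that descent, assuming groups is a nested dict of the form {name: (centroid_vector, subgroups_dict)} with an empty dict at the leaves; how the centroids themselves are formed (e.g. by clustering the document vectors) is not specified here:

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def assign(doc_vec, groups):
    # Walk down the hierarchy: at each level pick the group whose centroid
    # has the largest dot product with the document vector (on unit vectors
    # this is cosine similarity, i.e. "closest" on the hypersphere).
    path = []
    level = groups
    while level:
        name, (centroid, subgroups) = max(
            level.items(), key=lambda kv: dot(doc_vec, kv[1][0]))
        path.append(name)
        level = subgroups
    return path

Calling assign(document_vector(text), groups) then yields the path group / subgroup / sub-subgroup for that document.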
Interesting. Algorithmic auto-categorizing. Sounds like it has potential to create at least one useful index.

The main bit I don't like about Library of Congress, Dewey and other "physical library" categorization schemes is that they are evidently "optimized" for physical storage, and so they arbitrarily group categories which are not directly related. But even related categories might as well be "different categories" in the digital world - an extra folder/dir is relatively inexpensive compared with "another physical [sub-]shelf and labelling which we want to be relatively stable --in physical space-- over time, in order to minimize shuffling/ reorganizing of the physical storage of books (/maps /etc)".

Categories (and sub-, sub-sub- etc.) are definitely useful to folks, and ACM, some maths journal groups, etc., have typically created four levels of categories + sub-categories --just for their field--. Again, we have no physical limits in virtual categorization space, and to top it off, with something git-like we can have as many categorizations (directory hierarchies) as people find useful - so some auto-algo thing as you suggest, LOC, Dewey, and "no useful sub-category excluded" might all be wanted by different people - so a content item exists and is GUID addressed, and then one or more indexes/folder hierarchies overlay on top of this.

And indeed one content item such as a book may well be appropriately included in more than one category/ sub-category/ sub-sub-category/ sub-sub-sub-category --in a single chosen hierarchy-- in any particular library.
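To make that overlay idea concrete, a purely hypothetical Python sketch (none of these names correspond to an existing tool): items are stored once, keyed by content hash (the GUID), and each categorization scheme is just a separate mapping of category paths to sets of those hashes.

import hashlib

library = {}       # content hash -> raw bytes (the single store)
hierarchies = {}    # scheme name  -> {category path tuple -> set of hashes}

def add_item(data):
    # Store the item once, addressed by its content hash.
    guid = hashlib.sha256(data).hexdigest()
    library[guid] = data
    return guid

def categorise(scheme, path, guid):
    # The same guid may appear under many paths and many schemes.
    hierarchies.setdefault(scheme, {}).setdefault(tuple(path), set()).add(guid)

# e.g. one book filed under two schemes, and twice within one of them:
book = add_item(b'...contents of some maths book...')
categorise('dewey', ['500', '510'], book)
categorise('auto-cluster', ['group3', 'sub7'], book)
categorise('auto-cluster', ['group5', 'sub1'], book)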