how to best organise a large community "important" books library?

Wed May 17 04:46:01 PDT 2017

On Wed, May 17, 2017 at 03:34:43AM -0500, Shawn K. Quinn wrote:
> On 05/17/2017 03:14 AM, Ben Tasker wrote:
> > I guess your best bet would probably be to approach it like a physical
> > library does.
> > 
> > Divide by broad category/genre (so separate fiction and non-fiction,
> > subdivide that into health etc)
> > 
> > Then divide further by Author's name, perhaps dividing that further by
> > the first two chars of the author's name
> 
> Dewey Decimal System, anyone? If it works for current real-world
> libraries, why wouldn't it work for something like this?

Maybe. A directory name based on the Dewey number could be a
candidate for the "canonical content-item directory", although it may
or may not be the ideal choice.

Perhaps there's no reason to NOT have a human-readable "canonical
location" - i.e.:
   category/sub-category/author/item/files...
?

This is so that any part of the filesystem can be readily copied, and
be useful and easily searchable in its own right without any
meta-data aggregation/processing...

I'm sure there are many useful types of indexing, e.g. ISBN, author,
title, and any meta data that users could find useful.

Although there might be a lot of "library-specific numbers" like
Dewey (is this universal like ISBN, or library specific?), which
could also be useful in a global book indexing system, each is simply
additional meta data.

It should be possible to add meta data to any item at any time, which
suggests a git-backed item + meta data storage system.
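As a rough sketch of "add meta data to any item at any time", assuming
each item directory holds one JSON file of free-form fields: the file
name meta.json, the field names and both helper functions are
hypothetical, not something decided in this thread. The commit helper
assumes `git init` has already been run in the item directory.

```python
import json
import subprocess
from pathlib import Path

META_NAME = "meta.json"  # hypothetical file name, one per item directory


def add_metadata(item_dir, **fields):
    """Merge new fields into the item's meta data file and return the result."""
    meta_file = Path(item_dir) / META_NAME
    if meta_file.exists():
        meta = json.loads(meta_file.read_text(encoding="utf-8"))
    else:
        meta = {}
    meta.update(fields)  # later additions overwrite earlier values for the same key
    text = json.dumps(meta, indent=2, sort_keys=True, ensure_ascii=False)
    meta_file.write_text(text + "\n", encoding="utf-8")
    return meta


def commit_metadata(item_dir, message):
    """Record the change in the item's own git repository (assumed to exist)."""
    subprocess.run(["git", "-C", str(item_dir), "add", META_NAME], check=True)
    subprocess.run(["git", "-C", str(item_dir), "commit", "-m", message], check=True)
```

Keeping the file sorted and one key per line should keep git diffs
small when someone adds a single field later.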

A gitfs/git-fs FUSE-type system could also be more efficient: a --bare
repository is just fine if it can also be browsed/copied/etc. like a
normal filesystem, and with TiB of books, users don't want to
duplicate everything just so they can access their (git) library via
the filesystem.

Besides fiction/non-fiction top level category, we could also have
floss/public domain vs proprietary/still in copyright.

This would provide for separation of content which may well need to
be handled in different ways - freely distributable without obvious
legal threat possibilities, vs. "non-libre, but stored as part of my
doomsday prepping library on the principle that such usage is fair
use" for example :)

The only other "primary" category I can think of is language - i.e.
which language a book is written in (and of course some dictionaries
and "Learn Russian" type books are written in more than one language,
although they presumably always target a "primary" language).

So this would give us a grand "canonical content indexing" hierarchy
of:

$LANG/
   floss/
      fiction/
         $CATEGORY/
            $SUB-CATEGORY/
               $AUTHOR/
                  $BOOK/
         ...
      non-fiction/
         $CATEGORY/
            $SUB-CATEGORY/
               $AUTHOR/
                  $BOOK/
         ...

   proprietary/
      fiction/
         $CATEGORY/
            $SUB-CATEGORY/
               $AUTHOR/
                  $BOOK/
         ...
      non-fiction/
         $CATEGORY/
            $SUB-CATEGORY/
               $AUTHOR/
                  $BOOK/
         ...

as top level directories.
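A minimal sketch of turning an item's details into a directory under
this hierarchy. The function names, the lower-case/hyphen naming rule
and the example values are my assumptions for illustration only; the
scheme above does not say how names should be normalised.

```python
from pathlib import PurePosixPath

LICENCES = {"floss", "proprietary"}
KINDS = {"fiction", "non-fiction"}


def _slug(label):
    """Lower-case a label and turn runs of non-alphanumerics into one '-'."""
    cleaned = "".join(c if c.isalnum() else "-" for c in label.strip().lower())
    slug = "-".join(part for part in cleaned.split("-") if part)
    if not slug:
        raise ValueError(f"cannot build a directory name from {label!r}")
    return slug


def canonical_path(lang, licence, kind, category, sub_category, author, book):
    """Build $LANG/licence/kind/$CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK."""
    if licence not in LICENCES:
        raise ValueError(f"licence must be one of {sorted(LICENCES)}")
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {sorted(KINDS)}")
    return PurePosixPath(
        lang, licence, kind,
        _slug(category), _slug(sub_category), _slug(author), _slug(book),
    )


# Hypothetical item, purely to show the shape of the result.
print(canonical_path("en", "floss", "non-fiction",
                     "Computing", "Version Control", "Jane Doe", "An Example Book"))
```

Rejecting unknown licence/kind values up front keeps the top levels of
every copy of the library identical, which is what makes partial
copies mergeable.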

Discussed this just now with a friend and realised that humans
genuinely will have interests such as "yeah, give me 500GiB of
programming books, what the heck, but nah, not interested in all the
natural sciences" vs "hey, math, physics, chem, yes please, my
doomsday prepper library absolutely must have them all, along with
engineering and survival books".

So the most end-user friendly "canonical directory hierarchy"
probably requires a lot of sub-categories for non-fiction at least.

Doing so also avoids ending up with 20,000 (or more) directories
(authors, in the above scheme) in a single category.
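One way to find out where more sub-categories are needed is simply to
count author directories. This is a small sketch assuming the
$LANG/licence/kind/$CATEGORY/$SUB-CATEGORY/$AUTHOR layout above; the
function name and the default limit are illustrative, not agreed
values.

```python
from pathlib import Path


def crowded_subcategories(root, limit=20000):
    """Return {sub-category path: author count} for counts above `limit`."""
    root = Path(root)
    crowded = {}
    # Five levels down from the root is $SUB-CATEGORY; its children are authors.
    for sub in root.glob("*/*/*/*/*"):
        if not sub.is_dir():
            continue
        authors = sum(1 for child in sub.iterdir() if child.is_dir())
        if authors > limit:
            crowded[sub.relative_to(root).as_posix()] = authors
    return crowded
```

Any sub-category it reports is a candidate for being split further
before the directory becomes unpleasant to browse.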

Also, with the aim of distributing the workload by making the job:
   -  easy to comprehend
   -  easy to do
   -  easy to share

we will create the ultimate distributed book/useful-content library
index/meta data store, very naturally and easily, just as Wikipedia
started small, and to quite some derision in certain quarters.

With multi-TiB HDDs, we are close to the time when one person can
indeed store the world's "relevant" (by his personal definition)
knowledge.

A little research on Wikipedia shows there are in fact impressively
many fields of science and academia which can provide useful
subdivisions for "all useful books", see e.g.:

   https://en.wikipedia.org/wiki/Outline_of_science
   https://en.wikipedia.org/wiki/Outline_of_academic_disciplines

   https://en.wikipedia.org/wiki/Branches_of_science
   https://en.wikipedia.org/wiki/Index_of_branches_of_science
   https://en.wikipedia.org/wiki/List_of_sciences

   https://en.wikipedia.org/wiki/Formal_sciences
   https://en.wikipedia.org/wiki/Outline_of_natural_science
   https://en.wikipedia.org/wiki/Social_science

Finally, looking at:
   https://git.wiki.kernel.org/index.php/SubprojectSupport

and thinking about things, it seems that git on a per-item basis, and
possibly git-subproject for a category or sub-category basis, may
work. Especially with --bare repos and some git fuse thing.

But whether or not git can scale as needed, a library indexing system
running over the top of this is certain to be useful, undoubtedly
with useful guidance from the notmuch search libraries.

Steps:
   1. decide on canonical directory structure
   2. decide on "important"/suggested meta data fields
   3. decide if git per item dir is useful/optimal
   4. document the above
   5. begin to store books according to this canonical format
   6. start sharing + build international library coalition

(Words like "coalition" make it sound bigger and better.)

medium/longer term:
   7. write indexing tools to aggregate information/meta data files
   8. support other meta data formats
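
For step 7, a first indexing tool could be as small as the sketch
below. It assumes each item directory carries a JSON meta data file;
the file name meta.json, the "isbn" key and the function name are
hypothetical placeholders until steps 1 and 2 are actually decided.

```python
import json
from pathlib import Path

META_NAME = "meta.json"  # hypothetical per-item meta data file


def build_index(root, key="isbn"):
    """Map each value of `key` to the item directories that declare it."""
    root = Path(root)
    index = {}
    for meta_file in sorted(root.rglob(META_NAME)):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed files rather than abort
        if not isinstance(meta, dict):
            continue
        value = meta.get(key)
        if value is not None:
            item_dir = meta_file.parent.relative_to(root).as_posix()
            index.setdefault(str(value), []).append(item_dir)
    return index
```

Because it only walks the filesystem, the same function works on any
copied sub-tree, e.g. just one category of the library.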