On Wed, May 17, 2017 at 03:34:43AM -0500, Shawn K. Quinn wrote:
> On 05/17/2017 03:14 AM, Ben Tasker wrote:
> > I guess your best bet would probably be to approach it like a physical library does.
> > Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
> > Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
>
> Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this?

Maybe. A directory name based on the Dewey number could be a candidate for the "canonical content-item directory", although it may or may not be the ideal choice.

Perhaps there's no reason NOT to have a human-readable "canonical location", i.e.:

  category/sub-category/author/item/files...

This is so that any part of the filesystem can be readily copied, and be useful and easily searchable in its own right without any meta-data aggregation/processing...

I'm sure there are many useful types of indexing, e.g. ISBN, author, title, and any meta data that users could find useful. Although there might be a lot of "library-specific numbers" like Dewey (is this universal like ISBN, or library specific?), which could also be useful in a global book indexing system, each is simply additional meta data.

Meta data should be able to be added to any item at any time -> git backed item+meta data storage system.

A gitfs/git-fs FUSE type system could also be more efficient - a --bare repository is just fine if it can also be browsed/copied/etc like a normal filesystem, and with TiB of books, users don't want to duplicate everything just so they can access their (git) library via the filesystem.

Besides the fiction/non-fiction top level category, we could also have floss/public domain vs proprietary/still in copyright. This would provide for separation of content which may well need to be handled in different ways - freely distributable without obvious legal threat possibilities, vs. "non-libre, but stored as part of my doomsday prepping library on the principle that such usage is fair use" for example :)

The only other "primary" category I can think of is language - i.e. which language a book is written in (and of course some dictionaries and "Learn Russian" type books are written in more than one language, although they presumably always target a "primary" language).

So this would give us a grand "canonical content indexing" hierarchy of:

  $LANG/
    floss/
      fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
      non-fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
    proprietary/
      fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
      non-fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...

as top level directories.

Discussed this just now with a friend and realised that humans genuinely will have interests such as "yeah, give me 500GiB of programming books, what the heck, but nah, not interested in all the natural sciences" vs "hey, math, physics, chem, yes please, my doomsday prepper library absolutely must have them all, along with engineering and survival books".

So the most end-user friendly "canonical directory hierarchy" probably requires a lot of sub-categories for non-fiction at least. And doing so means something better than 20,000 (or more) directories (authors in the above scheme) in one category.

Also with the aim of distributing the work load by making the job:

 - easy to comprehend
 - easy to do
 - easy to share

we will create the ultimate distributed book/useful-content library index/meta data store, very naturally and easily, just like wikipedia started small and to quite some derision in certain quarters.

With multi-TiB HDDs, we are close to the time when one person can indeed store the world's "relevant" (by his personal definition) knowledge.
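
To make the proposed hierarchy concrete, here is a rough Python sketch of how a tool might turn an item's meta data into its canonical directory. Nothing here is decided: the function names (slug, canonical_path), the lower-case-with-hyphens naming rule, and the exact spellings "floss"/"proprietary" and "fiction"/"non-fiction" as fixed values are all just assumptions for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed fixed values for the two "primary" splits discussed above.
LICENCES = {"floss", "proprietary"}
KINDS = {"fiction", "non-fiction"}


def slug(text):
    """Hypothetical naming rule: lower-case, runs of punctuation/space -> '-'."""
    cleaned = re.sub(r"[\W_]+", "-", text.strip().lower()).strip("-")
    if not cleaned:
        raise ValueError(f"cannot build a directory name from {text!r}")
    return cleaned


def canonical_path(lang, licence, kind, category, sub_category, author, book):
    """Return $LANG/$LICENCE/$KIND/$CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK."""
    if licence not in LICENCES:
        raise ValueError(f"licence must be one of {sorted(LICENCES)}")
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {sorted(KINDS)}")
    free_text = [slug(p) for p in (category, sub_category, author, book)]
    return PurePosixPath(lang, licence, kind, *free_text)


if __name__ == "__main__":
    print(canonical_path(
        "en", "floss", "non-fiction",
        "Computer Science", "Algorithms",
        "Knuth, Donald E.", "The Art of Computer Programming",
    ))
```

Keeping the path a pure function of the meta data means any sub-tree can still be copied on its own and stays human-readable, which was the point of having a canonical location in the first place.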

A little research on the wiki shows there are in fact impressively many fields of science and academia which can provide useful subdivisions for "all useful books", see e.g.:

  https://en.wikipedia.org/wiki/Outline_of_science
  https://en.wikipedia.org/wiki/Outline_of_academic_disciplines
  https://en.wikipedia.org/wiki/Branches_of_science
  https://en.wikipedia.org/wiki/Index_of_branches_of_science
  https://en.wikipedia.org/wiki/List_of_sciences
  https://en.wikipedia.org/wiki/Formal_sciences
  https://en.wikipedia.org/wiki/Outline_of_natural_science
  https://en.wikipedia.org/wiki/Social_science

Finally, looking at:

  https://git.wiki.kernel.org/index.php/SubprojectSupport

and thinking about things, it seems that git on a per-item basis, and possibly git-subproject for a category or sub-category basis, may work. Especially with --bare repos and some git fuse thing.

But whether or not git can scale as needed, a library indexing system running over the top of this is certain to be useful, undoubtedly with useful guidance from the notmuch search libraries.

Steps:

 1. decide on canonical directory structure
 2. decide on "important"/suggested meta data fields
 3. decide if git per item dir is useful/optimal
 4. document the above
 5. begin to store books according to this canonical format
 6. start sharing + build international library coalition
    (Words like "coalition" make it sound bigger and better.)

medium/longer term:

 7. write indexing tools to aggregate information/meta data files
 8. support other meta data formats
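
As a taste of what step 7 might look like, here is a minimal Python sketch of an aggregator that walks a library tree and builds lookup tables from per-item meta data files. It assumes things that are not yet decided: that each item directory holds a JSON file called "meta.json" (a made-up name), and that "isbn", "author" and "title" are among the suggested fields. The function name build_index is likewise hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path

META_NAME = "meta.json"  # assumed per-item meta data file name


def build_index(root, fields=("isbn", "author", "title")):
    """Map each field value to the item directories (relative to root) using it."""
    root = Path(root)
    index = {field: defaultdict(list) for field in fields}
    for meta_file in sorted(root.rglob(META_NAME)):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed meta data
        if not isinstance(meta, dict):
            continue
        item_dir = meta_file.parent.relative_to(root).as_posix()
        for field in fields:
            value = meta.get(field)
            if value is not None:
                index[field][str(value)].append(item_dir)
    return index


if __name__ == "__main__":
    import sys

    library_root = sys.argv[1] if len(sys.argv) > 1 else "."
    for author, items in sorted(build_index(library_root)["author"].items()):
        print(f"{author}: {len(items)} item(s)")
```

Because the index is rebuilt purely from files inside the tree, a copied sub-tree (say, just the programming books) can be indexed on its own without any central database.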