how to best organise a large community "important" books library?
So you're a doomsday prepper and you intend to throw a few dollars at a multi-TiB "restart civilization" digital book library with a few million books, a few laptops and some solar cells to power them. How best to hierarchically organise the group/community doomsday digital book library?

Say 5 million books, each in its own directory (for any cover art, PDF and/or text versions of the book, perhaps audio, code, etc). Debian has ~40K packages, and divides these into {a..z}* and lib{a..z}*, for roughly 60 "top level" directories. Each package has its subdirectory in one of these. The largest directories hold about 2K to 3K entries (subdirectories). However, when prepping for a few million books, we have around 1000 times as many "packages"/books to store, which could mean e.g. 50,000 or more in the popular "s" directory for example - probably a little more than is sane.

"Formal" library software (e.g. Evergreen) may be the answer for repo-wide searchability of metadata, but when others share the interest, they may only be able to store, and/or may only be particularly interested in, a sub-category or two such as health and agriculture/gardening, so it might be best anyway to use a handful of top-level "general categories" to reduce our maximums down from 50K books per dir to 10-fold fewer at least.

Has anyone done anything like this, and if so, how did you solve it?

(Medium term, the problem begs for a git-like addressed, git-like P2P addressable/verifiable "content management" solution; e.g. if I have a random collection of say 10K books, it would be nice to be able to say something like:

  git library init
  # add my books:
  git add .
  git commit
  git library sort categories
  git library add index1
  # create an "indexing" branch
  git commit
  # add some upstream/p2p libraries for indexing/search/federation:
  git library index add upstream:uni.berkeley
  git library p2p add i2p.tunnel:myfriend.SHA
  git library full-index pull myfriend.SHA
  git library math-index pull myfriend.SHA
  git library book:SHA pull myfriend.SHA
)

Perhaps this has not really been done before?
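The "git-like addressable/verifiable" idea above rests on content addressing: every object is named by a hash of its bytes, so any peer can verify a pulled book against its ID, offline. A minimal sketch of git's own blob-naming scheme (the `git library` commands above are hypothetical, but this hash is exactly what real git computes):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the ID git itself assigns to a file's bytes:
    SHA-1 over a "blob <size>\\0" header plus the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Any two peers holding the same bytes derive the same ID, so a book
# pulled over P2P can be verified without trusting the sender.
print(git_blob_id(b"hello\n"))
```

This matches `echo hello | git hash-object --stdin`, which is why a git-backed library gets verifiability "for free".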
On Wed, May 17, 2017 at 8:08 AM, Zenaan Harkness <zen@freedbms.net> wrote:
"Formal" library software (e.g. Evergreen) may be the answer for repo wide searchability of meta data, but when others share interest, they may only be able to store, and or may be only particularly interested in a sub-category or two such as health and agriculture/gardening, so it might be best anyway to use a handful of top level "general categories" to reduce our maximums down from 50K books per dir, to 10 fold fewer at least.
I guess your best bet would probably be to approach it like a physical library does:

- Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc.)
- Then divide further by author's name, perhaps dividing that further by the first two chars of the author's name.

But it'd get complicated quite quickly (particularly if you don't know the author's name), so you'd want some sort of index available to do metadata-based searching too. I think that's probably going to be hard to avoid with a substantial number of books though.

--
Ben Tasker
https://www.bentasker.co.uk
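The two-character author split above is easy to sketch. A minimal version, assuming one layout of category/shard/author/title and "anonymous" for missing authors (both my own choices, not anything agreed in the thread):

```python
import re

def shelf_path(category, author, title):
    """Shard books under category/aa/author/title, where "aa" is the
    first two letters of the author's name (or of "anonymous")."""
    author = (author or "anonymous").strip().lower()
    # keep only letters for the shard key; pad very short names
    key = (re.sub(r"[^a-z]", "", author) + "xx")[:2]
    return "/".join([category, key, author, title])

print(shelf_path("non-fiction/health", "Tasker", "First Aid Basics"))
print(shelf_path("fiction", None, "Beowulf"))
```

With ~50K books in a category, 26x26 shards bring the worst directories down to a few hundred entries each, which is the point of the scheme.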
On 05/17/2017 03:14 AM, Ben Tasker wrote:
I guess your best bet would probably be to approach it like a physical library does.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this? -- Shawn K. Quinn <skquinn@rushpost.com> http://www.rantroulette.com http://www.skqrecordquest.com
On Wed, May 17, 2017 at 03:34:43AM -0500, Shawn K. Quinn wrote:
On 05/17/2017 03:14 AM, Ben Tasker wrote:
I guess your best bet would probably be to approach it like a physical library does.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this?
Maybe. A directory name based on the Dewey number could be a candidate for the "canonical content-item directory", although it may or may not be the ideal choice. Perhaps there's no reason NOT to have a human-readable "canonical location" - i.e.: category/sub-category/author/item/files... ? This is so that any part of the filesystem can be readily copied, and be useful and easily searchable in its own right, without any metadata aggregation/processing.

I'm sure there are many useful types of indexing, e.g. ISBN, author, title, and any metadata that users could find useful. Although there might be a lot of "library-specific numbers" like Dewey (is this universal like ISBN, or library-specific?), which could also be useful in a global book indexing system, each is simply additional metadata. Metadata should be able to be added to any item at any time -> a git-backed item + metadata storage system. A gitfs/git-fs FUSE type system could also be more efficient - a --bare repository is just fine if it can also be browsed/copied/etc like a normal filesystem; with TiB of books, users don't want to duplicate everything just so they can access their (git) library via the filesystem.

Besides the fiction/non-fiction top-level category, we could also have floss/public-domain vs proprietary/still-in-copyright. This would provide for separation of content which may well need to be handled in different ways - freely distributable without obvious legal threat possibilities, vs. "non-libre, but stored as part of my doomsday prepping library on the principle that such usage is fair use", for example :)

The only other "primary" category I can think of is language - i.e. which language a book is written in (of course some dictionaries and "Learn Russian" type books are written in more than one language, although they presumably always target a "primary" language).
So this would give us a grand "canonical content indexing" hierarchy of:

  $LANG/
    floss/
      fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
      non-fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
    proprietary/
      fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...
      non-fiction/
        $CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK/...

as top-level directories.

Discussed this just now with a friend and realised that humans genuinely will have interests such as "yeah, give me 500GiB of programming books, what the heck, but nah, not interested in all the natural sciences" vs "hey, math, physics, chem, yes please, my doomsday prepper library absolutely must have them all, along with engineering and survival books". So the most end-user-friendly "canonical directory hierarchy" probably requires a lot of sub-categories, for non-fiction at least. And doing so means something better than 20,000 (or more) directories (authors, in the above scheme) in one category.

Also, with the aim of distributing the work load by making the job:
- easy to comprehend
- easy to do
- easy to share
we will create the ultimate distributed book/useful-content library index/metadata store, very naturally and easily - just like Wikipedia started small, and to quite some derision in certain quarters. With multi-TiB HDDs, we are close to the time when one person can indeed store the world's "relevant" (by his personal definition) knowledge.
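That canonical layout can be sketched as a path-building function. Field names and ordering follow the hierarchy proposed above; the sample values are invented:

```python
def canonical_path(lang, licensing, kind, category, subcategory, author, book):
    """Build the canonical directory for one content item:
    $LANG/{floss|proprietary}/{fiction|non-fiction}/$CATEGORY/$SUB-CATEGORY/$AUTHOR/$BOOK
    Validates the two fixed-vocabulary levels so typos can't
    silently create parallel hierarchies."""
    if licensing not in ("floss", "proprietary"):
        raise ValueError("licensing must be floss or proprietary")
    if kind not in ("fiction", "non-fiction"):
        raise ValueError("kind must be fiction or non-fiction")
    return "/".join([lang, licensing, kind, category, subcategory, author, book])

print(canonical_path("en", "floss", "non-fiction",
                     "health", "first-aid", "anonymous", "field-surgery"))
```

One benefit of a function like this over ad-hoc paths: every contributor's subtree stays mergeable, because the levels always appear in the same order.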
A little research on the wiki shows there are in fact impressively many fields of science and academia which can provide useful subdivisions for "all useful books", see e.g.:

  https://en.wikipedia.org/wiki/Outline_of_science
  https://en.wikipedia.org/wiki/Outline_of_academic_disciplines
  https://en.wikipedia.org/wiki/Branches_of_science
  https://en.wikipedia.org/wiki/Index_of_branches_of_science
  https://en.wikipedia.org/wiki/List_of_sciences
  https://en.wikipedia.org/wiki/Formal_sciences
  https://en.wikipedia.org/wiki/Outline_of_natural_science
  https://en.wikipedia.org/wiki/Social_science

Finally, looking at:

  https://git.wiki.kernel.org/index.php/SubprojectSupport

and thinking about things, it seems that git on a per-item basis, and possibly git-subproject on a category or sub-category basis, may work - especially with --bare repos and some git FUSE thing. But whether or not git can scale as needed, a library indexing system running over the top of this is certain to be useful, undoubtedly with useful guidance from the notmuch search libraries.

Steps:
1. decide on canonical directory structure
2. decide on "important"/suggested metadata fields
3. decide if git per item dir is useful/optimal
4. document the above
5. begin to store books according to this canonical format
6. start sharing + build international library coalition
   (Words like "coalition" make it sound bigger and better.)

Medium/longer term:
7. write indexing tools to aggregate information/metadata files
8. support other metadata formats
On 05/17/2017 01:34 AM, Shawn K. Quinn wrote:
On 05/17/2017 03:14 AM, Ben Tasker wrote:
I guess your best bet would probably be to approach it like a physical library does.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this?
My mutha wuz a Noow yorc publik skool liberian. Do-e Desmal wurks, then therz https://en.wikipedia.org/wiki/Library_of_Congress_Classification Either sistem is hella ez 2 digitiz. Rr
On Wed, May 17, 2017 at 07:05:53AM -0700, Razer wrote:
On 05/17/2017 01:34 AM, Shawn K. Quinn wrote:
On 05/17/2017 03:14 AM, Ben Tasker wrote:
I guess your best bet would probably be to approach it like a physical library does.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this?
My mutha wuz a Noow yorc publik skool liberian. Do-e Desmal wurks, then therz
https://en.wikipedia.org/wiki/Library_of_Congress_Classification
Thanks - I'll check it out.

It starts:

  https://github.com/zenaan/large-digital-library
  git://github.com/zenaan/large-digital-library.git
You only need one book!

  https://www.goodreads.com/book/show/18114087-the-knowledge

Not true, but still a good read.

On 05/17/2017 04:29 PM, Zenaan Harkness wrote:
On Wed, May 17, 2017 at 07:05:53AM -0700, Razer wrote:
On 05/17/2017 01:34 AM, Shawn K. Quinn wrote:
On 05/17/2017 03:14 AM, Ben Tasker wrote:
I guess your best bet would probably be to approach it like a physical library does.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Dewey Decimal System, anyone? If it works for current real-world libraries, why wouldn't it work for something like this?
My mutha wuz a Noow yorc publik skool liberian. Do-e Desmal wurks, then therz
https://en.wikipedia.org/wiki/Library_of_Congress_Classification
Thanks - I'll check it out.
It starts:
https://github.com/zenaan/large-digital-library
git://github.com/zenaan/large-digital-library.git
On Wed, May 17, 2017 at 09:14:41AM +0100, Ben Tasker wrote:
On Wed, May 17, 2017 at 8:08 AM, Zenaan Harkness <zen@freedbms.net> wrote:
"Formal" library software (e.g. Evergreen) may be the answer for repo wide searchability of meta data, but when others share interest, they may only be able to store, and or may be only particularly interested in a sub-category or two such as health and agriculture/gardening, so it might be best anyway to use a handful of top level "general categories" to reduce our maximums down from 50K books per dir, to 10 fold fewer at least.
I guess your best bet would probably be to approach it like a physical library does.
Ack.
Divide by broad category/genre (so separate fiction and non-fiction, subdivide that into health etc)
Personally not interested in fiction, but of course some will be, and "the library" should work for everyone, including those who want to store all books in all languages :)
Then divide further by Author's name, perhaps dividing that further by the first two chars of the author's name
Sounds good - many authors write more than one book, and it's good to group by author, even though that will result in some authors with more than one directory (those who write books existing in more than one category).
But it'd get complicated quite quickly (particularly if you don't know the author's name)
And some books don't have an author - especially some of the older historical books, but I guess they could be grouped under "author: anonymous".
so you'd want some sort of index available
Definitely.
to do metadata based searching too. I think that's probably going to be hard to avoid with a substantial number of books though.
A principle is to have each item be self-contained, in its own directory, so different editions/versions and associated assets (covers, imagery, audio) can all be sanely grouped for that book - different editions should probably each have their own directory.

Rather than avoid an index, and based on the principle above, each "content item" will eventually have its own metadata file, and this, like every other file associated with a "content item", should be physically associated with the item - i.e., in the item's canonical directory. Then: automate the index creation by processing the metadata files, supplementing this with information gathered directly from the filesystem (file sizes, PDF page counts, file types (image, text, PDF etc) and everything else that can be auto-gathered). DRY - don't repeat yourself - so don't "write" metadata that is already in the filesystem, at the least filename and file size.

YAML, or something YAML-like, seems to be the nicest metadata format for humans to read and write, and is still very good for automated processing. I began this for my software repo years ago, but need to rewrite it.

Thanks,
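A minimal sketch of that "aggregate metadata files, supplement from the filesystem" step. It assumes one "meta.yaml"-style file of simple key: value lines per item directory; the parser below is a deliberate tiny subset of YAML, not a full reader, and the filename is my own choice:

```python
import os

def read_meta(path):
    """Parse simple "key: value" lines (a tiny YAML subset)."""
    meta = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta

def build_index(library_root):
    """Walk item directories; merge each item's metadata file with
    facts taken straight from the filesystem (DRY: file sizes are
    never hand-written into metadata)."""
    index = []
    for dirpath, _dirs, files in os.walk(library_root):
        if "meta.yaml" not in files:
            continue  # not an item directory
        entry = read_meta(os.path.join(dirpath, "meta.yaml"))
        entry["path"] = dirpath
        entry["files"] = {
            name: os.path.getsize(os.path.join(dirpath, name))
            for name in files if name != "meta.yaml"
        }
        index.append(entry)
    return index
```

Because the index is derived, it can be thrown away and rebuilt from any copied subtree - which is exactly the "any part of the filesystem is useful in its own right" property wanted above.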
On 2017-05-17 17:08, Zenaan Harkness wrote:
Has anyone done anything like this, and if so, how did you solve it?
(Medium term, the problem begs for a git-like addressed, git-like P2P addressable/verifiable "content management" solution; e.g. if I have a random collection of say 10K books, it would be nice to be able to say something like: git library init
# add my books: git add . git commit
git library sort categories git library add index1 # create an "indexing" branch git commit
# add some upstream/p2p libraries for indexing/search/federation: git library index add upstream:uni.berkely git library p2p add i2p.tunnel:myfriend.SHA git library full-index pull myfriend.SHA git library math-index pull myfriend.SHA git library book:SHA pull myfriend.SHA
This is the problem of clustering in groups of enormously high dimension, which is a well-studied problem in AI.

Take each substring of six words or less that does not contain a full stop internally. The substring may contain a start-of-sentence marker at the beginning, and/or an end-of-sentence marker at the end.

Each substring constitutes a vector in a space of enormously high dimension - the space of all possible strings of moderate length. Each such vector is randomly but deterministically mapped down to a mere hundred or so dimensions, to a space of moderately high dimension, by randomly giving each coordinate the value plus one, minus one, or zero. Two identical substrings in two separate documents will be mapped to the same vector in the space of moderate dimension.

Each document is then given a position in the space of moderately high dimension by summing all the vectors, and then normalizing the resulting vector to length one. Thus each document is assigned a position on the hypersphere in the space of moderately high dimension. If two different documents contain some identical substrings, they will tend to be closer together.

We then assign a document to its closest group of documents, and the closest subgroup within that group, and the closest sub-sub-group.
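A hedged Python sketch of the scheme described above. The six-word window, the deterministic +1/-1/0 projection, the vector summing and the normalization follow the description; the dimension (128) and the use of SHA-256 to make the projection deterministic are my own choices:

```python
import hashlib
import math

DIM = 128  # "a mere hundred or so dimensions"

def project(substring):
    """Deterministically map a substring to a +1/-1/0 vector by
    seeding each coordinate from a hash of (substring, coordinate)."""
    vec = []
    for i in range(DIM):
        h = hashlib.sha256(("%s|%d" % (substring, i)).encode()).digest()[0]
        vec.append((h % 3) - 1)  # -1, 0, or +1, same for every document
    return vec

def doc_vector(text, window=6):
    """Sum the projections of all word windows of length <= window,
    then normalize to length one: a point on the hypersphere."""
    words = text.split()
    total = [0.0] * DIM
    for n in range(1, window + 1):
        for j in range(len(words) - n + 1):
            sub = " ".join(words[j:j + n])
            for k, v in enumerate(project(sub)):
                total[k] += v
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]

def similarity(a, b):
    """Dot product of two unit vectors: higher means closer together
    on the hypersphere, hence likelier to share substrings."""
    return sum(x * y for x, y in zip(doc_vector(a), doc_vector(b)))
```

Documents sharing word windows pull toward the same region, so e.g. similarity("grow potatoes in sand", "grow potatoes in clay") should tend to exceed similarity("grow potatoes in sand", "compile the linux kernel"); clustering then recursively groups nearest neighbours.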
On Sat, May 27, 2017 at 01:23:41PM +1000, James A. Donald wrote:
On 2017-05-17 17:08, Zenaan Harkness wrote:
Has anyone done anything like this, and if so, how did you solve it?
(Medium term, the problem begs for a git-like addressed, git-like P2P addressable/verifiable "content management" solution; e.g. if I have a random collection of say 10K books, it would be nice to be able to say something like:

  git library init
  # add my books:
  git add .
  git commit
  git library sort categories
  git library add index1
  # create an "indexing" branch
  git commit
  # add some upstream/p2p libraries for indexing/search/federation:
  git library index add upstream:uni.berkeley
  git library p2p add i2p.tunnel:myfriend.SHA
  git library full-index pull myfriend.SHA
  git library math-index pull myfriend.SHA
  git library book:SHA pull myfriend.SHA
This is the problem of clustering in groups of enormously high dimension, which is a well studied problem in AI.
Take each substring of six words or less, that does not contain a full stop internally. The substring may contain a start of sentence marker at the beginning, and or an end of sentence marker at the end.
Each substring constitutes a vector in a space of enormously high dimension, the space of all possible strings of moderate length.
Each such vector is randomly but deterministically mapped to a mere hundred or so dimensions, to a space of moderately high dimension, by randomly giving each coordinate the value of plus one, minus one, or zero.
Two identical substrings in two separate documents will be mapped to the same vector in the space of moderate dimension.
Each document is then given a position in the space of moderately high dimension by summing all the vectors, and then normalizing the resulting vector to length one.
Thus each document is assigned a position on the hypersphere in the space of moderately high dimension.
If two different documents contain some identical substrings, they will tend to be closer together.
We then assign a document to its closest group of documents, and the closest subgroup within that group, and the closest sub sub group.
Interesting - algorithmic auto-categorizing. Sounds like it has the potential to create at least one useful index.

The main bit I don't like about Library of Congress, Dewey and other "physical library" categorization schemes is that they are evidently "optimized" for physical storage, and so they arbitrarily group categories which are not directly related. But even related categories might as well be "different categories" in the digital world - an extra folder/dir is relatively inexpensive, compared with another physical (sub-)shelf and labelling, which we want to be relatively stable --in physical space-- over time, in order to minimize shuffling/reorganizing of the physical storage of books (/maps /etc).

Categories (and sub-, sub-sub-, etc) are definitely useful to folks, and the ACM, some maths journal groups, etc, have typically created four levels of categories + sub-categories --just for their field--. Again, we have no physical limits in virtual categorization space, and to top it off, with something git-like we can have as many categorizations (directory hierarchies) as people find useful - so some auto-algo thing as you suggest, LoC, Dewey, and "no useful sub-category excluded" might all be wanted by different people. So a content item exists, and is GUID-addressed, and then one or more indexes/folder hierarchies overlay on top of this. And indeed one content item such as a book may well be appropriately included in more than one category/s-category/ss-category/sss-category, --in a single chosen hierarchy--, in any particular library.
On 05/26/2017 11:11 PM, Zenaan Harkness wrote:
The main bit I don't like about Library of Congress, Dewey and other "physical library" categorization schemes is that they are evidently "optimized" for physical storage, and so they arbitrarily group categories which are not directly related.
No they don't.
On Sat, May 27, 2017 at 06:33:42AM -0700, Razer wrote:
On 05/26/2017 11:11 PM, Zenaan Harkness wrote:
The main bit I don't like about Library of Congress, Dewey and other "physical library" categorization schemes is that they are evidently "optimized" for physical storage, and so they arbitrarily group categories which are not directly related.
No they don't.
Sure, you can say "G", "Geography, Anthropology, and Recreation", is a category of related topics, but some people want these in separate categories, not in the same category. And then you can say "oh, but inside G you've got sub-categories" - well, that's the point: in the world of file systems and folders we don't have to limit our top-level categories to 26 letters of the alphabet. But we can do that as well. We can have multiple indices of different types - it's just digital, after all.

(Maybe there really is no physical correlation/optimisation going on with LoC classification, if that's what you're trying to say - either way it's wood-for-the-trees on an irrelevant point.)
On 05/27/2017 01:08 PM, Zenaan Harkness wrote:
On Sat, May 27, 2017 at 06:33:42AM -0700, Razer wrote:
On 05/26/2017 11:11 PM, Zenaan Harkness wrote:
The main bit I don't like about Library of Congress, Dewey and other "physical library" categorization schemes is that they are evidently "optimized" for physical storage, and so they arbitrarily group categories which are not directly related.
No they don't.
Sure, you can say "G", the "Geography, Anthropology, and Recreation" is a category of related topics, but some people want these categories in separate categories, not in the same category.
And then you can say "oh, but inside G you've got sub-categories" - well, that's the point - we don't have to limit our top level categories to 26 letters of the alphabet, in the world of file systems and folders.
But we can do that as well.
We can have multiple indices of different types, it's just digital after all.
(Maybe there really is no physical correlation/optimisation going on with LoC classification, if that's what you're trying to say - either way it's wood-for-the-trees on an irrelevant point.)
You're discussing books. It doesn't matter if they're paper or PDF or DocX. The assumption is you want to categorize them by their CONTENT which is what the LoC and Dewey Decimal system do. Rr
On 05/27/2017 08:07 PM, Razer wrote:
You're discussing books. It doesn't matter if they're paper or PDF or DocX. The assumption is you want to categorize them by their CONTENT which is what the LoC and Dewey Decimal system do.
And now that I look again, LoC is probably a better system for this purpose, if only because the Dewey system is supposed to be licensed by libraries using it.

Yes, it's weird to group geography, anthropology, and recreation under one main category, but I'm not sure there's a better place for them under the other broad categories. It's much the same problem as with the original Encyclopedia of Chess Openings scheme for classifying chess openings, which has started to show its limitations, with a lot of codes devoted to openings no longer played and a lot of lines of common openings stuck under the exact same code - thus the unofficial "Scid extensions" etc.

If neither LoC nor (a bootleg use of) Dewey fits your needs, there's always the option of rolling your own.

--
Shawn K. Quinn <skquinn@rushpost.com>
http://www.rantroulette.com
http://www.skqrecordquest.com
On Sat, May 27, 2017 at 09:42:18PM -0500, Shawn K. Quinn wrote:
On 05/27/2017 08:07 PM, Razer wrote:
You're discussing books. It doesn't matter if they're paper or PDF or DocX. The assumption is you want to categorize them by their CONTENT which is what the LoC and Dewey Decimal system do.
And now that I look again, LoC is probably a better system for this purpose, if only because the Dewey system is supposed to be licensed by libraries using it.
Yes, it's weird to group geography, anthropology, and recreation under one main category, but I'm not sure there's a better place for them under the other broad categories. It's much the same problem as with the original Encyclopedia of Chess Openings scheme for classifying chess openings, which has started to show its limitations, with a lot of codes devoted to openings no longer played and a lot of lines of common openings stuck under the exact same code - thus the unofficial "Scid extensions" etc.
If neither LoC nor (a bootleg use of) Dewey fit your needs, there's always the option of rolling your own.
That's the point. Digitally, we have just one set of "GUID"-addressed content, and as many hierarchies/indices on top as we like, which point to that content. Just like git.

I'm still hopeful this seems really obvious by now...
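A hedged sketch of that "one content store, many overlay hierarchies" idea: each file lives once under its hash, and each classification scheme (LoC, Dewey, home-grown) is just a tree of links into the store. The objects/indexes layout names are invented for illustration:

```python
import hashlib
import os

def store_content(root, data):
    """Write bytes once under objects/<sha256>; return the content ID.
    Identical bytes always land in the same place (deduplication)."""
    cid = hashlib.sha256(data).hexdigest()
    os.makedirs(os.path.join(root, "objects"), exist_ok=True)
    path = os.path.join(root, "objects", cid)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return cid

def file_in_hierarchy(root, scheme, category_path, name, cid):
    """Place one item in one overlay hierarchy as a symlink, so the
    same item can appear in any number of schemes at once."""
    d = os.path.join(root, "indexes", scheme, category_path)
    os.makedirs(d, exist_ok=True)
    link = os.path.join(d, name)
    if not os.path.lexists(link):
        os.symlink(os.path.abspath(os.path.join(root, "objects", cid)), link)
```

The same content ID can be linked under indexes/loc/G/... and indexes/home/gardening/... simultaneously, and deleting or rebuilding an index never touches the content - which is the property being argued for here.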
participants (6)
- Ben Tasker
- James A. Donald
- No
- Razer
- Shawn K. Quinn
- Zenaan Harkness