I'm enjoying the discussion of encrypting file systems, but have a perhaps-naive question: can the methods recently proposed here work for fast "random" access of bytes from the middle of a possibly-large file?

Specifically, over the years I have written some free-text information-retrieval programs which build complete inverted indices to every word in a chosen text file (which may be many megabytes long, limited by disk space, not by RAM) --- and in order to fetch and display text quickly from an arbitrary point in the file, my programs do a lot of fseek() operations. If a file is encrypted under various schemes, I wonder how long it would take to fetch byte 100,000,000? Could it cause me some performance problems? :-) Just thought I'd raise the issue....

BTW, if anybody wants to work with large text files, the stuff I've done is all free under GNU GPL; for the nicest user interface, see the Mac version which hides behind HyperCard (in the INFO-MAC archive at sumex-aim.stanford.edu, under directory info-mac/card, with a name beginning "freetext", I think). Generic command-line C code to build indices is "qndxr.c" in various archives, and the generic command-line browser is "brwsr.c". See the description in THE DIGITAL WORD, eds. Landow & Delany, MIT Press, 1993, pp. 53-68, for more details. Briefly, the programs let you scroll around in alphabetized word lists, generate key-word-in-context displays, do simple proximity filtering, and retrieve chunks of text on demand, very fast. Index-building runs at 15-20 MB/hour on an older Mac II-class machine, 60-80 MB/hour on a Sparcstation, etc.

Best, ^z (no relation!)
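The access pattern in question is just seek-and-read. A minimal sketch in C; the file name and buffer size are made up, and any byte position works the same way:

#include <stdio.h>

int main(void)
{
    char buf[4096];
    FILE *fp = fopen("corpus.txt", "rb");  /* hypothetical large text file */
    if (fp == NULL)
        return 1;
    fseek(fp, 100000000L, SEEK_SET);       /* jump straight to byte 100,000,000 */
    size_t n = fread(buf, 1, sizeof buf, fp);
    fwrite(buf, 1, n, stdout);             /* display the retrieved chunk */
    fclose(fp);
    return 0;
}

The question the replies below take up is what happens to that fseek() when the bytes on disk are ciphertext.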
Mark: There are two possible ways to encrypt files as they go to the disk, and I think which one you choose determines what problems an fseek() would encounter.

As I understand the conversation so far, the plan is to make the encrypted disk software a device driver. Consider what a typical device driver must do. The operating system wishes to see all files as long strings (streams) of bytes, and the disk drive wishes to see all "files" as a collection of sectors (of a fixed size). The device driver converts a request for file position 'x' into a request to locate sector 'y'. Even if you only want one byte from the file, a whole sector gets read in. (A sketch of this mapping, in C, follows the quoted message below.)

So here's where you have to decide how you are going to encrypt the file. If you encrypt each sector in isolation, then when the operating system requests a certain file location, that maps to a certain sector, which is read in and decrypted (in isolation from the rest of the sectors), and then the particular byte is readily available. If instead you use a scheme whereby the whole file is encrypted as a single entity (which would be difficult to impossible to do using the device driver metaphor, though there are other ways to wedge encryption into the system), then presumably you need to decrypt from the very beginning any time you seek into the file --- which is what I think you were worried about.

If you are familiar with data compression products, here is a comparison of the two techniques. If you use something like Stacker, which is implemented as a device driver, then you have access to any byte of any file at any time. If you use something like PKLITE (the .exe compressor), then you can never seek into the file (to load an overlay, for example); you have to read the whole file and decompress it in order to access the bytes individually as uncompressed data. (That is not a perfect analogy, since PKLITE files decompress themselves when they execute, not when the operating system accesses them, but it does serve to show the limitations imposed by higher-level file management.)

---
Nick MacDonald | NMD on IRC
i6t4@jupiter.sun.csd.unb.ca | PGP 2.1 Public key available via finger

On Sun, 6 Jun 1993, Mark Edward Zimmerman wrote:
> I'm enjoying the discussion of encrypting file systems, but have a perhaps-naive question: can the methods recently proposed here work for fast "random" access of bytes from the middle of a possibly-large file?
>
> Specifically, over the years I have written some free-text information-retrieval programs which build complete inverted indices to every word in a chosen text file (which may be many megabytes long, limited by disk space, not by RAM) --- and in order to fetch and display text quickly from an arbitrary point in the file, my programs do a lot of fseek() operations. If a file is encrypted under various schemes, I wonder how long it would take to fetch byte 100,000,000? Could it cause me some performance problems? :-)
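A sketch of the mapping Nick describes, in C. The function name is illustrative, and a contiguous file is assumed for brevity (a real driver also goes through the file system's block map):

#include <stdint.h>

#define SECTOR_SIZE 512  /* fixed sector size, as in the description above */

/* Convert a byte position in a file into the sector that holds it
   and the offset of that byte within the sector. */
static void map_position(uint32_t file_pos, uint32_t *sector, uint32_t *offset)
{
    *sector = file_pos / SECTOR_SIZE;
    *offset = file_pos % SECTOR_SIZE;
}

Under per-sector encryption, fetching byte 100,000,000 therefore means reading and decrypting one 512-byte sector (sector 195,312, offset 256) rather than decrypting 100 MB of ciphertext from the start of the file.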
> can the methods recently proposed here work for fast "random" access of bytes from the middle of a possibly-large file?
The model that has been most discussed recently is encrypting sectors on the hard disk. In order to have random access to files, you have to have random access to sectors; therefore the encryption mechanism chosen must support random sector access. This is not difficult, but many of the techniques used for telecommunications encryption do not work. In particular, encryption modes that depend upon some previous state of the encryption machine do not work well.

Cipher block chaining (CBC) is a mode of operation for block ciphers in which each block of plaintext is XOR'd with the previous block of ciphertext before encryption. The first block of plaintext, for which there is no previous ciphertext block, is XOR'd with an initial vector, which may be considered part of the keying material.

Now consider what would happen if you encrypted your whole disk in CBC mode: you'd have to start at the beginning of the disk and decrypt everything up to the point you want to read. For a bit stream this is fine, since one is decrypting the whole thing anyway; for a disk it destroys random access.

CBC, however, is still useful for sector encryption. A DES block is 8 bytes; a sector is typically 512. I assume here that one has to read the whole sector into memory, although with some very clever and not obviously worthwhile optimizations one could decrypt on demand. So CBC is a reasonable choice for in-sector encryption, because you have to read the whole sector anyway. Yet CBC requires an initial vector --- one per sector. This is where counter mode comes in.

A good block cipher has what is called the avalanche property, which says that altering any bit of the input alters, on average, half of the bits of the output. (Note: if it altered more than half, the 1's-complement change would alter less than half.) Thus one initial vector does not need to differ much from the next; the avalanche property guarantees the resulting ciphertexts will be unrelated anyway. Hence an integer-valued counter works fine, and for hard disks the sector number, already present, makes just such a unique initial vector.

Summary: CBC within sectors, initial vectors derived from the sector number.

This characterization of keying material works for block ciphers generally and yields a clean abstraction for the rest of the system:

    algorithm identifier (index or function pointer or link spec)
    plaintext/ciphertext block length
    key length

The rest of the encryption code need only know these values. Some examples (lengths in bytes):

    single DES, 8, 8 (64 bits, of which only 56 are used)
    double DES, 8, 16
    triple DES, 8, 24
    IDEA, 8, 16

Nice and clean.

Eric
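A minimal C sketch of the scheme Eric summarizes --- CBC within a sector, initial vector taken from the sector number, the cipher hidden behind his three-field descriptor. The names and the toy placeholder cipher are illustrative assumptions, not code from any real driver:

#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

typedef void (*block_fn)(const uint8_t *key, const uint8_t *in, uint8_t *out);

struct cipher_desc {
    const char *name;   /* algorithm identifier */
    size_t block_len;   /* plaintext/ciphertext block length, in bytes */
    size_t key_len;     /* key length, in bytes */
    block_fn encrypt;
};

/* Placeholder "cipher": XOR with the key. NOT secure; it merely stands in
   for single/double/triple DES or IDEA, all of which have 8-byte blocks. */
static void toy_block_encrypt(const uint8_t *key, const uint8_t *in, uint8_t *out)
{
    for (size_t i = 0; i < 8; i++)
        out[i] = in[i] ^ key[i];
}

static const struct cipher_desc toy = { "toy", 8, 8, toy_block_encrypt };

/* Encrypt one sector in place: CBC across the sector's blocks, with the
   initial vector formed from the sector number zero-padded to one block.
   (IV byte order is a convention; encrypting in place assumes the cipher
   tolerates in == out.) */
void sector_encrypt_cbc(const struct cipher_desc *c, const uint8_t *key,
                        uint32_t sector_no, uint8_t *sector)
{
    uint8_t chain[16] = {0};             /* IV, then previous ciphertext block */
    memcpy(chain, &sector_no, sizeof sector_no);

    for (size_t off = 0; off < SECTOR_SIZE; off += c->block_len) {
        for (size_t i = 0; i < c->block_len; i++)
            sector[off + i] ^= chain[i]; /* chain with previous ciphertext/IV */
        c->encrypt(key, sector + off, sector + off);
        memcpy(chain, sector + off, c->block_len);
    }
}

/* e.g.: sector_encrypt_cbc(&toy, key, 195312, sector_buf); */

Decryption runs the same chain in reverse: decrypt each block, then XOR with the previous ciphertext block (or the IV for the first block). Since every sector's chain starts fresh from its own sector-number IV, any sector can be decrypted in isolation, which is exactly what preserves random access.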