
At 8:02 PM 4/19/96 -0500, Adam Shostack wrote:
Does anyone have some code that will search a dictionary, and tell me *quickly* if an arbitrary chunk of text is in the dictionary? Pre-indexing steps are fine, as is using big chunks of disk for hash tables. The point of course, is to check arbitrary possible plaintext that a test decryption produces.
This application sounds perfect for Bloom filters. The basic idea of a Bloom filter is to build a database by taking each word in the dictionary and hashing with N different hashes. The hashes do not need to by cryptographically secure, but they do need to be good hashes, XOR doesn't make it. You use those hashes as bit offsets in a giant bit map, which is the database. When building the database you turn on the bits at each of these N locations. When accessing the database, you hash the chunk of text with the same hashes, and then test the bits in the database at those offsets. If any of the bits are zero, then the chunk of text is not in the dictionary. The failure mode is to say something is in the dictionary when it isn't. If half the bits in the database are on, then the probability of failure is 2**(-N), so if N==10, then the failure rate is 1 in 1024. If empirically, you get a higher failure rate, check the quality of your hashes. For cryptanalysis, I might pick a higher N and eyeball check for failures. Say you want 1 in a million failure rate, and have an 80,000 word dictionary. You need a 20 * 80,000 * 2 bit database, which is 400,000 bytes. Regards - Bill ------------------------------------------------------------------------ Bill Frantz | The CDA means | Periwinkle -- Computer Consulting (408)356-8506 | lost jobs and | 16345 Englewood Ave. frantz@netcom.com | dead teenagers | Los Gatos, CA 95032, USA