Cypherpunks write code, so here's some (at the end, anyway). Someone asked awhile back (just before the deluge of postings on the Wiretap Chip swamped my announcement of the release of new versions of our Dolphin Encrypt encryption software) about (something like) how to tell whether a file consists of something like English text or just (apparent) garbage. Here's one way, a program to calculate the entropy (and the relative entropy) of the set of bytes in a file. First the documentation (extracted from Appendix III in the manual for the Dolphin Encryption Library): Information theorists have attempted to formalize and to quantify the notion of randomness, also called entropy. The usual definition of entropy in a string of letters from some alphabet is due to Claude Shannon (who formulated this concept in the 1950s). Let S be a string of letters from some alphabet A = { a(0), a(1), ..., a(k-1) } of k letters, and let p(i) be the probability (that is, the relative frequency) of occurrence of a(i) in the string S, then the entropy E of the string S may be defined as: k-1 E = - Sigma ( p(i) * ln ( p(i) ) ) i = 0 where ln is the natural logarithm. It can be shown that this value is maximized when all letters occur in S with equal frequency (in this case E = ln(k)), and is minimized when one letter occurs all the time (in this case E = 0). Since E ranges between 0 and ln(k), we may obtain a modified entropy value E', which we call relative entropy, which ranges between 0 and 1 by dividing E by ln(k) thus: E' = E / ln(k). The program ENTROPY1.EXE calculates the relative entropy of the bytes in a given file. For a DOS text file consisting of English text the relative entropy value is typically in the range 0.48 - .68. The relative entropy values for most non-random files, including .OBJ, .COM and .EXE files, usually fall in the range 0.50 through 0.95. Files consisting of bytes generated by pseudo-random-number generators typically have relative entropy values in the range 0.970 - 0.999. Thus a file with a relative entropy value of at least .98 looks (at least according to this test) very much like a file consisting of random bytes. ENTROPY.EXE can thus be used to test whether a file appears to consist of random bytes or something like natural language. The ENTROPY1.EXE program takes two parameters on the command line, a file specification (wildcard characters are not allowed in this version) and (optionally) a byte space size, e.g. ENTROPY1 FILE.TXT 150. The program produces results such as: File Size Entropy Rel. entropy Diff. bytes HAMLET.TXT 1459 3.037405 0.547756 42 PTRS.TXT 3683 3.415741 0.615984 108 CHAP04.TXT 51162 3.339292 0.602198 100 FILE1.RND 1762 5.473655 0.987102 255 FILE2.RND 3400 5.503647 0.992511 256 FILE3.RND 29225 5.541324 0.999305 256 HAMLET.ENC 1762 5.478605 0.987995 256 PTRS.ENC 3400 5.501231 0.992075 256 CHAP04.ENC 29225 5.540785 0.999208 256 NULLFILE 20000 0.000000 0.000000 1 The file called NULLFILE consists of 20,000 null (zero) bytes, and has a relative entropy value of zero (as do all files which contain only a single byte value). Note that the relative entropy values for the .ENC files (encrypted using Dolphin Encrypt) are about .99, as are those for the .RND files (created by using a pseudo-random-number generator similar to Microsoft's rand() function) of the same size. The last column gives the number of different bytes found in the file. This may be less than the size of the byte space for the file. If the size of the byte space is less than 256, as is the case with text files, then the space size parameter may be included in the command line, as in ENTROPY1 HAMLET.TXT 108. In this case the program produces results such as: File Size Entropy Rel. entropy Diff.bytes HAMLET.TXT 1459 3.037405 0.648723 42 PTRS.TXT 3683 3.415741 0.729527 108 CHAP04.TXT 51162 3.339292 0.713199 100 Thus decreasing the value for the byte space increases the entropy measure. Relative entropy tends to be larger for larger files. Now the C source code: /* ENTROPY1.C * Written by Peter Meyer, last revised 1993-04-27. * Calculates the relative entropy of the bytes in a file * defined as the negative of the sum for each byte of the product of * the relative probability of that byte times the natural log * of that byte, divided by the natural log of the number of * different bytes occurring in the file; values can range from 0 to 1. */ #include <STDIO.H> /* Microsoft header files */ #include <STDLIB.H> #include <MEMORY.H> #include <MATH.H> unsigned long n[256]; double p[256]; unsigned char *usage = "\nUse: ENTROPY1 filespec [space_size]" "\nspace_size = number of possible bytes (default = 256)\n"; void measure_entropy(unsigned char *filename, unsigned long *total, double *entropy, double *relative_entropy, unsigned int*num_diff_bytes, unsigned int *space_size, int *err_flag); /*-----------------------------*/ void main(int argc, char *argv[]) { int err_flag; unsigned int num_diff_bytes, space_size; unsigned long total; double entropy, relative_entropy; if ( argc == 1 ) { printf(usage); exit(0); } if ( argc == 2 ) space_size = 256; else { space_size = (unsigned int)atoi(argv[2]); if ( space_size == 0 || space_size > 256 ) { printf("\nInvalid space size.\n"); exit(1); } } measure_entropy(argv[1],&total,&entropy,&relative_entropy,&num_diff_bytes, &space_size,&err_flag); switch ( err_flag ) { case 0: /* no error */ printf("Space size = %u\n",space_size); printf("\n%15s%15s%15s%15s%15s", "File","Size","Entropy","Rel. entropy","Diff. bytes"); printf("\n%15s%15lu",argv[1],total); printf("%15.6f%15.6f%15d\n",entropy,relative_entropy,num_diff_bytes); exit(0); case -1: printf("\nCannot open file %s.\n",argv[1]); exit(2); case -2: printf("\n%15s is inconsistent with space size %d.\n", argv[1],space_size); exit(3); } } /*-----------------------------------------*/ void measure_entropy(unsigned char *filename, unsigned long *total, double *entropy, double *relative_entropy, unsigned int *num_diff_bytes, unsigned int *space_size, int *err_flag) { int j; FILE *file; *err_flag = 0; file = fopen(filename,"rb"); if ( file == NULL ) { *err_flag = -1; return; } /* zero the frequency array */ memset(n,0,256*sizeof(unsigned long)); /* count the byte values */ while ( !feof(file) ) n[fgetc(file)]++; /* get the number of bytes and the number of different byte values */ *num_diff_bytes = 0; *total = 0L; for ( j=0; j<256; j++ ) { *num_diff_bytes += ( n[j] != 0 ); *total += n[j]; } if ( *num_diff_bytes > *space_size ) { *err_flag = -2; fclose(file); return; } /* calculate the probabilities */ for ( j=0; j<256; j++ ) p[j] = ((double)n[j])/(*total); /* calculate the entropy */ *entropy = 0.0; for ( j=0; j<256; j++ ) { if ( p[j] ) *entropy += p[j]*log(p[j]); } *entropy = -1.0*(*entropy); /* calculate the relative entropy */ *relative_entropy = *entropy/log(*space_size); fclose(file); } If anyone wants the MS-DOS executable version of this program then send me (meyer@mcc.com) a snailmail address and I'll send it to you on the Dolphin Encrypt demonstration disk. -- Peter Meyer