entropy, with code
The entropy-calculating code is at the end of this message. I took the opportunity to calculate some sample entropies: entropy.c 5.283772 the source code entropy.asc 6.052222 entropy.c, encrypted and armored entropy.as2 6.012493 entropy.asc, with the wrappers removed entropy.pgp 7.830532 entropy.c, encrypted alone entropy.obj 6.112890 entropy.exe 6.947111 randseed.bin 4.584963 from pgp, 24 bytes long pubring.pgp 7.754017 my public key ring The entropy of the source code is in the high end of the range for English, which is not surprising given the amount of symbols in ordinary C code. The entropy increases with the object and the executable, each of which has less overt structure to it. The entropy of the encrypted and ascii armored source code is within 1% of 6 bits, just as predicted. And with the wrappers removed, it's even closer! The binary version of the encrypted message has the highest entropy of all these tests. In randseed.bin, the entropy is much less than 8. But the length of the file is 24 bytes and log_2 24 = 4.584963, indicating that there are no duplicate bytes in the file. Hence the warning: entropy calculations with small samples _will_ be misleading. Note that the entropy subroutine can be used to calculate the frequencies of any distribution. This will allow all you code-writing cypherpunks to measure bit entropies, digram entropies, etc. Eric ---------------------------------------------------------------- /* entropy.c -- Calculate monogram entropies of standard input. */ #include <stdio.h> #include <math.h> /* This next define is to counteract the foolish C 7.00 runtime mapping * of \x1A to EOF */ #define STUPID_NEWLINE_TRANSLATE 1 #ifdef STUPID_NEWLINE_TRANSLATE #include <io.h> #include <fcntl.h> #endif /*--------------------------------------*/ #define NUMBER_OF_BYTES 256 long byte_freq[ NUMBER_OF_BYTES ] ; void main() { int c, verbose = 0 ; unsigned int j ; double entropy( long *, int ) ; #if STUPID_NEWLINE_TRANSLATE _setmode( _fileno( stdin ), _O_BINARY ) ; #endif for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) { byte_freq[ j ] = 0 ; } while ( EOF != ( c = getchar() ) ) { ++ byte_freq[ (unsigned int) c ] ; } if ( verbose ) { for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) { printf( "%3d=%-3d ", j, byte_freq[ j ] ) ; if ( j % 8 == 7 ) printf( "\n" ) ; } } printf( "%lf\n", entropy( byte_freq, NUMBER_OF_BYTES ) ) ; } /*--------------------------------------*/ /* Calculates the entropy of the distribution given in list v of n elements. * The list v gives counts. v is summed, and frequencies are assigned * as v[i]/sum(v). * * Uses the following definitions and identities: * A = \Sum_{i=0}^{n-1} v_i * p_i = v_i / A * H = \Sum_{i} - p_i \log p_i * = log A - 1/A \Sum_{i} v_i \log v_i * lim_{x \rightarrow 0} x \log x = 0 */ double entropy( long *v, int n ) { double h ; long sum ; int j ; /* first sum the array */ sum = 0 ; for ( j = 0 ; j < n ; ++j ) { sum += v[ j ] ; } /* next calculate the entropy function */ h = 0 ; for ( j = 0 ; j < n ; ++ j ) { /* If the frequency is zero, the entropy contribution is zero */ if ( v[ j ] == 0 ) continue ; h -= v[ j ] * log( v[ j ] ) ; } h /= sum ; h += log( sum ) ; /* Now adjust the base of the logarithm to base 2 */ h /= log( 2 ) ; return h ; }
participants (1)
-
Eric Hughes