[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
entropy, with code
The entropy-calculating code is at the end of this message. I took
the opportunity to calculate some sample entropies:
entropy.c 5.283772 the source code
entropy.asc 6.052222 entropy.c, encrypted and armored
entropy.as2 6.012493 entropy.asc, with the wrappers removed
entropy.pgp 7.830532 entropy.c, encrypted alone
entropy.obj 6.112890
entropy.exe 6.947111
randseed.bin 4.584963 from pgp, 24 bytes long
pubring.pgp 7.754017 my public key ring
The entropy of the source code is in the high end of the range for
English, which is not surprising given the amount of symbols in
ordinary C code. The entropy increases with the object and the
executable, each of which has less overt structure to it.
The entropy of the encrypted and ascii armored source code is within
1% of 6 bits, just as predicted. And with the wrappers removed, it's
even closer! The binary version of the encrypted message has the
highest entropy of all these tests.
In randseed.bin, the entropy is much less than 8. But the length of
the file is 24 bytes and log_2 24 = 4.584963, indicating that there
are no duplicate bytes in the file. Hence the warning: entropy
calculations with small samples _will_ be misleading.
Note that the entropy subroutine can be used to calculate the
frequencies of any distribution. This will allow all you code-writing
cypherpunks to measure bit entropies, digram entropies, etc.
Eric
----------------------------------------------------------------
/* entropy.c -- Calculate monogram entropies of standard input.
*/
#include <stdio.h>
#include <math.h>
/* This next define is to counteract the foolish C 7.00 runtime mapping
* of \x1A to EOF
*/
#define STUPID_NEWLINE_TRANSLATE 1
#ifdef STUPID_NEWLINE_TRANSLATE
#include <io.h>
#include <fcntl.h>
#endif
/*--------------------------------------*/
#define NUMBER_OF_BYTES 256
long byte_freq[ NUMBER_OF_BYTES ] ;
void
main()
{
int c, verbose = 0 ;
unsigned int j ;
double entropy( long *, int ) ;
#if STUPID_NEWLINE_TRANSLATE
_setmode( _fileno( stdin ), _O_BINARY ) ;
#endif
for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) {
byte_freq[ j ] = 0 ;
}
while ( EOF != ( c = getchar() ) ) {
++ byte_freq[ (unsigned int) c ] ;
}
if ( verbose ) {
for ( j = 0 ; j < NUMBER_OF_BYTES ; ++j ) {
printf( "%3d=%-3d ", j, byte_freq[ j ] ) ;
if ( j % 8 == 7 ) printf( "\n" ) ;
}
}
printf( "%lf\n", entropy( byte_freq, NUMBER_OF_BYTES ) ) ;
}
/*--------------------------------------*/
/* Calculates the entropy of the distribution given in list v of n elements.
* The list v gives counts. v is summed, and frequencies are assigned
* as v[i]/sum(v).
*
* Uses the following definitions and identities:
* A = \Sum_{i=0}^{n-1} v_i
* p_i = v_i / A
* H = \Sum_{i} - p_i \log p_i
* = log A - 1/A \Sum_{i} v_i \log v_i
* lim_{x \rightarrow 0} x \log x = 0
*/
double
entropy( long *v, int n )
{
double h ;
long sum ;
int j ;
/* first sum the array
*/
sum = 0 ;
for ( j = 0 ; j < n ; ++j ) {
sum += v[ j ] ;
}
/* next calculate the entropy function
*/
h = 0 ;
for ( j = 0 ; j < n ; ++ j ) {
/* If the frequency is zero, the entropy contribution is zero
*/
if ( v[ j ] == 0 ) continue ;
h -= v[ j ] * log( v[ j ] ) ;
}
h /= sum ;
h += log( sum ) ;
/* Now adjust the base of the logarithm to base 2
*/
h /= log( 2 ) ;
return h ;
}