Re: why compression doesn't perfectly even out entropy

17 Dec 2003

      From: daw@cs.berkeley.edu (David Wagner)
...
My entropy cruncher takes in random noise from a number of diverse
sources (some possibly of dubious quality).  I take *all* the noise
and run it through a hash function to distill entropy.
Now I need to have some method to estimate when I have enough entropy
in the random noise I'm crunching.  First rule: be conservative.
One can never have too much entropy in the input to the hash function.
Therefore, I suggest making a *copy* of the input noise stream,
running it through Jon Wienke's "this shouldn't happen" filter, and
feeding the result to some entropy estimator.  When the entropy
estimator says "I've got 1000 bits of entropy", I stop crunching.
This is conservative design, folks.  Using Wienke's filter in this manner
can't be any weaker than not using it at all. (agreed?)

I see two problems with this.

The first is whether this mysterious black box, the entropy estimator,
is really possible.  In practice the only way to know how much entropy
you've gotten is to have a model for how the data is being generated,
and to deduce from that an estimate of the entropy rate.  So the entropy
estimator can't be a general-purpose calculation, but it must be one
which is specifically chosen, developed and tuned for the specific source
of entropy you are dealing with.

Given this, what is the point of filtering?  You already have a model.
If you want to be conservative, why not just take 50% more bits than your
model says you need?
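
To make the "50% more" idea concrete, here is a minimal sketch.  The
function name, the 1000-bit target and the 2.0 bits-per-sample figure are
all hypothetical; the per-sample number has to come from your own model of
the source, not from this code.

```python
import math

def samples_needed(target_bits, model_bits_per_sample, margin=0.5):
    """How many samples to collect so that the model's own estimate
    overshoots the target by `margin` (0.5 means 50% extra)."""
    if model_bits_per_sample <= 0:
        raise ValueError("the model must credit some entropy per sample")
    required_bits = target_bits * (1.0 + margin)
    return math.ceil(required_bits / model_bits_per_sample)

# Hypothetical figures: 1000 bits wanted, model credits 2.0 bits per sample.
print(samples_needed(1000, 2.0))
```

The safety margin lives entirely in the one parameter you chose from your
model, so no second estimator or filter is needed to be conservative.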

The other problem is the functioning of this filter.  I haven't followed
Jon's proposals closely, but at one point he was talking about
histogramming the input and throwing out data which he had seen too
often.  Now this is an implicit model as well - it assumes that the data
is supposed to be uniformly distributed on a per-byte (or whatever the
data elements are) basis.

Suppose your random noise from dubious sources includes some timing
values which vary in the range 90-110, roughly normally distributed.  You
have good reason to believe that it actually is a normal distribution,
and that there are 2 or 3 good bits of entropy per sample.  If you didn't
use Jon's filter you could just collect data, hash it, and figure that
each datum gave you this much entropy.
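
As a rough illustration of deducing an entropy figure from a model, the
sketch below treats each timing value as a normal variable rounded to an
integer.  The mean of 100 matches the example above, but the standard
deviation of 3.5 is purely my assumption (only the 90-110 range is given),
and the model is idealized: it assumes independent samples, which real
timings may not be.  Treat its output as an upper bound and credit
yourself less.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def integer_pmf(mean, sigma, lo, hi):
    """Probability of each integer reading lo..hi under a rounded normal model."""
    pmf = {}
    for v in range(lo, hi + 1):
        pmf[v] = normal_cdf(v + 0.5, mean, sigma) - normal_cdf(v - 0.5, mean, sigma)
    total = sum(pmf.values())
    # Renormalise, since the tails outside lo..hi were cut off.
    return {v: p / total for v, p in pmf.items()}

def shannon_bits(pmf):
    """Shannon entropy in bits per sample."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0.0)

def min_entropy_bits(pmf):
    """Min-entropy in bits per sample: set by the single most likely value."""
    return -math.log2(max(pmf.values()))

MEAN = 100.0
SIGMA = 3.5   # assumed; not stated in the example above

pmf = integer_pmf(MEAN, SIGMA, 0, 255)
shannon = shannon_bits(pmf)
min_entropy = min_entropy_bits(pmf)
print(f"Shannon entropy: {shannon:.2f} bits/sample")
print(f"Min-entropy:     {min_entropy:.2f} bits/sample")
```

Min-entropy is the more cautious of the two numbers, since it is driven by
how often the most common value shows up.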

But now if you throw Jon's filter in there, it may start throwing out all
the values in the range 90-110.  "Where are the 0-80's?" it wonders.  "Where
are the 120's and up?  There are way too many 100's here!"  If the filter
isn't smart about the data like your model is, it could end up throwing
the whole data set out.  Your entropy counter would be spinning its
wheels waiting for more data, and you'd think you never got enough.
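
This failure mode is easy to reproduce.  The sketch below is NOT Jon's
actual filter, which I have not studied; it is a hypothetical stand-in that
histograms byte values and drops any value seen more than `tolerance`
times as often as a uniform source would produce it.  The tolerance, the
minimum count, the seed and the standard deviation of 3.5 are all
assumptions made for illustration.

```python
import random

def histogram_filter(samples, n_values=256, tolerance=3.0, min_count=4):
    """Hypothetical 'seen too often' filter that implicitly assumes the
    input is uniform over n_values symbols.  Returns the samples it keeps."""
    counts = [0] * n_values
    kept = []
    for seen, value in enumerate(samples, start=1):
        counts[value] += 1
        # Count a uniform source would give this value so far, times the
        # tolerance, with a floor so the first few samples are not rejected.
        allowed = max(min_count, tolerance * seen / n_values)
        if counts[value] <= allowed:
            kept.append(value)
    return kept

rng = random.Random(2003)   # fixed seed so the run is repeatable
N = 10_000

# Timing values: roughly normal around 100, rounded and clamped to a byte.
timings = [min(255, max(0, round(rng.gauss(100.0, 3.5)))) for _ in range(N)]
# Control input that matches the filter's implicit uniform model.
uniform = [rng.randrange(256) for _ in range(N)]

kept_timing_fraction = len(histogram_filter(timings)) / N
kept_uniform_fraction = len(histogram_filter(uniform)) / N
print(f"timing samples kept:  {kept_timing_fraction:.1%}")
print(f"uniform samples kept: {kept_uniform_fraction:.1%}")
```

The filter passes nearly everything from the uniform control and discards
the bulk of the timing data, even though the timing data is exactly what
the source model predicts.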

So I think the lesson is that there is only one way to estimate entropy,
and that is to study your source.  I have to agree with Perry that this
filtering concept is not the way to go.  It is a red herring that lures
you in the direction of automatic entropy estimation, and that is really
not safe.

Hal Finney