Summary of: Estimating Population from Repetitions in Accumulated Random Samples In the latest (April 1994) issue of Cryptologia, I describe the development of a new technique for the statistical estimation of population. An example of such a problem would be estimating the number of different values or codes produced by a physically- random number generator. Background This work is an outgrowth of a sci.crypt discussion in early 1992 in which Nico de Vries promoted as "physically-random" a computer program which made use of variations between software and "IBM PC" hardware timing. It was difficult to know how one could determine the amount of "state" (and, thus, the limit of "randomness") in such a mechanism. Ross Anderson suggested measurement using the "birthday paradox." The Experimental Procedure The experimenter will obtain a value from the RNG and save it, repeating this for some fixed number of random samples, a "trial." Each new sample must be compared to all previous samples to see if there is a match or "exact double." (The birthday paradox does not apply to those statistical RNG's which are designed to produce a sequence without value repetition.) A trial contains enough samples if, on average, it produces a few doubles. About 2.5 or 3 Sqrt(N) samples will be needed, given population N, but N is the value we wish to measure. Producing and saving N samples may not be trivial. Exact Repetitions In a single trial, if we find two occurrences of some value, we have a single level-two "repetition"; this is an "exact" repetition count. But if we then find another occurrence of the same value, we have a level-three repetition and no level-two repetitions. Note how increased information (another occurrence) results in reduced effectiveness in the level-two measurement statistic. Expectations Classical binomial equations can predict the number of expected exact repetitions for a given population and number of samples. But these equations are extremely difficult to reverse for use in predicting population. Trying to use these equations with numerical root-finding techniques produces ambiguous results, as there are generally multiple roots. Equations which _estimate_ the probability of repetitions are well known, but it was not previously clear how accurate these would be, how they could be used effectively, what they would mean in random sampling distribution, or how they could be generalized to higher repetition levels. Augmented Repetitions I have found a new, simple, exact, and easily-reversed combinatoric relationship between population and a value which I call "augmented repetitions." An "augmented double" consists of the number of exact doubles (exactly two samples which have the same value), _plus_ contributions from exact triples, exact quads, etc. An exact triple may be seen as three doubles: There are three ways in which an exact triple may produce exact doubles. Therefore, for augmentation purposes, a triple should count as three augmented doubles. Similarly, a quad or exact 4-rep may be 4 seen as ( ) or 6 doubles, the number of combinations of four 2 things taken two at a time. When we do this, we find that simple equations predict the result _exactly_. Thus, the number of augmented repetitions at the kth level (k = 2 means doubles), given r exact repetitions at level i is: i n i ar = SUM ( ) r . k i=1 k i (This is equation 2.3 which very unfortunately was printed incorrectly in the article.) That is, we multiply the number of exact matches at each level by the effective number of matches each could produce at the lower level, and accumulate an overall sum. Augmented Doubles and Population Given population N, the expected number of augmented doubles Ead found in s samples is _exactly_: s (s - 1) Ead(N,s) = --------- . 2 N Given population N = 10,000 (so Sqrt(N) = 100), we can show the expected number of augmented doubles for various numbers of samples: s Ead ----------- 100 0.495 150 1.118 200 1.990 250 3.113 300 4.485 400 7.980 The formula implies, of course, that the population N is related to augmented doubles ad and samples s as: s (s - 1) Nad(s,ad) = --------- 2 ad which is the desired simple form for estimating population. Distribution A major issue in population measurement is the fact that the number of augmented doubles varies greatly over similar trials on the exact same population. Thus, a single trial is essentially meaningless for estimating population. Experiments indicate that various numbers of augmented doubles occur in Poisson distribution over different trials, a result which also has theoretical support. Therefore, we should develop an arithmetic mean or expected value which is the Poisson parameter. The Poisson distribution is asymmetric, and changes radically for different expected values. In general it will be necessary to perform tens or hundreds of separate trials to develop an accurate mean for population estimation. It is worthwhile to accumulate the entire distribution (rather than just a simple mean), and compare that shape with the ideal shape of the Poisson distribution for the given mean. The Poisson distribution also gives us a way to talk about the probability of finding augmented doubles Ead: -Ead Pd(N,s) = 1 - e . So, for population N = 10,000: s Ead Pd ------------------ 100 0.495 0.39 150 1.118 0.67 200 1.990 0.86 250 3.113 0.96 300 4.485 0.99 400 7.980 0.9997 It is often stated that the birthday paradox predicts a match with the sample size s = Sqrt(N), but this value is actually a little small; the expected number of augmented doubles for s = Sqrt(N) is 0.5 (and there are at least as many augmented doubles as exact doubles). Thus, if we want one augmented double on average, we need something like s = 1.5 Sqrt(N) samples. But it is beneficial to move the Poisson distribution toward a symmetric Normal curve, so 2.5 Sqrt(N) or 3 Sqrt(N) are reasonable experimental minimums. The Advance A new statistically-exact combinatoric relationship has been found between population and value repetition in random trials. Since previous well-known estimates could be used for rough estimates, it is not clear that this is a breakthrough in practice. However, the identification of an applicable _exact_ relationship, and its expected distribution in random trials, is important in that it clarifies what we can expect to see in actual use. The paper starts with simple probability, limits itself to algebra and statistics, discusses the existing techniques for exact and other repetitions, and develops general expressions for augmented repetitions. It also has tables of all possible trials for some tiny populations, whose resulting repetition values correspond to predictions exactly. The paper also has some nice graphs of experimental results on larger populations, which show a real Poisson distribution in action, and tables show the effect of estimating population from the experimental results. --- Terry Ritter ritter@io.com