[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Estimating Population Summary
Summary of: Estimating Population from Repetitions in
Accumulated Random Samples
In the latest (April 1994) issue of Cryptologia, I describe the
development of a new technique for the statistical estimation of
population. An example of such a problem would be estimating the
number of different values or codes produced by a physically-
random number generator.
Background
This work is an outgrowth of a sci.crypt discussion in early 1992
in which Nico de Vries promoted as "physically-random" a computer
program which made use of variations between software and "IBM PC"
hardware timing. It was difficult to know how one could determine
the amount of "state" (and, thus, the limit of "randomness") in
such a mechanism. Ross Anderson suggested measurement using the
"birthday paradox."
The Experimental Procedure
The experimenter will obtain a value from the RNG and save it,
repeating this for some fixed number of random samples, a "trial."
Each new sample must be compared to all previous samples to see if
there is a match or "exact double." (The birthday paradox does not
apply to those statistical RNG's which are designed to produce a
sequence without value repetition.) A trial contains enough samples
if, on average, it produces a few doubles. About 2.5 or 3 Sqrt(N)
samples will be needed, given population N, but N is the value we
wish to measure. Producing and saving N samples may not be trivial.
Exact Repetitions
In a single trial, if we find two occurrences of some value, we
have a single level-two "repetition"; this is an "exact" repetition
count. But if we then find another occurrence of the same value,
we have a level-three repetition and no level-two repetitions.
Note how increased information (another occurrence) results in
reduced effectiveness in the level-two measurement statistic.
Expectations
Classical binomial equations can predict the number of expected
exact repetitions for a given population and number of samples.
But these equations are extremely difficult to reverse for use in
predicting population. Trying to use these equations with numerical
root-finding techniques produces ambiguous results, as there are
generally multiple roots. Equations which _estimate_ the probability
of repetitions are well known, but it was not previously clear how
accurate these would be, how they could be used effectively, what
they would mean in random sampling distribution, or how they could
be generalized to higher repetition levels.
Augmented Repetitions
I have found a new, simple, exact, and easily-reversed combinatoric
relationship between population and a value which I call "augmented
repetitions." An "augmented double" consists of the number of
exact doubles (exactly two samples which have the same value),
_plus_ contributions from exact triples, exact quads, etc.
An exact triple may be seen as three doubles: There are three ways
in which an exact triple may produce exact doubles. Therefore, for
augmentation purposes, a triple should count as three augmented
doubles. Similarly, a quad or exact 4-rep may be
4
seen as ( ) or 6 doubles, the number of combinations of four
2
things taken two at a time. When we do this, we find that simple
equations predict the result _exactly_.
Thus, the number of augmented repetitions at the kth level (k = 2
means doubles), given r exact repetitions at level i is:
i
n i
ar = SUM ( ) r .
k i=1 k i
(This is equation 2.3 which very unfortunately was printed
incorrectly in the article.)
That is, we multiply the number of exact matches at each level by
the effective number of matches each could produce at the lower
level, and accumulate an overall sum.
Augmented Doubles and Population
Given population N, the expected number of augmented doubles Ead
found in s samples is _exactly_:
s (s - 1)
Ead(N,s) = --------- .
2 N
Given population N = 10,000 (so Sqrt(N) = 100), we can show
the expected number of augmented doubles for various numbers
of samples:
s Ead
-----------
100 0.495
150 1.118
200 1.990
250 3.113
300 4.485
400 7.980
The formula implies, of course, that the population N is
related to augmented doubles ad and samples s as:
s (s - 1)
Nad(s,ad) = ---------
2 ad
which is the desired simple form for estimating population.
Distribution
A major issue in population measurement is the fact that the
number of augmented doubles varies greatly over similar trials
on the exact same population. Thus, a single trial is essentially
meaningless for estimating population.
Experiments indicate that various numbers of augmented doubles
occur in Poisson distribution over different trials, a result
which also has theoretical support. Therefore, we should develop
an arithmetic mean or expected value which is the Poisson parameter.
The Poisson distribution is asymmetric, and changes radically for
different expected values. In general it will be necessary to
perform tens or hundreds of separate trials to develop an accurate
mean for population estimation. It is worthwhile to accumulate the
entire distribution (rather than just a simple mean), and compare
that shape with the ideal shape of the Poisson distribution for
the given mean.
The Poisson distribution also gives us a way to talk about the
probability of finding augmented doubles Ead:
-Ead
Pd(N,s) = 1 - e .
So, for population N = 10,000:
s Ead Pd
------------------
100 0.495 0.39
150 1.118 0.67
200 1.990 0.86
250 3.113 0.96
300 4.485 0.99
400 7.980 0.9997
It is often stated that the birthday paradox predicts a match
with the sample size s = Sqrt(N), but this value is actually a
little small; the expected number of augmented doubles for
s = Sqrt(N) is 0.5 (and there are at least as many augmented
doubles as exact doubles). Thus, if we want one augmented double
on average, we need something like s = 1.5 Sqrt(N) samples. But it
is beneficial to move the Poisson distribution toward a symmetric
Normal curve, so 2.5 Sqrt(N) or 3 Sqrt(N) are reasonable
experimental minimums.
The Advance
A new statistically-exact combinatoric relationship has been found
between population and value repetition in random trials. Since
previous well-known estimates could be used for rough estimates,
it is not clear that this is a breakthrough in practice.
However, the identification of an applicable _exact_ relationship,
and its expected distribution in random trials, is important in
that it clarifies what we can expect to see in actual use.
The paper starts with simple probability, limits itself to algebra
and statistics, discusses the existing techniques for exact and
other repetitions, and develops general expressions for augmented
repetitions. It also has tables of all possible trials for some
tiny populations, whose resulting repetition values correspond to
predictions exactly. The paper also has some nice graphs of
experimental results on larger populations, which show a real
Poisson distribution in action, and tables show the effect of
estimating population from the experimental results.
---
Terry Ritter [email protected]