27/8/2017, 16:50

Bias in the .NET random number generator

In this post we will see that when prompted to provide random positive integers, the default .NET Framework random number generator has a strong bias in its least significant bit.



Background


At my work, a good amount of time is spent doing simulation modeling: Analyses of computational algorithms whose underlying variables are governed by probability distributions, and which may be studied by running the systems with concrete samples when actual analysis in infeasible. In fact, even when analytical studies of such systems are possible, simulation tends to be faster to implement. Moreover, working in a Microsoft-oriented company, much time is spent together with the .NET Framework whose class library includes a pseudo-random number generator, which we will refer to as System.Random.



For such a generator to be useful in the context of simulation, it must have certain properties. In particular we require that it has no apparent bias. For example, if a generator can produce integers, one way to use it for simulating the flipping of a fair coin would be to take such an integer, and associate it to the events of heads/tails if the integer is odd/even respectively. In this case, if the generator is biased towards even integers, it is clearly inadequate for the task at hand.



Now, one does not have to search long on StackOverflow to find claims that System.Random is not useful in the context of simulation, but what is harder to find are good reasons for why this would be the case; in particular, the Microsoft documentation of the generator makes no such disclaimers.



Here, we will show that System.Random.Next(0, int.MaxValue) and System.Random.Next(int.MaxValue), which both give a "random" integer between \(0\) and \(2^{31}-2\) (both inclusive), have much greater probabilities of returning odd numbers than even numbers. This is the case for version 4.7 and earlier versions of the .NET Framework.



The algorithm


The source code for System.Random is available through Microsoft's Reference Source, allowing us to directly review the algorithm for potential issues. The algorithm for Random.Next(b), which gives an integer between \(0\) and \(b-1\) amounts to the following:

1. Take a random integer between \(0\) and \(2^{31}-2\).
2. Divide the integer by \(2^{31}-1\) to get a double-precision floating-point between \(0\) and \(1\).
3. Multiply this floating-point with \(b\) to get a number between \(0\) and \(b\).
4. Take the integer part of this new number to get an integer between \(0\) and \(b-1\).

Arguably the first point above is the most interesting one, but for the purposes of this post, we will simply assume that the integer random number generation is perfect, and consider only the implications of the floating-point conversions that follow. As it turns out, the integer generation itself has a number of issues, which I might cover in a follow-up post; for now, let me refer to this draft in which the algorithm, and some of its issues, are discussed in detail.



Floating-points and rounding errors


Let us now focus on the case of \(b = 2^{31}-1\), and let \(m\) be the output of the first step above, that is, \(m \in \{0, \dots, 2^{31}-2\}\). Then, the number output by Random.Next(b) is x = (int)((m * (1.0/int.MaxValue)) * int.MaxValue), the result of multiplying \(m\) by the double-precision floating-point \(1/(2^{31}-1)\) (whose byte representation is 0x3E00000000200000), and afterwards by the integer \(2^{31}-1\) (represented as 0x7FFFFFFF) – both operations as floating-point operations – and taking the integer part of the result.



This would be entirely unproblematic if not for the fact that we risk incurring rounding errors in the process. Indeed, if \(f : \{0, \dots, 2^{31}-2\} \to \{0, \dots, 2^{31}-2\}\) denotes the composition of the multiplications, we find that
\[f(m) = \begin{cases} m-1, & m = 2^{n+22} + 2^{n+2}k + 2^n,\ n \in \{0, \dots, 8\},\ k \in \{0, \dots, 2^{21}-2^{20}-1\}, \\ m, & \text{otherwise.}\end{cases}\] I wish I had a nice analytical proof of this relation, but it came about only through experimentation and searching for patterns. To see that it holds, we simply check all values of \(m\); the following piece of C handles that in seconds:


#include 

int main()
{
int maxInt = 2147483647;
double macIntRec = 1.0 / maxInt;
int n = 1; // Index of most significant bit
int modulus;
int period;
for (int m = 1; m < maxInt; m++)
{
// When m == 2^n, update criterion for f(m) == m
if (m == 1 << n) {
printf("m = 2^%d\n", n);
modulus = 1 << n - 22;
period = (1 << n - 20) - 1;
n++;
}
int x = (int)(m * macIntRec * maxInt);
// Test m % 2^(n-20) == 2^(n-22)
if ((m & period) == modulus) x++;
if (x != m) printf("Mistake at m: %d\n", m); // This never happens
}
return 0;
}

With this in place, we see our source of bias: From the case \(n = 0\), there are \(2^{21}-2^{20}-1\) odd numbers that are mapped to even numbers after rounding, but from the cases \(n = 1, \dots, 8\), we get \(8 \cdot (2^{21}-2^{20}-1)\) even numbers that are mapped to odd numbers.



In other words, if \(m\) was taken uniformly at random, the probability that \(f(m)\) is odd is \(0.5034\), which, for the purposes of simulation, is a no-go.



For the meticulous, let us quickly note that \(\{0, \dots, 2^{31}-2\}\) contains more even numbers that odd numbers; this clearly makes no difference for the argument (and in fact only makes the situation worse).



An experiment


The above tells us that Random.Next(int.MaxValue) should output many more odd than even numbers. The argument relied on the integer random number generation actually working. Since this algorithm itself comes with bias, we should include in our discussion an actual .NET example, so let us come back to where we started and simulate coin flipping with C#:


public static void Main()
{
var rng = new Random(0);
var sampleSize = 1000000;
var numbers = Enumerable.Repeat(0, sampleSize).Select(_ => rng.Next(0, int.MaxValue));
var heads = numbers.Count(x => x % 2 == 1);
var tails = sampleSize - heads;
Console.WriteLine($"Heads: {heads}, tails: {tails}");
// Output: Heads: 503291, tails: 496709
}

As expected, odd numbers are overrepresented in the output, and the question becomes: How overrepresented? A result of 500000-500000 would probably look even more dubious, and a bit of care needs to be taken in interpreting ratios such as the above.



A frequentist statistician will tell you that in order to properly rule out an outcome such as the above, one needs to consider how likely that outcome would be had we used a truly random generator. As such, we need to calculate the probability of obtaining \(496709\) tails from a million flips of a fair coin, or an even more extreme outcome. This probability is given by the cumulative distribution function of the binomial distribution; we calculate this using the Python library SciPy.


In [1]: import scipy.stats as st

In [2]: 2*st.binom(1000000, 0.5).cdf(496709)
Out[2]: 4.6722165389139607e-11

That is, were we to flip a million coins, the probability of seeing our observed result, or something more extreme, is \(0.0000000046\%\). Not super likely.



import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

x = np.arange(4.95e5, 5.05e5)
c = st.binom(1e6, 0.5).pmf(x)
plt.ylabel('Probability density')
plt.xlabel('Number of heads')
plt.title('Outcomes of a million coin flips')

ax = plt.axes()
bbox_props = dict(boxstyle="rarrow", fc=(0.8,0.9,0.9), ec=(0.5,0.5,0.9), lw=0.5)
t = ax.text(503291, 0.00025, "System.Random", ha="center", va="center", rotation=-90,
size=15, bbox=bbox_props)
plt.gcf().subplots_adjust(left=0.15)
plt.plot(x, c)


Final remarks


The main takeaway from this, besides a warning against using System.Random for anything that requires random numbers, should be that even if we had taken as our starting point something truly random, the fact that the innocent-looking type conversions wrecked chaos indicates how every part of random number generation needs to be based on theory. Citing Knuth, "random numbers should not be generated with a method chosen at random. Some theory should be used".



In the special case of System.Random we should note that one can avoid the floating-point conversions by invoking Random.Next() directly. As mentioned elsewhere, this method has other issues meaning that it can not exactly be taken to be based on theory.



Microsoft have been made aware of the various issues surrounding System.Random. My hope is that .NET Core will evolve to support proper random number generation by default.




Comments


No comments yet.


Add a comment

To avoid too much spam, you'll have to log in to add a comment. You can do so below, or you can create a user, if you don't already have one. You can use your photo login here as well.


E-mail address:
Password:  Forgot your password?
Captcha:

[ Different Image ]
CAPTCHA Image