27/8/2017, 16:50

Bias in the .NET random number generator

In this post we will see that when prompted to provide random positive integers, the default .NET Framework random number generator has a strong bias in its least significant bit.



Background


At my work, a good amount of time is spent doing simulation modeling: Analyses of computational algorithms whose underlying variables are governed by probability distributions, and which may be studied by running the systems with concrete samples when actual analysis in infeasible. In fact, even when analytical studies of such systems are possible, simulation tends to be faster to implement. Moreover, working in a Microsoft-oriented company, much time is spent together with the .NET Framework whose class library includes a pseudo-random number generator, which we will refer to as System.Random.



For such a generator to be useful in the context of simulation, it must have certain properties. In particular we require that it has no apparent bias. For example, if a generator can produce integers, one way to use it for simulating the flipping of a fair coin would be to take such an integer, and associate it to the events of heads/tails if the integer is odd/even respectively. In this case, if the generator is biased towards even integers, it is clearly inadequate for the task at hand.



Now, one does not have to search long on StackOverflow to find claims that System.Random is not useful in the context of simulation, but what is harder to find are good reasons for why this would be the case; in particular, the Microsoft documentation of the generator makes no such disclaimers.



Here, we will show that System.Random.Next(0, int.MaxValue) and System.Random.Next(int.MaxValue), which both give a "random" integer between \(0\) and \(2^{31}-2\) (both inclusive), have much greater probabilities of returning odd numbers than even numbers. This is the case for version 4.7 and earlier versions of the .NET Framework.



The algorithm


The source code for System.Random is available through Microsoft's Reference Source, allowing us to directly review the algorithm for potential issues. The algorithm for Random.Next(b), which gives an integer between \(0\) and \(b-1\) amounts to the following:

1. Take a random integer between \(0\) and \(2^{31}-2\).
2. Divide the integer by \(2^{31}-1\) to get a double-precision floating-point between \(0\) and \(1\).
3. Multiply this floating-point with \(b\) to get a number between \(0\) and \(b\).
4. Take the integer part of this new number to get an integer between \(0\) and \(b-1\).

Arguably the first point above is the most interesting one, but for the purposes of this post, we will simply assume that the integer random number generation is perfect, and consider only the implications of the floating-point conversions that follow. As it turns out, the integer generation itself has a number of issues, which I might cover in a follow-up post; for now, let me refer to this draft in which the algorithm, and some of its issues, are discussed in detail.



Floating-points and rounding errors


Let us now focus on the case of \(b = 2^{31}-1\), and let \(m\) be the output of the first step above, that is, \(m \in \{0, \dots, 2^{31}-2\}\). Then, the number output by Random.Next(b) is x = (int)((m * (1.0/int.MaxValue)) * int.MaxValue), the result of multiplying \(m\) by the double-precision floating-point \(1/(2^{31}-1)\) (whose byte representation is 0x3E00000000200000), and afterwards by the integer \(2^{31}-1\) (represented as 0x7FFFFFFF) – both operations as floating-point operations – and taking the integer part of the result.



This would be entirely unproblematic if not for the fact that we risk incurring rounding errors in the process. Indeed, if \(f : \{0, \dots, 2^{31}-2\} \to \{0, \dots, 2^{31}-2\}\) denotes the composition of the multiplications, we find that
\[f(m) = \begin{cases} m-1, & m = 2^{n+22} + 2^{n+2}k + 2^n,\ n \in \{0, \dots, 8\},\ k \in \{0, \dots, 2^{21}-2^{20}-1\}, \\ m, & \text{otherwise.}\end{cases}\] I wish I had a nice analytical proof of this relation, but it came about only through experimentation and searching for patterns. To see that it holds, we simply check all values of \(m\); the following piece of C handles that in seconds:


#include <stdio.h>

int main()
{
int maxInt = 2147483647;
double macIntRec = 1.0 / maxInt;
int n = 1; // Index of most significant bit
int modulus;
int period;
for (int m = 1; m < maxInt; m++)
{
// When m == 2^n, update criterion for f(m) == m
if (m == 1 << n) {
printf("m = 2^%d\n", n);
modulus = 1 << n - 22;
period = (1 << n - 20) - 1;
n++;
}
int x = (int)(m * macIntRec * maxInt);
// Test m % 2^(n-20) == 2^(n-22)
if ((m & period) == modulus) x++;
if (x != m) printf("Mistake at m: %d\n", m); // This never happens
}
return 0;
}

With this in place, we see our source of bias: From the case \(n = 0\), there are \(2^{21}-2^{20}-1\) odd numbers that are mapped to even numbers after rounding, but from the cases \(n = 1, \dots, 8\), we get \(8 \cdot (2^{21}-2^{20}-1)\) even numbers that are mapped to odd numbers.



In other words, if \(m\) was taken uniformly at random, the probability that \(f(m)\) is odd is \(0.5034\), which, for the purposes of simulation, is a no-go.



For the meticulous, let us quickly note that \(\{0, \dots, 2^{31}-2\}\) contains more even numbers that odd numbers; this clearly makes no difference for the argument (and in fact only makes the situation worse).



An experiment


The above tells us that Random.Next(int.MaxValue) should output many more odd than even numbers. The argument relied on the integer random number generation actually working. Since this algorithm itself comes with bias, we should include in our discussion an actual .NET example, so let us come back to where we started and simulate coin flipping with C#:


public static void Main()
{
var rng = new Random(0);
var sampleSize = 1000000;
var numbers = Enumerable.Repeat(0, sampleSize).Select(_ => rng.Next(0, int.MaxValue));
var heads = numbers.Count(x => x % 2 == 1);
var tails = sampleSize - heads;
Console.WriteLine($"Heads: {heads}, tails: {tails}");
// Output: Heads: 503291, tails: 496709
}

As expected, odd numbers are overrepresented in the output, and the question becomes: How overrepresented? A result of 500000-500000 would probably look even more dubious, and a bit of care needs to be taken in interpreting ratios such as the above.



A frequentist statistician will tell you that in order to properly rule out an outcome such as the above, one needs to consider how likely that outcome would be had we used a truly random generator. As such, we need to calculate the probability of obtaining \(496709\) tails from a million flips of a fair coin, or an even more extreme outcome. This probability is given by the cumulative distribution function of the binomial distribution; we calculate this using the Python library SciPy.


In [1]: import scipy.stats as st

In [2]: 2*st.binom(1000000, 0.5).cdf(496709)
Out[2]: 4.6722165389139607e-11

That is, were we to flip a million coins, the probability of seeing our observed result, or something more extreme, is \(0.0000000046\%\). Not super likely.



import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

x = np.arange(4.95e5, 5.05e5)
c = st.binom(1e6, 0.5).pmf(x)
plt.ylabel('Probability density')
plt.xlabel('Number of heads')
plt.title('Outcomes of a million coin flips')

ax = plt.axes()
bbox_props = dict(boxstyle="rarrow", fc=(0.8,0.9,0.9), ec=(0.5,0.5,0.9), lw=0.5)
t = ax.text(503291, 0.00025, "System.Random", ha="center", va="center", rotation=-90,
size=15, bbox=bbox_props)
plt.gcf().subplots_adjust(left=0.15)
plt.plot(x, c)


Final remarks


The main takeaway from this, besides a warning against using System.Random for anything that requires random numbers, should be that even if we had taken as our starting point something truly random, the fact that the innocent-looking type conversions wrecked chaos indicates how every part of random number generation needs to be based on theory. Citing Knuth, "random numbers should not be generated with a method chosen at random. Some theory should be used".



In the special case of System.Random we should note that one can avoid the floating-point conversions by invoking Random.Next() directly. As mentioned elsewhere, this method has other issues meaning that it can not exactly be taken to be based on theory.



Microsoft have been made aware of the various issues surrounding System.Random. My hope is that .NET Core will evolve to support proper random number generation by default.


View comments (0)

Tags: maths random .net.python

30/7/2015, 16:16

Fun with deepdream

Towards the end of the month of June, Google released the Python source code for deepdream: A piece of software which allows for the visualisation of the potential of a collection of artificial neural networks set up for image recognition. I can't exactly claim that I dug into the technical details, but Google produced a blog post which explains in rough terms the capabilities of the software, and what is going on. Let me not attempt to reproduce that as I would not be able to do a better job than Google anyway.

Basically, the overall objective is to recognize given objects in arbitrary images. Deepdream allows insight into what the algorithms applied actually find as the findings are emphasized in the original test images. With this, one can do as I – and a huge number of other people – did, and fire up the code to produce some very silly art.



Pictures


The picture below is the result of running a photograph of Svartifoss through the GoogLeNet network; Google's contribution to an image recognition competition that ran last year. Click the image to see a bigger version. (Do it!)



One of my personal favourites is the following result of running through a given layer of the network at various numbers of iterations a photograph of a bonfire:



Another thing one can do – also included in the before-mentioned freely available example – is to iterate the recognition process while zooming in on a given point of the image. In my vanity, I asked deepdream to check the contents of one of my eyebrows:



Setting up deepdream


If the reader wants to play around with deepdream themselves, it is not difficult to find setup guides for various systems. Personally I found this guide for Debian useful. The only problem I ran into was that Caffe currently struggles with g++ 4.9.

When compiling, note that it takes an NVIDIA card to run Caffe on the GPU.


View comments (0)

Tags: deepdream pythonThis post is available in Danish.

23/6/2015, 12:53

Basic topology: The notes

I recently completed a 40 hour course on "Basic topology". The course covered the fundamentals on topological and metric spaces, including properties such as connectedness, compactness, and separation. At the end, we were able to talk in detail about elementary algebraic topology and explicitly find the fundamental groups of some simple topological spaces.

If you're curious about how one spends 40 hours on that, or if you're curious about what is going on in the preceding paragraph of this post, check out the 93 page course notes that came out of it.


View comments (0)

Tags: maths topologyThis post is available in Danish.

21/6/2015, 18:58

Agar with friends

This post describes how to set up a game with your friends in the online game agar.io. Note that this post is very likely to get outdated soon after the time of writing as the game is still under development.





Background


Agar.io is an addictive JavaScript based game in which you control a ball which eats balls that are smaller than itself. And which gets eaten by balls that are bigger than itself if you are not careful. Everybody is playing it. It's fun.

Now, as most online games, the game is a lot more fun if you play it with people you know. Until yesterday, people would accomplish this by simply manipulating the JavaScript to connect to the same game. In some cases, though, this could ruin games as people without teams would become overrun, and the feature was disabled.

As often happens, people quickly found a way to circumvent the newly introduced protection. Moreover, the fix is currently monetised on Chrome's Webshop. Monetising something that's not your own work like that is generally rather poor taste, so here is one way to obtain the same result as the monetised plugin.



The fix


Please be aware that everything here can be deduced directly from the slightly obfuscated client source code. I have no idea what the servers do.

In principle, here's how the initial connection works in the game: the client asks m.agar.io for a server in a given part of the world. The site replies with an IP for a server and a one-time code to connect to it.

Now, there's only so many servers out there so by simply repeating the request, one will eventually get more one-time codes for the same server, which can then be shared among one's friends. The procedure below explains how to do this in a way that is feasible for perhaps 2–3 people. It would be straight-forward to implement in a plugin to allow for an arbitrary number of people to join the same server, but I am not going to do that.



How to use this in practice


Step 1. Go to http://agar.io – the HTTPS version of the site also works but appears to be a bit buggy.

Step 2. Open the browser's developer console (right click → 'inspect element' → 'console' in Firefox/Chrome). Paste the following piece of JavaScript:


function getServer() {
jQuery.ajax("http://m.agar.io/", {
success: function(a) {
a = a.split("\n");
console.log("connect(\"ws://" + a[0] + "\", \"" + a[1] + "\")")
},
dataType: "text",
method: "POST",
data: "EU-London"
})
}

for (i=0;i<20;i++) {
getServer();
}

In this, replace "EU-London" with a location close to you. To see the currently active server locations, search for "EU-London" in the client source code.

Step 3. This produces a list of servers that you can connect to. To connect to a given one, simply copy one of the lines and enter it into the browser console. Now find some lines with duplicate IP:ports and share those with your friends. Remember that each line can only be used once.

Step 4. Repeat! For whatever reason (my hypothesis being that servers tend to be full), this does not necessarily work the first time around and people end up on different servers. In my experience though, a few attempts will typically suffice.


View comments (2)

Tags: agar gaming javascriptThis post is available in Danish.

29/3/2015, 10:55

Func KB-460 and Linux

This will be one of those posts that I wish I would have been able to find myself when I encountered the problems I did.

After having been forced to listen to the hype about mechanical keyboards for way too long, I recently got myself a mechanical Func KB-460, featuring some deliciously noisy Cherry MX Blues and a red back-light that you can switch off if you're worried about getting the attention it might attract. Overall, I'm satisfied with the keyboard; some parts of it seem a little flimsy but at the moment, it is in the lower end of the mechanical keyboard price spectrum, so that is to be expected.



The problem


While using the keyboard, I quickly noticed a peculiar key duplication problem: whenever I would press a given key and the '-key in quick succession, the '-key would become duplicated several times; I'm would become I'm'', and you're would become you'r''e. You can imagine how annoying this would be when producing code. Meanwhile, the keyboard appears to not run into the issue on Linux.

Func is well aware of the problem, and while they do not provide Linux support, they guided me to this kernel bug which explains in detail what is going on, and how it affects not only the Func keyboard but apparently several keyboards with n-key rollover and various keyboard layouts. Roughly, the '-key registers as two different keys at the same time, which causes buggy behaviour when key strokes are performed quickly.



The fix


As is evident from the kernel.org thread a kernel patch exists and will likely be rolled out soon. Until then, there is a simply fix for it which requires no kernel patching at all. It has been tested on two separate Debian installations and probably works on other distros as well; it's similar to the hwdb solution provided by a Fedora user on kernel.org. This one uses keymap instead.

In essence, we will kill one of the two keys associated to the '-key so that only one of them registers. First, create a file /lib/udev/keymaps/func containing a single line


0x70031 reserved

corresponding to the scancode of the key. Next, to find the relevant input correponding to the keyboard, run cat /proc/bus/input/devices. You will see a block like



I: Bus=0003 Vendor=195d Product=2030 Version=0111
N: Name="Gaming keyboard"
P: Phys=usb-0000:00:12.2-3.2/input2
S: Sysfs=/devices/pci0000:00/0000:00:12.2/usb1/1-3/1-3.2/1-3.2:1.2/input/input7
U: Uniq=
H: Handlers=sysrq kbd event7
B: PROP=0
B: EV=100013
B: KEY=7f80000000000000 e0b0ffdf01cfffff fffffffffffffffe
B: MSC=10

Notice that the keyboard will list as several devices; simply taking the last one seems to always do the trick. Now, run the command /lib/udev/keymap /dev/input/event7 /lib/udev/keymaps/func and you should be good to go. Here, the "event7" comes from the output above.

Now you could simply include this single line in /etc/rc.local to have the code run on launch but be aware that the input numbers seem to change somewhat arbitrarily in my experience.


View comments (0)

Tags: tech linux