Monday 8 July 2013

What Every Computer Scientist Should Know About Random Numbers

This article outlines a few key things you should know about random numbers if you're a programmer, in the same vein as What Every Computer Scientist Should Know About Floating Point Arithmetic by David Goldberg (a more accessible version of which is available at The Floating-Point Guide) and Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

But this will be much shorter than those articles.

The Basics


What are random numbers?


A random number sequence is one where the next number in the sequence cannot be predicted, even if you know all of the previous numbers—i.e. one that contains no discernible pattern.

A random number generator is a device or program that produces such a pattern.

Why do we need random numbers?


They have applications in computer games, simulations and computer security. Simple examples would be picking the next track for a music player to select in shuffle mode or the order of a deck of cards in an online poker game. Additionally, many cryptography and security applications rely on starting with some unknown (and unpredictable) number or sequence of numbers, and their security falls down if they can be predicted.

How can a computer, which is deterministic, produce random numbers?


Unless you take special measures (see the next section) computers can't produce truly random numbers, so instead they do something pretty close that's usually good enough—they produce pseudo-random numbers, which are sequences of numbers that (i) appear random to humans and (ii) are statistically random. Such sequences generally have a uniform distribution—i.e. all numbers within the range of numbers produced by the random number generator are equally likely to be chosen. Another way of looking at this is that if you generate many pseudo-random numbers, each number within the range will be chosen roughly the same number of times, so no numbers will be preferred or chosen more frequently than any others. Of course there are other distributions you may want (e.g. a normal/Gaussian distribution), but usually you'll have a good reason for this (if you've never heard of normal or Gaussian distributions, they're very unlikely to be what you want).

Pseudo-random number sequences generally start with a seed value from which all the other numbers in the sequence follow using some algorithm. The algorithm itself is deterministic, and this is what makes the numbers it generates pseudo random as opposed to truly random.

What about if I want truly random numbers?


This is generally done using hardware entropy—the operating system detects things about hardware which cannot be predicted, such as when traffic arrives on your network port, and uses it to generate numbers that are truly random. On most UNIX systems these numbers can be accessed through /dev/random, which simply produces random bytes whenever you read from it. However, this method is a lot slower than producing pseudo-random numbers (just try running hd /dev/random on any UNIX system; wiggle the mouse or pound the keyboard to hurry things up), and doesn't have the advantage of being able to repeat the sequence for debugging/testing.

Random Number APIs


Most languages' standard libraries come with facilities to generate random numbers. They generally consist of at least two basic things:
  1. A way to seed a sequence of pseudo-random numbers, and
  2. A way to get the next number in the sequence.
Sometimes it's useful to have the same seed, for example for reproducing problems when debugging. Other times it's necessary to choose a seed from a source such as the current time in order to get a different pseudo-random sequence on each run of a program. Be careful choosing your seed from the time—depending on the resolution of the timer you're using, you could end-up with the same sequence being used on consecutive runs of a program (see an earlier post about this).

Common Mistakes when Producing Random Numbers


These are the mistakes most beginners make, especially when asking for help on forums.

Trying to make the numbers “more random”


It is a common mistake for people to assume that “pseudo-random” really means “not-very-random” and that they should try to somehow “improve” this randomness. This is typically done by calling some random number generation function (e.g. the standard C rand() or Java's Math.random()) several times and combining the results, as seen in this post on The Daily WTF, for example.

Producing random numbers with a uniform statistical distribution is very hard—you need some pretty advanced number theory to really know what you're doing and get it right. In many cases, combining several random numbers in this way produces a bias for/against some numbers, which actually makes these functions less random than they started out.

On the other hand if you do think you've produced better random numbers than some other implementation, NIST has some statistical tests you can download to check this.

Re-seeding the random number generator every time you need a new random number


This is another common mistake—a pseudo-random number sequence should generally be seeded once and only once. You'll only need to re-seed such a sequence if you want to reproduce a previous run of random numbers (e.g. for debugging/testing). Frequently when people attempt to re-seed the generator they make use of the current time, which can lead to the same “random” number being used several times in a row if the timer resolution isn't fine enough and the random number function is called often enough.

How You Should Produce Random Numbers...


How you should produce random numbers depends mostly on what's at stake—i.e. what the consequences would be if someone were able to predict the next random number in a sequence.

...when no personal data is at stake


This will be the case for most random number applications, e.g. for games that need to shuffle cards/simulate dice rolls, etc.

In this case, use the random number generator that comes with your language's standard library, seeded with the current time. If you need the random number to be within a certain range, take the modulus and add the base of the range, e.g. in C:

/*
 * Get a random number in range 100 to 110, inclusive.
 * There are 11 numbers to choose from (110 - 100 + 1 = 11).
 */
int x = (rand() % 11) + 100;

This will work fine so long as the low-order bits of your implementation are as random as the high-order bits, which they will be in any half-way decent implementation. If you're worried about the quality of your standard library's implementation, check out the GNU Scientific Library's random number facilities.

If your timer resolution is in seconds and you're worried about people getting the same seed by running two instances of your program in quick succession, combine the current time with the process ID. Alternatively on UNIX you can use /dev/urandom to generate either a seed or a stream of pseudo-random numbers.

You may also want to consider (i) printing out the seed you use and (ii) adding a flag to set the seed to an explicit value. This will allow you to reproduce problems that may only occur when a certain sequence of numbers is generated.

...when personal data is at stake


This will be the case for cryptographic applications where bank details, passwords and the like are being protected. These normally involve a small quantity of random numbers being used once to generate a key, so performance is rarely an issue.

In this case, using a true random number generator is essential. On most UNIX systems you can use /dev/random for this purpose, though if you need a large quantity of random numbers this may take some time—encourage the user to wiggle the mouse in order to generate more hardware entropy (seriously).

If your platform doesn't have such a facility, random.org can produce truly random bytes over a secure HTTPS connection. And if this isn't an option and a true random number generator isn't available, /dev/urandom is probably the safest fallback.

References


No comments:

Post a Comment