Elke asked about the motivation behind my latest post on Not Banjaxed, concerning how, when institutions are ranked according to some criteria, the smaller institutions tend to be found both at the top and at the bottom of the rankings. This behavior, closely tied to what mathematicians call the “Law of Large Numbers,” is due mainly to the increased variability one associates with smaller samples.
Since the question is interesting enough to warrant a more detailed response, but is not really work related, it made sense to respond over here. Or perhaps I have been on the Internet too much lately and am following suit and turning my blogs into click-bait!
As is often the case, the post in question came out of a casual conversation I had recently with a friend. His child was doing some math homework for school and was asked to find some way to electronically model the rolling of a pair of dice, something that happens in many board games. The child had chosen to use Microsoft Excel and had simply used the random function to generate random integers from 2 (the lowest you can get on a pair of dice) to 12 (the highest).
I told him that was not correct and then proceeded to explain why. That, in turn, led to a discussion of how what seems simple and straightforward often is not. Since we both had significant experience teaching in rural schools, I used the example in the other post to prove my point.
So, a casual conversation led first to one post and now to another.
Since I brought up the problem of simulating the tossing of two dice, I might as well give you the same explanation.
Tossing two dice does give a sum (total) between 2 and 12, but it is not the same as choosing a random number between 2 and 12, because not all of the totals are equally likely.
Let’s model all of the outcomes using a table. The numbers in bold represent what is on the face of each die. Each cell shows the total for the toss.
Table 1: All of the outcomes if two dice are tossed.
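For readers who would like to regenerate the table themselves, here is a minimal Python sketch. It is not the original Table 1, just one way to rebuild the same 6 by 6 grid of totals; the names faces and table are my own choices for illustration.

```python
# Build the 6 x 6 grid of totals for two dice.
# Rows are the face showing on die one, columns the face showing on die two.
faces = range(1, 7)
table = [[die_one + die_two for die_two in faces] for die_one in faces]

# Print a header row for die two, then one row per face of die one.
print("    " + " ".join(f"{face:>2}" for face in faces))
for die_one, row in zip(faces, table):
    print(f"{die_one:>2} |" + " ".join(f"{total:>2}" for total in row))
```

Each of the 36 cells corresponds to one equally likely pair of faces, which is why counting cells is enough to compare the totals.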
Notice that, based on the frequencies in the table, not all of the totals are equally represented. For example, there is only one way to get a Two, namely, by rolling double ones. Likewise, there is only one way to get a Twelve. Other totals, “Lucky Seven,” for example, are more frequent.
A histogram that shows the frequencies of each outcome demonstrates this behavior clearly:
The histogram shows the frequency of all of the possible totals. It is clear that they are not equally likely. As already mentioned, only one combination can give either a Two or a Twelve. Each of the other totals, though, can be obtained in more than one way, with Seven being the most frequent of all.
The histogram above shows the frequency of the various types of outcomes. This can be used, in turn, to predict the theoretical probability of obtaining that outcome if the dice were actually rolled. Probability is determined by dividing the number of favourable outcomes by the total number of outcomes.
Let’s find the probability of obtaining “Lucky Seven,” for example. You can see from the histogram that there are 6 ways of getting an outcome of 7. If you check Table 1 you will notice that there are a total of 36 possible outcomes. To get the probability of obtaining an outcome of 7 you just divide the two, that is, P(7) = 6/36, or approximately 0.17 (it’s actually the repeating decimal 0.1666...). If you do this for all of the outcomes then you can recast the histogram as a bar graph showing the probability of any particular outcome.
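The same division can be carried out for every total at once. The sketch below is just one way of doing it in Python; the names ways and probability are mine, and exact fractions are used so that nothing is lost to rounding.

```python
from collections import Counter
from fractions import Fraction

# Count how many of the 36 equally likely (die one, die two) pairs give each total.
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))
total_outcomes = sum(ways.values())  # 6 * 6 = 36

# Probability = favourable outcomes / total outcomes, kept as an exact fraction.
probability = {
    total: Fraction(count, total_outcomes)
    for total, count in sorted(ways.items())
}

for total, p in probability.items():
    print(f"P({total}) = {ways[total]}/{total_outcomes} = {float(p):.4f}")
```

The line for 7 reproduces the 6/36 worked out above, and the eleven probabilities add up to exactly 1.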
The model chosen by my friend’s child did not take this into account. Choosing a random number between 2 and 12 assumes that all of the outcomes would be equally likely, which they are not.
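To put a number on how far off the single-random-number model is, here is a small sketch comparing it with the two-dice probabilities. It relies on a shortcut that is not spelled out in the post: the number of ways to roll a given total is 6 minus the distance of that total from 7. The names uniform and two_dice are illustrative only.

```python
from fractions import Fraction

# Picking one random integer from 2 to 12 gives each of the 11 totals the same chance.
uniform = Fraction(1, 11)

# Exact two-dice probabilities, using the shortcut: ways = 6 - |total - 7|.
two_dice = {total: Fraction(6 - abs(total - 7), 36) for total in range(2, 13)}

for total, p in two_dice.items():
    ratio = float(uniform / p)  # how many times too often the uniform model gives this total
    print(f"{total:>2}: dice {float(p):.3f}  uniform {float(uniform):.3f}  ratio {ratio:.2f}")
```

Under these assumptions the uniform model produces a Two a bit more than three times as often as real dice would, and a Seven only a little more than half as often.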
So, what’s a simple solution to the problem?
Easy. Just recall the way that Table 1 was structured: the rows showed the results for die one and the columns showed the results for die two. Each of the two rolls had no bearing on the other; that is, we consider them to be independent events.
So, rather than using the random number generator to select a single number from two to twelve, all you do is use the same generator to select two numbers from one to six and then add them.
Of course you know I could not resist the urge to try it out. I opened Excel to a blank worksheet and entered this formula in 1000 cells: =RANDBETWEEN(1,6)+RANDBETWEEN(1,6)
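If you do not have Excel handy, roughly the same experiment can be sketched in Python. This is an analogue of the formula, not what I actually ran: random.randint(1, 6) stands in for RANDBETWEEN(1,6), and the function name roll_two_dice and the seed value are arbitrary choices of mine.

```python
import random
from collections import Counter

def roll_two_dice(rng):
    """Roll two independent dice and return their sum."""
    # randint includes both endpoints, like RANDBETWEEN(1,6).
    return rng.randint(1, 6) + rng.randint(1, 6)

rng = random.Random(2024)  # arbitrary seed (an assumption), used only so reruns match
rolls = [roll_two_dice(rng) for _ in range(1000)]
counts = Counter(rolls)

# Tally of how often each total came up in the 1000 simulated tosses.
for total in range(2, 13):
    print(f"{total:>2}: {counts[total]:>4}")
```

Removing the seed argument gives a fresh set of 1000 tosses on every run, which is closer to how the worksheet behaves when it recalculates.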
It would not be a great idea to try and show the 1000 numbers in the space below, as the result would not render well on a mobile. I just took a screen capture instead. The image below shows the numbers.
I then used the data analysis feature to plot a bar graph of the experimental results. The results are shown below.
If you compare Figure 4 (the experimental probability graph) to Figure 3 (the theoretical probability graph) you will notice that there’s a close but not exact match. That’s because 1000 trials are not really enough to smooth things out! Trying it 10,000 times would have been more like it, but I’m sure you get the idea. The model is pretty good.
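To see the effect of the number of trials for yourself, the sketch below measures the largest gap between the experimental and theoretical probabilities for a few different trial counts. The helper names experimental and largest_gap are hypothetical, and the trial counts beyond 1000 are simply ones I picked for illustration.

```python
import random
from collections import Counter

# Theoretical probability of each total: (6 - |total - 7|) / 36.
THEORY = {total: (6 - abs(total - 7)) / 36 for total in range(2, 13)}

def experimental(trials, rng):
    """Relative frequency of each total over the given number of simulated tosses."""
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(trials))
    return {total: counts[total] / trials for total in range(2, 13)}

def largest_gap(trials, rng):
    """Largest absolute difference between experimental and theoretical probability."""
    observed = experimental(trials, rng)
    return max(abs(observed[total] - THEORY[total]) for total in range(2, 13))

rng = random.Random()  # unseeded, so every run gives slightly different numbers
for trials in (1000, 10_000, 100_000):
    print(f"{trials:>7} trials: largest gap = {largest_gap(trials, rng):.4f}")
```

The gap tends to shrink as the number of trials grows, although on any single run there is no guarantee that it will, which is rather the point.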
So what’s the point? This: once again, intuition is not very effective when you are working with numbers. In this case, it did make a lot of sense to think that generating random numbers between 2 and 12 would have done a reasonably decent job of simulating the toss of two dice. It was not, however, a good idea, as the act of rolling two dice has a total of 36 separate outcomes, which, in turn, generate 11 different results, from 2 to 12. Unfortunately the likelihood of each of those 11 results was not the same, so a simple random number generator would not work. Instead we had to model the two rolls separately and then add them up. We did see, though, that this simulation gave results that modeled the ideal.
As an added bonus we also saw that a fairly large number of trials, 1000 in this case, still did not give a perfect match to the ideal distribution, thus once again demonstrating just how much randomness can affect even simple situations.