# The 2+ year battle between ScienceCowboy and Jewel Quest 2 (demo version)

When I got my cell phone over two years ago it came pre-installed with a demo version of a game called Jewel Quest 2. The demo version allows you to play the very first level, then closes the game. This first level, like most first levels, was very easy. Too easy for me to buy the full version. But I would often find myself with a few minutes to kill, so I have been playing the demo version of this game regularly for over two and a half years now. How and why would I do such a thing? Read on.

Jewel Quest 2 is a type of “match-three” game, based on the game Shariki. Bejeweled is another such game. The way the game works is simple. You swap the position of two adjacent items, and in doing so you must create either a row or column that contains three (or more) of the same item consecutively. When you do, those items disappear and everything above them falls down to fill their places, with new items appearing at the top. The level ends when you have made an item disappear from every space on the board or the time runs out, whichever comes first. You can try playing the game for free on the web here (this web version has 5 types of items instead of the 4 on my phone).

As I continued to play I would make the game more challenging by giving myself new restrictions on how I could play. First I decided I would try to complete the level as fast as I possibly could. Pure speed. After I tired of this, I changed my objective to completing the level in as few moves as possible. This required more planning, so that one move would cause chain reactions on the board. I eventually tired of this and started instead placing limits on *where* I could make moves. Can’t make any moves in the top two rows. Then in the top four rows. And so on. The hardest version so far is that I can only make moves within rows three and four from the bottom (not in the bottom two rows nor in the top four). This is pretty difficult. The only way you can reach the bottom row of a column is to first get the item in the second row to match the item in the bottom row (remember you can’t touch the second row directly), then move another of that item into the third row above them to complete the set.

Sometimes you get lucky and have a few columns like this at the start of the game. Let’s call it a ‘success’ when a column starts with the same kind of item in its bottom two rows. In the screenshot above I started this game with 3 such successes. So I wondered what the chances were of starting with 0, 1, 2, 3, or more ‘successes’. The tricky part is that the starting board cannot have any 3-in-a-row (or column) because they would of course disappear. So at the start of the game the contents of each space are not *statistically independent* of each other. If there are two in a row, the next one *can’t* be that kind. And this works both side to side and up and down.

What I *think* is happening is that each of the four kinds is equally likely, unless there are already 2 of the same kind in a row or column preceding it. In that case, the kind in that space is equally likely to be any of the remaining possible kinds. If I’m right about this, I can think of two ways to calculate the probability of having 0, 1, 2, 3, or more ‘successes’ at the start of the game. One way is to simply do the mathematical calculation. I won’t tell you about how this is done now, but I may in another post.
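For readers who like symbols, here is one way to write that guess down. The notation is mine and not anything taken from the game: let $A$ be the set of kinds still allowed in a space, meaning all four kinds minus any kind that already fills the two preceding spaces in the same column, and minus any kind that already fills the two preceding spaces in the same row. The guess is then

$$
P(\text{space holds kind } k) =
\begin{cases}
\dfrac{1}{|A|} & \text{if } k \in A \\
0 & \text{if } k \notin A
\end{cases}
$$

With four kinds, $|A|$ can only be 4 (nothing ruled out), 3 (one kind ruled out), or 2 (a different kind ruled out in each direction).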

The other way is to write a computer program that draws a starting game board according to the rules I think it is using. The program is simple. It enters an empty space and says “This could be any of the four kinds.” But before drawing the kind of item for a space it asks “Are the two spaces above the same as each other? If so, don’t use that kind for this space,” and then asks the same question about the two spaces to the left. Then it draws the kind to put in my new space from the possible kinds remaining. It then moves to the next empty space and repeats, until the board is drawn. After you’ve drawn a board, the program counts the number of our ‘successes’ and stores that number, and does this over and over.
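For the curious, the program described above could look something like the following Python sketch. It is not the program I actually ran, just a minimal version of the same idea. The board size is an assumption (8 by 8 here, since the size isn't given in this post), and the names `draw_board`, `count_successes`, and `simulate` are made up for illustration.

```python
import random
from collections import Counter

KINDS = 4          # four kinds of item, as on the phone version
ROWS, COLS = 8, 8  # ASSUMPTION: the real board size is not stated in the post

def draw_board(rows=ROWS, cols=COLS, kinds=KINDS, rng=random):
    """Fill the board top to bottom, left to right, never making 3 in a line."""
    board = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            allowed = set(range(kinds))  # "this could be any of the four kinds"
            # Are the two spaces above the same as each other? Rule that kind out.
            if r >= 2 and board[r - 1][c] == board[r - 2][c]:
                allowed.discard(board[r - 1][c])
            # Same question about the two spaces to the left.
            if c >= 2 and board[r][c - 1] == board[r][c - 2]:
                allowed.discard(board[r][c - 1])
            # Draw uniformly from whatever kinds remain.
            board[r][c] = rng.choice(sorted(allowed))
    return board

def count_successes(board):
    """A 'success' is a column whose bottom two spaces hold the same kind."""
    bottom, second = board[-1], board[-2]
    return sum(1 for a, b in zip(bottom, second) if a == b)

def simulate(n_boards=10_000, seed=0):
    """Estimate the probability of each success count by drawing many boards."""
    rng = random.Random(seed)
    counts = Counter(count_successes(draw_board(rng=rng)) for _ in range(n_boards))
    return {k: counts[k] / n_boards for k in sorted(counts)}
```

Calling `simulate()` returns a dictionary mapping each number of successes to its estimated probability. The fill order (top to bottom, left to right) is part of the guess about how the game works; a different order could give slightly different probabilities.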

This gives me the probability that I will get 0, 1, 2… successes if I play by the rules I laid out. They can be shown in a histogram for visualization. The tallest bars are for 1 and 2 successes, which means that these are the most likely outcomes under my rules. Likewise the bars are very small for 4, 5, or 6 successes, meaning they are very unlikely.

Now that I know what I should expect *if* I am right, I need some data from the actual game itself. I started the game on my phone 200 times and counted the number of ‘successes’ each time. This is much more boring than actually playing the game, but is the only way to test my theory. I can put the results into a histogram to compare visually to the simulations. Looks pretty close.

I’ve put the results into a table, where the outcomes of the real game are called “observed.” The simulations I ran gave me probabilities, and from them I know how many games with each number of successes I should expect if I play 200 times. I put these into the table as “expected.” Looking at the table, the observed and expected don’t look too different. This looks pretty good for my theory, but eyeballing it isn’t good enough. We need to do a statistical test.

The question we want to ask is “Are the results of the real game very different from what I would expect if I was right about the rules of the game?” Another way to phrase the question is to ask “Could my rules have produced the data we observed playing 200 games?” There are several statistical tests I can use to ask these questions, and I don’t have the space here to describe them. But the results of the tests showed that the data we saw from the game *could* have reasonably been produced by my rules. What it really says is that what we observed from the game is not different *enough* from our expectations that we can prove me wrong.
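To give a flavor of what such a test can look like, here is a sketch of one common choice: Pearson's chi-square statistic with a p-value estimated by simulation. This is an illustration, not necessarily the test I used, and the function names are made up. It assumes `observed` is a list of game counts (one entry per number of successes) and `probs` is the matching list of probabilities under my rules, all greater than zero and summing to 1.

```python
import random

def chi_square(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def monte_carlo_p_value(observed, probs, n_sims=10_000, seed=0):
    """Fraction of simulated samples at least as far from expectation as the real one."""
    rng = random.Random(seed)
    n = sum(observed)                  # e.g. 200 games
    expected = [n * p for p in probs]  # expected count for each outcome
    actual = chi_square(observed, expected)
    outcomes = range(len(probs))
    at_least_as_extreme = 0
    for _ in range(n_sims):
        # Pretend the rules are true and "play" n games under them.
        draws = rng.choices(outcomes, weights=probs, k=n)
        fake = [0] * len(probs)
        for d in draws:
            fake[d] += 1
        if chi_square(fake, expected) >= actual:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims
```

A small result (say, below 0.05) would mean the rules rarely produce data as lopsided as what the game gave, which counts as evidence against the rules. A large result means only that the data are compatible with the rules.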

While this means the evidence supports my theory, *it does not prove I was correct.* It is possible that the rules of the game are really slightly different, but that 200 games is not enough to prove I was wrong. The bigger the sample size gets, the more powerful the statistical tests become to distinguish subtle differences. This is an important fact about statistical tests: small samples do not always reflect the truth. This is one reason why political polls are often misleading (there are several other reasons as well).

Here’s one way I could have been wrong: Perhaps on this easy demo level you are more likely to get large clusters of the same kind of gem than you would expect at random. For example, if you had two diamonds in a row, we know the next one cannot be a diamond, but maybe the *next* next one is *more* likely to be a diamond, to make it easier for us to complete the move. This could change as the levels get harder. I could test this as well. I can easily simulate with my computer program what I should get, so I would just need to collect more data from the actual game on how the gems cluster.

I’ll leave that task up to you. I’m too busy creating harder versions of this demo level for myself.