If anyone tells you he or she understands probability theory, do not believe them. That person simply does not know enough of it to admit, that probability theory is riddled with paradoxes, where common sense must step aside and wait in silence, or your brain will hurt. Substring statistics is probably among the lesser-known, yet magically unintuitive examples.

Consider a sequence of random coin flips. Each coin flip is either a "heads" or a "tails", hence the sequence might written down as a sequence of `H`

and `T`

-s: `HTHTHHTT...`

It is easy to show that the probability of the sequence to begin with, say, `HHH`

is equal to P(`HHH`

) = 1/8th, as is the case with any other three-letter combination: P(`HHT`

) = P(`THH`

) = P(`THT`

) = 1/8, etc. Moreover, by symmetry, the probability of seeing a particular three-letter combination at *any fixed position* in the sequence is still always 1/8-th. All three-letter substrings seem to be equivalent here.

But let us now play a game, where we throw a coin *until we see* a particular three-letter combination. To be more specific, let us wait until either `HHT`

or `HHH`

comes up. Suppose I win in the first case and you win in the second one. Obviously, the game first proceeds until two heads are flipped. Then, whichever coin flip comes up next determines the winner. Sounds pretty fair, doesn't it? Well, it turns out that, surprisingly, if you count carefully the *expected number of coin flips* to obtain `HHT`

, it happens to be 8, whereas for `HHH`

it is 14! Ha! Does it mean I have an advantage? Suprisingly again, no. The probability of `HHT`

occuring before `HHH`

in any given sequence is still precisely 0.5 and, as we reasoned initially, the game is still fair.

We can, however, construct even more curious situations with four-letter combinations. Suppose I bet on `HTHT`

and you bet on `THTT`

. The expected number of coin flips to obtain my combination can be computed to be 20. The expected number of flips to get your combination is smaller: 18 flips. However, it is still *more probable* (64%) that my combination will happen before yours!

If this is not amusing enough, suppose that four of us are playing such a game. Player A bets on the string `THH`

, Player B bets on `HHT`

, player C on `HTT`

and player D on `TTH`

. It turns out that A's combination will occur earlier than B's with probability 75%. B's combination, however, wins over C's with probability 66.7%. C's combination, though, wins over D's with probability 75%. And, to close the loop, D wins over A with probability 66.7%! This is just like the nontransitive dice.

Hopefully, you are amazed enough at this point to require an explanation for how this all might happen. Let me leave it to you as a small puzzle:

- Explain in simple terms, how can it happen so that the
*expected time to first occurrence*of otherwise equiprobable substrings may be different? - Explain in simple terms, how can it be so that one substring has higher than 50% chance of
*occuring earlier*than some other substring. - Finally, explain why the two effects above are not strictly related to each other.

PS: The theory used to compute actual probabilities and expected times to occurrence of a substring is elegant yet somewhat involved. For the practically-minded, here is the code to check the calculations.

Valuable information. Fortunate me I found your site by

accident, and I'm shocked why this twist of fate didn't took place earlier!

I bookmarked it.