Playing efficiently, information and entropy
Maybe you played this two-player guessing game as a child: each player chooses a face from $20$-ish possibilities while asking yes/no questions to the other player about their choice. In French, it was "Qui est-ce ?", and it seems to be "Who's who" for English folks.
Everybody knows how to play this game, but there should be a best way to play it, right?
First, a bit of theory on information and entropy.
Information
Information, in Shannon's terms, quantifies how much we learn from a given event of probability $p$.
The unit of information depends on the base in which we compute it. The usual base is $2$, which yields information in bits (or Shannons) and is denoted $I$.
One advantage of base two is the intuitive meaning we can attach to a bit of information. Before going into equations, a hand-wavy explanation: $1$ bit of information splits the space of possibilities in half, $2$ bits in four, $3$ in eight, and so on.
E.g. consider an unbiased coin toss. The side it lands on is one of two possibilities, with equal probability. We model it as the random variable $X$ with a discrete uniform distribution.
$$P(X=\text{head}) = P(X=\text{tail}) = 1/2$$
Once we observe the final outcome, we gain information based on which side it landed on. We went from a space of $2$ possible outcomes to a unique outcome. We reduced the size of the space by a factor of two (from $2$ to $1$), which is the same as saying we gained $I = 1$ bit of information.
This example may be too trivial to really get the concept, so consider $2$ tosses of the same coin. We now have $4$ possible outcomes, represented by the values the random variable $Y$ can take.
$$ \begin{aligned} P(Y=\text{[H, H]}) = 1/4\cr P(Y=\text{[H, T]}) = 1/4\cr P(Y=\text{[T, H]}) = 1/4\cr P(Y=\text{[T, T]}) = 1/4\cr \end{aligned} $$
We start here with a possible outcome space of size $4$. Once we toss the first coin, as in the first example, we gain $1$ bit of information, splitting the outcome space in two.
Then, the second toss will also split the space in two, leading to a single outcome.
You see, information is additive when events are independent. Flipping the coin twice gave us a total of $I = 2$ bits of information about the original possibility space.
This can be nicely described with this equation, where one more bit of $I$ splits $p$ in half. $$\left(\frac{1}{2}\right)^I=p$$
By rearranging, we deduce the expression of $I$ as a function of $p$, the outcome space splitting factor (or probability of that event). $$I = -\log_2(p)$$ That's the definition of Shannon's information.
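As a quick sanity check of the two coin examples, here is a minimal snippet (using numpy, which we will rely on later anyway, even though plain math.log2 would do):
import numpy as np
print(-np.log2(1/2))   # one fair toss: 1.0 bit
print(-np.log2(1/4))   # two independent tosses: 2.0 bits, information adds up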
Entropy
We will also need the Information Theory definition of entropy. Here, entropy measures the uncertainty of a given random variable.
If we take our r.v. $Y$, which outputs the result of two consecutive tosses, we feel that there is more uncertainty in it than in, say, $X$. That's because we expect more information from $Y$ than from $X$ on average. Entropy measures exactly that: the expected information of a random variable.
Over the possible outcomes indexed by $i$, with probabilities $p_i$, the entropy $H$ is the expectation of the information $I(p_i)$.
$$H = \sum_i p_i I(p_i) = -\sum_i p_i \log_2(p_i)$$ Let's compute the entropy of our previous variables $X$ and $Y$.
$$ \begin{aligned} H(Y) &= 4 \cdot \frac{1}{4} \log_2(4) = 2 \text{ bits}\cr H(X) &= 2 \cdot \frac{1}{2} \log_2(2) = 1 \text{ bit}\cr \end{aligned} $$ You probably noticed that the entropy formula seems overly complicated for the uniform case. Well, that's because the uniform distribution is a special case: it maximizes this function. No probability distribution over $N$ outcomes can have greater entropy than the uniform one. It's a fundamental limit, and it led to the well-known Shannon source coding theorem.
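If you want to check these numbers yourself, here is a minimal sketch (the helper H is just a throwaway name, nothing specific to this post):
import numpy as np

def H(p):
    # entropy in bits of a discrete distribution given as an array of probabilities
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

print(H([1/2, 1/2]))                    # X, one fair toss: 1.0 bit
print(H([1/4, 1/4, 1/4, 1/4]))          # Y, two tosses: 2.0 bits
print(H([0.7, 0.1, 0.1, 0.1]))          # a skewed 4-outcome distribution: ~1.36 bits < 2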
A thing worth noting is that entropy indicates how many yes/no questions (each splitting the possibilities in two) we should ask in order to identify one particular outcome.
Ok, enough of theory, game time.
Who's who?
Setting up
Who's who has $20$ different characters with shared attributes like glasses, hair color, etc. The goal is to be the first to find the other player's character by asking yes/no questions about it.
E.g. "Are you a girl ?"
E.g. "Do you have a magnificent mustache ?"
We won't play the actual game, because I'm not bored enough to extract each characteristic of the copyrighted game, but we will play a similar one.
First I need to build a set of characteristics, and their distribution shouldn't be uniform. Where to find real distributions? In real life! For the sake of demonstration, I will get my source of entropy by processing the lyrics of Never Gonna Give You Up by Rick Astley.
These lines read the lyrics from a text file, clean them up a bit, and count the occurrences of each character (along with the imports we will need later).
from collections import Counter
import string
import numpy as np
import matplotlib.pyplot as plt

c = Counter(''.join(open("lyrics", "r").readlines())
            .replace('\n', '')
            .replace(' ', '')
            .replace(',', '')
            .lower())
Then, by sorting in descending order, we can look at the most frequent letters (showing only the top $10$ here).
d = sorted(dict(c).items(), key=lambda x: x[1])[::-1]
d[:10]
[('n', 201),
('e', 197),
('o', 162),
('a', 111),
('r', 91),
('u', 84),
('g', 81),
('y', 72),
('t', 69),
('v', 57)]
Let's plot the probability (in frequency terms) of each letter by computing the ratio of its number of occurrences over the total number of letters.
x = [k[0] for k in d]
y = [v[1] for v in d]
plt.bar(x, list(map(lambda l: l/sum(y), y)))
Since we just learned about information, let's visualize each letter's information.
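The plot itself can be produced the same way as before; a minimal sketch, reusing x and y from above (letter_probs is just a throwaway name):
letter_probs = [v / sum(y) for v in y]
plt.bar(x, [-np.log2(p) for p in letter_probs])
plt.ylabel("information (bits)")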
Information is greater when probability is lower, that's expected.
Why did we do all that, actually? Oh yes, to get non-uniform probabilities to play a game with. So let's set up our game.
We need something to guess: a character, which has a set of attributes. Our characters will have a fixed number of attributes, which I chose to be $5$. Attributes are exclusive, which means a character can't have the same attribute twice. In total, we have as many possible attributes as there are distinct characters in the lyrics of Never Gonna Give You Up.
This number is:
len(d)
> 26
Each attribute will carry the same information as one of those letters. To make this even more confusing, my attributes will be identified by letters from $a$ to $z$, which is funny because there will be $26$ attributes, but for the wrong reasons. To sum up, our characters look something like this.
$$\text{Bob} = [\text{a}, \text{d}, \text{h}, \text{z}, \text{e}]$$ $$\text{Alice} = [\text{c}, \text{g}, \text{t}, \text{o}, \text{m}]$$
And information of each attribute will follow this distribution:
Time to create characters. I will create $20$ characters, each one a vector of $5$ attributes sampled randomly without replacement. First, some utility functions:
prob = list(map(lambda l: l/sum(y), [v[1] for v in d]))
attr = list(string.ascii_lowercase)
def get_character(prob, attr):
return list(np.random.choice(a = attr, size = 5, p = prob, replace = False))
Call it $20$ times.
characters = [get_character(prob, attr) for _ in range(20)]
> [['d', 'f', 'l', 'k', 'g'],
['c', 'k', 'a', 'd', 'q'],
['r', 'j', 'l', 'h', 'u'],
...
Ok, we have our characters, so we can already play the game: ask someone to choose a character, hidden from you, and ask them questions like:
E.g. "Do you have an $a$ ?"
E.g. "Do you have an $h$ ?"
After each question, you hopefully narrow down your options, gaining information from each answer and reducing the entropy of your situation.
Actually, we know the entropy we are starting with. If your friend picks at random, we have no clue about which character they picked, so we are in the uniform case with $20$ possible outcomes. The starting entropy is therefore:
$$ H(X) = 20 \cdot \frac{1}{20} \log_2(20) \approx 4.32 \text{ bits} $$
Here, I give you two utility functions to compute information and entropy.
# information carried by an event of probability p, in bits
information = lambda p: -np.log2(p)
# entropy of a distribution given as an array of probabilities
entropy = lambda p: np.sum(p * information(p))
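As a quick check, they reproduce the starting entropy we just computed by hand:
entropy(np.ones(20) / 20)   # ≈ 4.32 bits: a uniform pick among the 20 characters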
Playing a dumb game
The goal of the game is simple: we start with $4.32$ bits of uncertainty and we gain information by asking questions. Every bit of information we get from an answer reduces the current entropy. Once the sum of the information we got reaches $4.32$ bits, we are left with $0$ bits of entropy, that is to say: only one possible case, our friend's pick.
Before thinking about how to win as quickly as possible, let's play like a child: ask random questions. I will use this to demonstrate what I said above: the interplay between information and entropy.
I start by coding a function that asks a question, and another one that gives the answer.
def ask_question():
q = np.random.choice(a = attr)
print(f'Do you have {q}?')
return q
def get_answer(q, character):
ans = q in character
print("Yes" if ans else "No")
return ans
Then, I ask a friend (who happens to be named np.random) to choose one of the characters we created earlier.
character_to_find = characters[np.random.randint(len(characters))]
And I ask my first question.
q = ask_question()
> Do you have m?
My CPU, kind enough, answers.
a = get_answer(q, character_to_find)
Yes.
Ok, we got lucky.
We found an attribute on the first try. But how much information is it worth? Since each character is sampled in the same manner, we could read it off the information graph above. But a more realistic scenario is to assume we do not have this knowledge. All we can do is estimate it from what we have: the $20$ characters in front of us.
I need the probability of finding m among the characters, so I will simply count how many of the $20$ have an m and divide by $20$ (the frequentist way).
def get_prob_of_attr(attr, characters):
return sum(attr in c for c in characters) / len(characters)
information(get_prob_of_attr(q, characters))
> 2.322 # bits
Asking this question gave us $2.32$ bits of information, which, starting from $4.32$, should leave us with an entropy of $2$ bits. And, if you remember the first part, $2$ bits was the entropy of a uniform choice among $4$ possible outcomes. If all is right, there should be only $4$ characters left after this question. Better check.
current_characters = [c for c in characters if 'm' in c]
> [['d', 'm', 'a', 'v', 'h'],
['a', 'u', 'g', 'c', 'm'],
['m', 'b', 'a', 'p', 'i'],
['c', 'i', 'm', 'j', 'w']]
All right, everything makes sense. Let's ask another question.
q = ask_question()
> Do you have g?
a = get_answer(q, character_to_find)
No.
Hmm, is this useful? It should be, because the second of the remaining characters has a $g$, so we now know it is not that one. There is indeed some non-zero information:
p = get_prob_of_attr(q, current_characters)
information(p if a else 1-p)
> 0.415 # bits
We are bringing this entropy down, it should be around $2 - 0.4 = 1.6$ bits now.
We are not done yet, we still have $3$ possibilities.
current_characters = [c for c in current_characters if 'g' not in c]
> [['d', 'm', 'a', 'v', 'h'],
['m', 'b', 'a', 'p', 'i'],
['c', 'i', 'm', 'j', 'w']]
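We can double check with the entropy helper: a uniform pick among the $3$ remaining characters should indeed be worth about $1.58$ bits.
entropy(np.ones(len(current_characters)) / len(current_characters))   # ≈ 1.58 bits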
Enough of random guesses, I want to play too. I will ask my own question: does it have an $a$?
a = get_answer('a', character_to_find)
Yes.
Dang! I feared it, I'm still left with $2$ outcomes. Wait, $2$ outcomes meant an entropy of $1$ bit earlier. This also means the information gained from this question had to be about $0.6$ bits, to get from the previous $1.6$ bits of entropy down to $1$.
p = get_prob_of_attr('a', current_characters)
information(p if a else 1-p)
> 0.584 # bits
Yes, that's it. So, $1$ bit is just one yes/no question away from $0$: this is the last question! By the way, with this little observation, you just computed $-\log_2(2/3)$ in your head, congrats.
current_characters = [c for c in current_characters if 'a' in c]
> [['d', 'm', 'a', 'v', 'h'],
['m', 'b', 'a', 'p', 'i']]
I just have to pick a letter that one of the two characters has and the other doesn't, gaining $1$ bit of information, and I will win.
In fact, any of the following letters wins (each is worth $1$ bit of information).
[l for l in current_characters[0] if l not in current_characters[1]]
> ['d', 'v', 'h']
The chosen one was ['d', 'm', 'a', 'v', 'h'].
I hope you now better understand how intuitive it is to work with entropy and information instead of probabilities. They are indeed two different ways to describe the same events, but I personally prefer working with additions and subtractions rather than with products of probabilities. Also, I have a feel for what $8$ bits of information represent, but not for the corresponding probability, $0.00390625$ (even though it describes the same thing).
Playing a smart game
We are done with randomly guessing attributes, we want to be efficient. How do we know which attribute to ask about in order to gain as much information as possible, every time?
The problem is twofold:
- Getting the maximum information: one idea could be to ask whether the character has a very rare attribute, which would carry a lot of information if the answer is yes. But this is risky: if the character does have this attribute you gain a lot of information, but otherwise very little.
- Every time: we want a strategy that works on average, asking a useful question at every step given what we know.
We need some sort of average of information... Entropy! If we compute the entropy of each attribute with respect to the remaining characters, we get the average information that asking about this attribute is worth.
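For a single yes/no question about an attribute that is present with probability $p$ among the remaining characters, this average is the binary entropy: $$H(p) = -p \log_2(p) - (1-p)\log_2(1-p)$$ which is maximal, at exactly $1$ bit, when $p = 1/2$: the best questions are the ones that split the remaining characters in half.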
Let's actually implement it:
def get_entropies(current_characters):
    # average information (i.e. entropy) of asking about each attribute,
    # estimated over the characters that are still possible
    entropies = {}
    for a in attr:
        p = get_prob_of_attr(a, current_characters)
        informations = [information(p if a in c else 1 - p) for c in current_characters]
        entropies[a] = sum(informations) / len(informations)
    # most informative attributes first
    return sorted(entropies.items(), key=lambda x: x[1])[::-1]
> [('g', 1.0),
('b', 1.0),
('c', 0.9927744539878083),
('i', 0.9709505944546686),
('d', 0.9709505944546686),
...
('x', 0.0),
('t', 0.0),
('q', 0.0)]
For our given set of characters, it seems like we should always start by asking about $g$ or $b$. And there is no point in asking about $x$, $t$ or $q$ (because no character possesses them as attributes).
Let's play the game, with the same chosen character, but with another strategy: always pick the question with the highest entropy. I will write a loop to implement this.
current_characters = characters
while len(current_characters) > 1:
    entropies = get_entropies(current_characters)
    q = entropies[0][0]                                  # most informative attribute
    print(f'Do you have {q} ?')
    a = get_answer(q, character_to_find)
    p = get_prob_of_attr(q, current_characters)
    i = information(p if a else 1 - p)                   # bits gained from this answer
    if a: current_characters = [c for c in current_characters if q in c]
    else: current_characters = [c for c in current_characters if q not in c]
    h = entropy(np.ones(len(current_characters)) / len(current_characters))
    print(f'We got {i} bit(s) of information.')
    print(f'Entropy is down to {h:.2f} bit(s).')
print(f'Your word was {current_characters[0]} !')
And now, fingers crossed, we run it.
> Do you have g ?
No
We got 1.0 bit(s) of information.
Entropy is down to 3.32 bit(s).
> Do you have b ?
No
We got 1.0 bit(s) of information.
Entropy is down to 2.32 bit(s).
> Do you have j ?
No
We got 0.7369655941662062 bit(s) of information.
Entropy is down to 1.58 bit(s).
> Do you have d ?
Yes
We got 0.5849625007211563 bit(s) of information.
Entropy is down to 1.00 bit(s).
> Do you have v ?
Yes
We got 1.0 bit(s) of information.
Entropy is down to 0.00 bit(s).
Your word was ['d', 'm', 'a', 'v', 'h'] !
This is the correct character, found in $5$ questions! That is actually more than in our first game where, I remind you, we gained $2.32$ bits of information by sheer luck on the first question. Here, we are not counting on lucky, high-payoff questions: we have the best strategy on average.
It is fun to check the average number of questions it takes to get the right answer.
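I won't reproduce the exact experiment here, but a minimal sketch of such a simulation could look like this, reusing the helpers above (play_smart is just an illustrative name, and the guard on zero entropy avoids looping forever if two sampled characters happen to be indistinguishable):
def play_smart(characters, target):
    # number of questions the max-entropy strategy needs to isolate the target
    current, n_questions = characters, 0
    while len(current) > 1:
        q, h = get_entropies(current)[0]
        if h == 0:                       # remaining characters are indistinguishable
            break
        n_questions += 1
        if q in target: current = [c for c in current if q in c]
        else:           current = [c for c in current if q not in c]
    return n_questions

counts = Counter()
for _ in range(1000):
    chars = [get_character(prob, attr) for _ in range(20)]
    counts[play_smart(chars, chars[np.random.randint(20)])] += 1
print(counts)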
Over $1000$ games with different characters, our strategy wins in either $4$ or $5$ questions. Every time. That's neat!
In comparison, if we play randomly, this is what happens: yeah, this is bad, but nobody plays like this (I hope!)
Thank you for reading this far, and I hope you liked this little run-through of information theory and how to play games efficiently!
Writing this blog post was triggered by 3Blue1Brown's video about solving the Wordle game efficiently. I really liked his approach to the subject and wanted to dive deep into the method myself; it is done now!