2016-02-20

Analysing the blue-red hat problem in the face of user error

Everyone knows computers are getting smarter - unless they're being programmed by a major corporation for a government contract - but there has recently been another leap in the level of smart. DeepMind (now part of Google) has built an AI that has successfully deduced the optimal solution to the hat problem:

100 prisoners stand in line, one in front of the other. Each wears either a red hat or a blue hat. Every prisoner can see the hats of the people in front – but not their own hat, or the hats worn by anyone behind. Starting at the back of the line, a prison guard asks each prisoner the colour of their hat. If they answer correctly, they will be pardoned [and if not, executed]. Before lining up, the prisoners confer on a strategy to help them. What should they do?
Tricky, n'est ce pas?

The obvious part first: the first prisoner to answer, whom we'll designate number 1, has no information about his hat colour. Assuming blue and red hats are assigned with equal probability, he can answer either "red" or "blue" with a 50% chance of success and 50% chance of getting executed; he has no better strategy for self-survival. What about the other prisoners?

Applying information theory, our system has 100 binary bits of state - 100 people, each with 1 bit of state relating to whether their hat is blue or not. We generate 99 bits of knowledge about that state as the hat-wearers give answers. So the maximum we can expect to discover reliably is 99/100 hat values. How can we get close to this?

If everyone just guesses their own hat colour randomly, or everyone says "blue", or everyone says "red", then on average 50% of people survive. How to do better? We need to communicate information to people further down their line about their hat colour.

Let's get the first 50 people in line to tell the next 50 people in line about their hat colour. Person 1 announces the hat colour of person 51, person 2 of person 52 and so on. So the last 50 people are guaranteed to survive because they have been told their hat colour. The first 50 people each have a 50-50 chance of survival because the colour they "guess" has no necessary relation to the colour of their hat. On average 25 of them survive, giving an average survival of 75% of people.

The DeepMind algorithm relies on an insight based on the concept of parity: an 0/1 value encapsulating critical state, in this case the number of blue hats seen and guessed, modulo 2. The first user counts the number of blue hats seen and says "blue" if that number is even, and "red" if odd. He still has a 50-50 chance of survival because he has no information about his hat. The second user counts the number of blue hats. If even, and the first person said "blue", then he and the first person both saw the same number of blue hats - so his hat must be red. If even, and the first person said "red", his hat must be blue because it changed the number of blue hats seen between the first person and him. Similar reasoning on the odd case means that he can announce his hat colour with full confidence.

What about person 3? He has to listen to person 1 and person 2, and observe the hat colours in front of him, to deduce whether his hat is blue; his strategy, which works for all others after him too, is to add the parity values (0 for blue, 1 for red) for heard and seen hats modulo 2, and if 0 then announce "blue", if 1 then announce "red". Follow this down the line, and persons 2 through 100 are guaranteed survival while person 1 has a 50-50 chance, for an average 99.5% survival rate.

Of course, this is a fairly complicated algorithm. What if someone mis-counts - what effect does it have? We don't want a fragile algorithm where one person's error can mess up everyone else's calculations, such as with "Chinese whispers." Luckily, a bit of thought (confirmed by experiment) shows us that both the future-casting and parity approaches are resilient to individual error. For future-casting, if one of the first 50 people makes an error then it makes no difference to their chance of survival but their correspondent in the second half of the line is doomed. If one of the second 50 people makes an error then they are doomed unless their correspondent also makes a mistake - generally unlikely, a 10% chance. So if 10% of users make errors then the approximate number of survivors is (75 - 10) + 1, i.e. 66%.

Surprisingly, the parity approach is also robust. It turns out that if user N makes a mistake then they doom themselves, and also doom user N+1 who relies on user N's calculation. But because both user N and N+1 make erroneous guesses, this brings the parity value back in line for user N+2, whose guess will be correct (absent any other errors). So the approximate number of survivors given a 10% error rate is 99.5 - 10*2 = 79.5%

Here's Python code to test the various algorithms: save it as "hats.py" and run it (e.g. "chmod 0755 hats.py ; ./hats.py" on OS X or Linux). It runs 10 trials of 100 people and reports the average number of survivors, based on a 10% error rate in hat wearers following their strategy. Default strategy is the parity algorithm.

#!/usr/bin/python

import random

person_count = 100
half_person_count = person_count / 2
# Hat choices
hat_choices = ['r','b']
hat_opposite = {'b':'r', 'r':'b'}
# 10% error rate in guesses
error_rate = 0.1

def guess_constant(heard_guesses, seen_hats):
    return 'b'

def guess_random(heard_guesses, seen_hats):
    return random.choice(hat_choices)

def guess_future(heard_guesses, seen_hats):
    """ First half of list calls out hat of correspondent in second half of list """
    full_list = heard_guesses + ['x'] + seen_hats
    my_index = len(heard_guesses)
    if my_index < half_person_count:
        # Call out the hat of the person in the second half of the list, hope same as mine
        return full_list[my_index+half_person_count]
    else:
        # Remember what was called out by my corresponding person in first half of list
        return heard_guesses[my_index - half_person_count]

def guess_parity(heard_guesses, seen_hats):
    """ Measure heard and seen parity of blue hats, call out blue for even, red for odd."""
    heard_blue_count = len([g for g in heard_guesses if g == 'b'])
    seen_blue_count = len([s for s in seen_hats if s == 'b'])
    if (heard_blue_count + seen_blue_count) % 2 == 0:
        return 'b'
    else:
        return 'r'

def run_test(guess_fun):
    hat_list = [ random.choice(hat_choices) for i in range(0, person_count) ]
    print "Actual: " + "".join(hat_list)
    answer_list = []
    score_list = []
    error_list = []
    correct = 0
    for i in range(0, person_count):
        guess = guess_fun(answer_list, hat_list[i+1:])
        if random.random() < error_rate:
            guess = hat_opposite[guess]
            error_list.append('X')
        else:
            error_list.append('-')
        answer_list.append(guess)
        if guess == hat_list[i]:
            correct += 1
            score_list.append('-')
        else:
            score_list.append('X')
    print "Called: " + "".join(answer_list)
    print "Score:  " + "".join(score_list)
    print "Errors: " + "".join(error_list)
    print "%d correct" % correct
    return correct

if __name__ == "__main__":
    trial_count = 10
    correct_total = 0
    for i in range(0, trial_count):
        print "\nTrial %d" % (i+1)
        correct_total += run_test(guess_parity)
    print "\nAverage correct: %d" % (correct_total / trial_count)
You can change the "guess_parity" value in the run_test() invocation on the penultimate line to "guess_future" for the "warn the second half" strategy, or "guess_random" for the random choice.

This is a lousy problem for use in software engineering job interviews, by the way. It's a famous problem, so candidates who have heard it are at a major advantage to those who haven't. It relies on a key and non-obvious insight. A candidate who hasn't encountered the problem before and solves it gives a very strong "hire" signal, but a candidate who fails to find the optimal solution could still be a valid hire. The least worst way to assess candidates based on this problem is whether they can write code to evaluate these algorithms, once the algorithms are described to them.

No comments:

Post a Comment

All comments are subject to retrospective moderation. I will only reject spam, gratuitous abuse, and wilful stupidity.