As has been widely reported throughout both the poker and tech media, the Brains vs. Artificial Intelligence challenge wrapped up this weekend. It was an apparent human victory, with the four-man team of Douglas Polk, Dong Kim, Bjorn Li and Jason Les coming out ahead of the AI “Claudico” to the tune of $732,713 in play money.
No sooner had the results been announced, however, than the team of computer scientists behind Claudico declared the margin of victory to be a “statistical tie.” The general feeling in the poker world is still that the humans won.
Who’s right? Both, and neither. What’s actually going on here is the standard clash of cultures between academia and other walks of life.
What is a statistical tie?
At the heart of the disagreement is a fundamental difference in expectations. Scientists are obsessed with rigorous precision, while gamblers embrace uncertainty and small edges.
In terms of the sort of calculations we make, poker players and other kinds of skill-based gamblers are used to thinking of 55% as being good enough and 60% as being great. As a gambler, if you wait until you’re nearly certain of something to pull the trigger, you will likely have missed the window of opportunity.
Conversely, in science, it’s important to wait until you’re sure of things before making any claims: in the world of peer review, if you have a tendency to go off half-cocked, your reputation will quickly suffer.
What the researchers mean when they call the results a “statistical tie” is this: Assuming that Claudico was in fact equal to the human players, a result at least this lopsided still would have come about by chance some percentage of the time. If that percentage is greater than the threshold (the significance level) that the researchers set out in advance, then they can’t call the results meaningful. A standard threshold is 5%, and I’ve confirmed with Carnegie Mellon that this is what the researchers were shooting for.
Thus, when the researchers claim a statistical tie, what they really mean is that they can’t say with more than 95% confidence that the humans were actually better. The rest of us probably don’t need to feel quite that confident in order to be happy calling it a human win, and that’s all that this boils down to.
How much did the humans win by?
Okay, so “probable human win” vs. “statistical tie” comes down to semantics and a difference in expectations. It’s not really productive for anyone to argue about that, so let’s instead take a look at the numerical results to see how the humans actually did.
The blinds for the competition were $50/$100, so the $732,713 margin of victory represents 7327 big blinds. Each player played 20,000 hands against the computer, for a total of 80,000. Their collective average win rate was therefore about 9.16 bb/100, using the standard measure of cash game performance. (Note that we’re using lower-case “bb” to denote big blinds, to avoid confusion with “BB” for big bets, a holdover from the days when Limit games were more popular.)
Although low-stakes win rates can run into the double digits, 9.16 would constitute an extremely good long-term win rate for high-stakes online players like Polk et al. Assuming Claudico was remotely in their league to begin with, you wouldn’t expect them to do much better than this, and you certainly wouldn’t have expected Claudico to beat them by more than that. I’m not sure how 80,000 hands came to be the negotiated length of the match, but if that result is not enough to count as a win over that volume, it seems like the experiment was all but certain to produce a statistical tie from the get-go.
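For anyone who wants to check the win-rate arithmetic above, here is a minimal sketch in Python. The variable names are my own; the inputs are just the figures quoted in the text (the $100 big blind, the $732,713 margin and four players at 20,000 hands each).

```python
# Figures quoted in the article.
margin_dollars = 732_713      # humans' collective margin of victory
big_blind = 100               # blinds were $50/$100
hands_per_player = 20_000
players = 4

# Convert the dollar margin into big blinds.
margin_bb = margin_dollars / big_blind

# Total hands played across all four humans.
total_hands = hands_per_player * players

# Standard cash-game measure: big blinds won per 100 hands.
win_rate_bb_per_100 = margin_bb / total_hands * 100

print(f"Margin: {margin_bb:.0f} bb over {total_hands} hands")
print(f"Win rate: {win_rate_bb_per_100:.2f} bb/100")
```

The result lands a little above 9 bb/100, in line with the figure used in the rest of this piece.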
What’s the variance like?
The next question is what sort of variance we expect over 80,000 hands. According to a blog post by Noah Stephens-Davidowitz, a typical win rate and standard deviation for a 6-max cash game player would be 5 +/- 90 bb/100. That means that the player will average 5 bb/100 in the long term, and will fall somewhere between -85 and +95 in about two-thirds of 100-hand samples.
He then simulates the results of such a player over a 50,000 hand sample, and finds that the player will turn a profit 89% of the time. In other words, it’s only 11% likely that he will be more than 5 bb/100 below his average over that sample. Eyeballing the graph he provides, it looks like the odds of being 9 bb/100 below average would be more like 1%.
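Those percentages can be roughly cross-checked without running a simulation. The sketch below is not Noah’s method. It is a normal approximation of my own, and it assumes that each 100-hand block is an independent draw with mean 5 and standard deviation 90 bb, so that the error on the average shrinks with the square root of the number of blocks.

```python
from math import sqrt
from statistics import NormalDist

# Figures quoted from the blog post.
win_rate = 5.0       # long-term average, bb/100
sd_per_100 = 90.0    # standard deviation of a single 100-hand block
hands = 50_000

# Assumption: 100-hand blocks are independent, so the standard error of
# the observed win rate is the per-block SD divided by sqrt(block count).
blocks = hands / 100
standard_error = sd_per_100 / sqrt(blocks)

# Approximate distribution of the observed win rate over the sample.
observed = NormalDist(mu=win_rate, sigma=standard_error)

# Chance of finishing the sample in profit (observed win rate above zero).
p_profit = 1 - observed.cdf(0.0)

# Chance of running 9 bb/100 or more below the true average.
p_nine_below = observed.cdf(win_rate - 9.0)

print(f"Standard error: {standard_error:.2f} bb/100")
print(f"P(profit) = {p_profit:.1%}")
print(f"P(9 bb/100 or more below average) = {p_nine_below:.1%}")
```

Under these assumptions the approximation comes out close to both the 89% figure and my eyeballed 1%.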
Of course, heads-up play is considerably higher-variance than 6-max, so using Noah’s results isn’t really fair to Claudico. On the other hand, the researchers took special measures to try to eliminate at least one form of variance in the experiment by splitting the humans into two groups and dealing each group the cards that were dealt to Claudico against the other group. This means that neither humans nor AI could get hot- or cold-decked in terms of hole cards, but it’s hard to know how much this actually reduces the overall variance.
Still, if we offset the increased variance of heads-up play against the slightly larger sample size (80,000 vs. 50,000) and the duplicate deals, then it feels like the odds of this result being down to chance are likely to be near the threshold of statistical significance. They might be 10%, say, but there’s no way we’re talking about a 20% or 30% likelihood of the humans winning by fluke.
The variance between players
Since my knowledge of statistics is not nearly good enough to approach this theoretically, the best I can do to get a handle on the actual variance is to look at the variation between the players themselves. Obviously, the four humans don’t all play the same way or with exactly the same skill, but if we assume that their relative skill levels are all closer to one another than to Claudico, looking at the variance between their individual performances tells us at least a little bit about what we might expect about their variance collectively.
Li, Polk and Kim won 5290, 2137 and 705 bb off of Claudico respectively, while Les lost 805 bb to the bot. Their average was 1832, giving them deviations of +3458, +305, -1127 and -2637. The standard deviation based on those results is 2251. Only Li’s winnings are larger than that, so it does seem like their individual results should be taken with a grain of salt.
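The per-player figures can be reproduced with Python’s standard library. This is only a sketch: I’m using the population standard deviation (dividing by 4), which is what gives 2251. With only four data points, the sample version (dividing by 3) is arguably the more conventional choice and comes out larger, at roughly 2600.

```python
from statistics import mean, pstdev, stdev

# Individual results against Claudico, in big blinds (from the article).
results_bb = {"Li": 5290, "Polk": 2137, "Kim": 705, "Les": -805}
values = list(results_bb.values())

# Average result per player and each player's deviation from it.
average = mean(values)
deviations = {name: result - average for name, result in results_bb.items()}

# Population standard deviation (divide by n), matching the article's figure.
population_sd = pstdev(values)

# Sample standard deviation (divide by n - 1), shown for comparison.
sample_sd = stdev(values)

print(f"Average: {average:.0f} bb")
for name, dev in deviations.items():
    print(f"{name}: {dev:+.0f}")
print(f"Population SD: {population_sd:.0f}, sample SD: {sample_sd:.0f}")
```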
On the other hand, the way statistics work is that when we combine samples, the variance doesn’t increase linearly. For things like coin-flips and heads-up results, error tends to grow as the square root of the sample size. So putting these four players together and quadrupling the sample should approximately double the error to around 4500. The actual margin of victory was 1.63 times this amount, which is just over a 5% shot to happen by chance, so from this perspective too, it seems that the humans fell just short of meeting the researchers’ expectations for statistical confidence.
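Here is the same back-of-the-envelope calculation as a short Python sketch. It assumes the combined result is roughly normally distributed and uses a one-sided test (asking only how often an equal opponent would lose by at least this much). Both are my own simplifications, and a standard deviation estimated from four players is crude, so this should be read as a sanity check and not as the researchers’ analysis.

```python
from math import sqrt
from statistics import NormalDist

# Figures from the article.
per_player_sd = 2251      # spread of individual results, in bb
num_players = 4
margin_bb = 7327          # humans' combined margin of victory, in bb

# Error grows with the square root of the sample size, so combining
# four players' samples roughly doubles the per-player error.
combined_error = per_player_sd * sqrt(num_players)

# How many combined standard errors the actual margin represents.
z_score = margin_bb / combined_error

# One-sided probability of a margin at least this large arising by
# chance if the humans and Claudico were truly equal (normal assumption).
p_one_sided = 1 - NormalDist().cdf(z_score)

print(f"Combined error: {combined_error:.0f} bb")
print(f"z = {z_score:.2f}")
print(f"One-sided p-value: {p_one_sided:.1%}")
```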
So, how should we call it?
According to Byron Spice, the media contact from Carnegie Mellon, the researchers themselves say that the results of the experiment fell “just short” of 95% confidence. This confirms my own back-of-the-envelope guesses above, that the so-called “statistical tie” is only just barely so. I don’t know whether we’re talking about 94.5% confidence, or 93%, or what, but surely above 90%. One can’t blame the university for wanting to focus on the “statistical tie” aspect, and not on the exact odds that their bot just lost fair and square.
As for the rest of us, well, we’re gamblers and not academics. The bottom line is that if there were going to be a rematch next week (with no modifications to Claudico) and someone asked you to lay them 10-1 odds to bet on the machine, you would definitely want to take the bet. If they asked for 20-1, well then you’d probably want to pass. Put that way, is it fair to say that the humans are better? I’ll leave the answer to that one up to you.
Alex Weldon (@benefactumgames) is a freelance writer, game designer and semipro poker player from Montreal, Quebec, Canada.