Yesterday, all of us in the Good Judgment forecasting team received feedback on round 2, which has just finished. (If you have no idea what I’m talking about, please read part 1 of this series.) Time for me to reflect on the past months.
What kinds of questions were asked?
Early in the season, we were told that it would be more difficult than in the first round. This turned out to be true for three reasons. First, the admins simply asked more questions, making it harder to keep up with the tournament. Second, the share of rather obscure items was higher. Questions like the one about “the removal of Traian Basescu from the office of President of Romania in a referendum vote before 1 August 2012” did not immediately ring a bell with me. Third, and most importantly, the admins introduced conditional and ordered items.
Conditional items looked roughly like this: “Will Israel invade Gaza?”, (a) “if a Hamas rocket reaches Jerusalem”, (b) “if no rocket reaches Jerusalem”. While it is fairly obvious how the condition is thought to affect the probabilities in this example, other cases were less straightforward. In any case, this type of question further complicated the process: it offers yet another opportunity to instinctively overstate probabilities, or to construct illogical connections between conditions. Some items even had two sets of intertwined conditions.
For ordered (or other multiple-choice) items, we had to split a total of 100 percent across three to five answers that covered all logically possible outcomes. This was particularly difficult when the answers related to relatively short periods of time, meaning that you would be punished heavily for even small misjudgments of timing. If you thought, for example, that an event was extremely likely to occur within the next 60 days, how much should you bet on the option “61 to 90 days” just to be sure? Another complication: You need to pay attention to the residual categories used to cover “everything else”. In some cases, spending time to decide between options 1 to 4 might be wise, because the residual category 5 really does seem less likely.
When asked “Who will be the next pope?”, however, looking up the odds given to various front-runners turned out to be a waste of time. There was little information to go around, and the ultimate winner had not been on the list of candidates – so the best results were achieved by betting on the residual answer, “someone else”. The lesson: Deciding whether anyone could reliably identify the four most likely candidates in the first place was far more important than weighing their chances against each other.
How did we do compared to year 1?
As I’ve said in my previous post on this topic, the share of active participants in my team was smaller than last season. Not surprisingly, the two of us who answered the most questions and exchanged the most ideas did best. Our Brier scores – a measure of forecasting accuracy ranging from 0 (best) to 2 (worst) for multiple-choice questions – ended up being around 0.3.
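For the curious, the multiple-choice Brier score is just the sum of squared differences between your forecast probabilities and the 0/1 outcome vector, which is how it spans 0 to 2. Here is a minimal sketch in Python; the time-bin item and the specific allocations are hypothetical, and the tournament’s exact scoring variant for ordered items may differ from this plain version:

```python
def brier_score(forecast, outcome):
    """Multiple-choice Brier score: sum of squared differences between
    the forecast probabilities and the 0/1 outcome vector.
    0 is a perfect forecast; 2 is the worst possible score."""
    return sum((f - o) ** 2 for f, o in zip(forecast, outcome))

# All probability on the correct answer scores 0; all on a wrong answer scores 2.
print(brier_score([1.0, 0.0, 0.0], [1, 0, 0]))  # 0.0
print(brier_score([0.0, 1.0, 0.0], [1, 0, 0]))  # 2.0

# Hypothetical ordered item with bins "within 60 days", "61-90 days", "91+ days".
confident = [0.95, 0.05, 0.00]
hedged = [0.80, 0.15, 0.05]

# If the event really does occur within 60 days, hedging costs little...
print(round(brier_score(confident, [1, 0, 0]), 3))  # 0.005
print(round(brier_score(hedged, [1, 0, 0]), 3))     # 0.065

# ...but if it slips into the 61-90 day bin, the hedge pays off.
print(round(brier_score(confident, [0, 1, 0]), 3))  # 1.805
print(round(brier_score(hedged, [0, 1, 0]), 3))     # 1.365
```

This makes the “how much should you bet on the adjacent bin?” question from the ordered items above concrete: a small safety margin on neighboring time bins costs a few thousandths when you are right, but saves several tenths when the event slips past a boundary.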
Personally, I scored anywhere between a perfect 0.000 and a horrible 1.056 on individual questions. Questions regarding outbreaks of violence and the removal of dictators all turned out in favor of the status quo, which helped my results. On the other hand, the worst surprises for me were: (1) the last-minute suspension of the IMF loan to Egypt, (2) the weird back and forth with PM Ashraf in Pakistan, and (3) the resignation of Mario Monti.
In other cases, my score was severely lowered because I failed to enter a forecast early on, which meant that I was assigned the team score for that period of time. (If you have some people in your group who seem to assign random numbers to all answers early on, this can hurt you a lot.) Overall, I did slightly better than last year.
Finally, how did we do as a group? Well, the best team in our branch of the tournament (I can only see the team scores for 20 teams) reached a score of 0.278. Our team was ranked third, with a score of 0.321 (which is taken from the median active forecaster, as far as I know). More feedback is due to follow soon, and I will report back here.
Meanwhile, I’ll have to work on that reading list by Jay Ulfelder …