Superforecasting

August 28, 2022

Philip Tetlock is the world expert on the topic of making forecasts.  Tetlock and Dan Gardner have written an excellent book, Superforecasting: The Art and Science of Prediction, which summarizes the mental habits of people who have consistently been able to make useful predictions about the future.  Everyone would benefit if we could predict the future better.  This book teaches us how we can do this if we’re willing to put in the hard work required.

Here’s an outline:

    • An Optimistic Skeptic
    • Illusions of Knowledge
    • Keeping Score
    • Superforecasters
    • Supersmart?
    • Superquants?
    • Supernewsjunkies?
    • Perpetual Beta
    • Superteams
    • The Leader’s Dilemma
    • Are They Really So Super?
    • What’s Next?
    • Ten Commandments for Aspiring Superforecasters
Illustration by Maxim Popov.

 

AN OPTIMISTIC SKEPTIC

Tetlock writes that we are all forecasters.  Whenever we think about making an important decision—changing jobs, getting married, buying a home, making an investment, launching a product, or retiring—we make forecasts about the future.  Tetlock:

Often we do our own forecasting.  But when big events happen—markets crash, wars loom, leaders tremble—we turn to the experts, those in the know.  We look to people like Tom Friedman.

If you are a White House staffer, you might find him in the Oval Office with the president of the United States, talking about the Middle East.  If you are a Fortune 500 CEO, you might spot him in Davos, chatting in the lounge with hedge fund billionaires and Saudi princes.  And if you don’t frequent the White House or swanky Swiss hotels, you can read his New York Times columns and bestselling books that tell you what’s happening now, why, and what will happen next.  Millions do.

Tetlock compares Friedman to Bill Flack, a fifty-five-year-old who recently retired from the Department of Agriculture and now spends some of his time forecasting global events.

Bill has answered roughly three hundred questions like “Will Russia officially annex additional Ukrainian territory in the next three months?” and “In the next year, will any country withdraw from the eurozone?”  They are questions that matter.  And they’re difficult.  Corporations, banks, embassies, and intelligence agencies struggle to answer such questions all the time.  “Will North Korea detonate a nuclear device before the end of this year?”  “How many additional countries will report cases of the Ebola virus in the next eight months?”  “Will India or Brazil become a permanent member of the UN Security Council in the next two years?”  Some of the questions are downright obscure, at least for most of us.  “Will NATO invite new countries to join the Membership Action Plan (MAP) in the next nine months?”  “Will the Kurdistan Regional Government hold a referendum on national independence this year?”  “If a non-Chinese telecommunications firm wins a contract to provide Internet services in the Shanghai Free Trade Zone in the next two years, will Chinese citizens have access to Facebook and/or Twitter?”  When Bill first sees one of these questions, he may have no clue how to answer it.  “What on earth is the Shanghai Free Trade Zone?” he may think.  But he does his homework.  He gathers facts, balances clashing arguments, and settles on an answer.

Tetlock continues:

No one bases decisions on Bill Flack’s forecasts, or asks Bill to share his thoughts on CNN.  He has never been invited to Davos to sit on a panel with Tom Friedman.  And that’s unfortunate.  Because Bill Flack is a remarkable forecaster.  We know that because each one of Bill’s predictions has been dated, recorded, and assessed for accuracy by independent scientific observers.  His track record is excellent.

Bill is not alone.  There are thousands of others answering the same questions.  All are volunteers.  Most aren’t as good as Bill, but about 2% are.  They include engineers and lawyers, artists and scientists, Wall Streeters and Main Streeters, professors and students… I call them superforecasters because that is what they are.  Reliable evidence proves it.  Explaining why they’re so good, and how others can learn to do what they do, is my goal in this book.

Tetlock points out that it would be interesting to compare superforecasters to an expert like Tom Friedman.  However, Friedman’s track record has never been rigorously tested.  Of course, there are endless opinions about Friedman’s track record.  Tetlock:

Every day, the news media deliver forecasts without reporting, or even asking, how good the forecasters who made the forecasts really are.  Every day, corporations and governments pay for forecasts that may be prescient or worthless or something in between.  And every day, all of us—leaders of nations, corporate executives, investors, and voters—make critical decisions on the basis of forecasts whose quality is unknown.  Baseball managers wouldn’t dream of getting out the checkbook to hire a player without consulting performance statistics… And yet when it comes to the forecasters who help us make decisions that matter far more than any baseball game, we’re content to be ignorant.

That said, forecasting is a skill.  This book, says Tetlock, will show you how.

Prior to his work with superforecasters, Tetlock conducted a 20-year research project in which close to 300 experts made predictions about the economy, stocks, elections, wars, and other issues.  The experts made roughly 28,000 predictions.  On the whole, the experts were no better than chance.  The average expert was no better than a dart-throwing chimpanzee.  Tetlock says he doesn’t mind the joke about the dart-throwing chimpanzee because it makes a valid point:

Open any newspaper, watch any TV news show, and you find experts who forecast what’s coming.  Some are cautious.  More are bold and confident.  A handful claim to be Olympian visionaries able to see decades into the future.  With few exceptions, they are not in front of the cameras because they possess any proven skill at forecasting.  Accuracy is seldom even mentioned.  Old forecasts are like old news—soon forgotten—and pundits are almost never asked to reconcile what they said with what actually happened.  The one undeniable talent that talking heads have is their skill at telling a compelling story with conviction, and that is enough.  Many have become wealthy peddling forecasting of untested value to corporate executives, government officials, and ordinary people who would never think of swallowing medicine of unknown efficacy and safety but who routinely pay for forecasts that are as dubious as elixirs sold from the back of a wagon.

Tetlock is optimistic about the ability of people to learn to be superforecasters.  But he’s also a “skeptic” when it comes to how precisely and how far into the future people can predict.  Tetlock explains:

In a world where a butterfly in Brazil can make the difference between just another sunny day in Texas and a tornado tearing through a town, it’s misguided to think anyone can see very far into the future.

Tetlock on why he’s optimistic:

We know that in so much of what people want to predict—politics, economics, finance, business, technology, daily life—predictability exists, to some degree, in some circumstances.  But there is so much else we do not know.  For scientists, not knowing is exciting.  It’s an opportunity to discover; the more that is unknown, the greater the opportunity.  Thanks to the frankly quite amazing lack of rigor in so many forecasting domains, this opportunity is huge.  And to seize it, all we have to do is set a clear goal—accuracy!—and get serious about measuring.

Tetlock and his research (and life) partner Barbara Mellers launched the Good Judgment Project (GJP) in 2011.  Tetlock:

Cumulatively, more than twenty thousand intellectually curious laypeople tried to figure out if protests in Russia would spread, the price of gold would plummet, the Nikkei would close above 9,500, war would erupt on the Korean peninsula, and many other questions about complex, challenging global issues.  By varying the experimental conditions, we could gauge which factors improved foresight, by how much, over which time frames, and how good forecasts could become if best practices were layered on each other.

The GJP was part of a much larger research effort sponsored by the Intelligence Advanced Research Projects Activity (IARPA).

IARPA is an agency within the intelligence community that reports to the director of National Intelligence and its job is to support daring research that promises to make American intelligence better at what it does.  And a big part of what American intelligence does is forecast global political and economic trends.

(IARPA logo via Wikimedia Commons)

Tetlock continues:

…IARPA created a forecasting tournament in which five scientific teams led by top researchers in the field would compete to generate accurate forecasts on the sorts of tough questions intelligence analysts deal with every day.  The Good Judgment Project was one of those five teams… By requiring teams to forecast the same questions at the same time, the tournament created a level playing field—and a rich treasure trove of data about what works, how well, and when.  Over four years, IARPA posed nearly five hundred questions about world affairs… In all, we gathered over one million individual judgments about the future.

In year 1, GJP beat the official control group by 60%.  In year 2, we beat the control group by 78%.  GJP also beat its university-affiliated competitors, including the University of Michigan and MIT, by hefty margins, from 30% to 70%, and even outperformed professional intelligence analysts with access to classified data.  After two years, GJP was doing so much better than its academic competitors that IARPA dropped the other teams.

Tetlock learned two key things: some people clearly can predict certain events, and the habits of thought of these forecasters can be learned and cultivated “by any intelligent, thoughtful, determined person.”

There’s a question about whether computers can be trained to outpredict superforecasters.  Probably not for some time.  It’s more likely, says Tetlock, that superforecasters working with computers will outperform both computers alone and superforecasters alone.  Think of freestyle chess, where chess experts using computers are often stronger than either computers or human experts on their own.

 

ILLUSIONS OF KNOWLEDGE

Archie Cochrane was supposed to die from cancer.  A specialist had operated to remove a lump and, because the cancer appeared to have spread, removed the pectoralis minor as well.  He then told Cochrane that he didn’t have long to live.  However, when a pathologist later examined the excised tissue, he found that Cochrane didn’t have cancer at all.  This was fortunate, because Archie Cochrane eventually became a revered figure in medicine.  Tetlock comments:

We have all been too quick to make up our minds and too slow to change them.  And if we don’t examine how we make these mistakes, we will keep making them.  This stagnation can go on for years.  Or a lifetime.  It can even last centuries, as the long and wretched history of medicine illustrates.

(Photograph from 1898, via Wikimedia Commons)

Tetlock continues:

When George Washington fell ill in 1799, his esteemed physicians bled him relentlessly, doused him with mercury to cause diarrhea, induced vomiting, and raised blood-filled blisters by applying hot cups to the old man’s skin.  A physician in Aristotle’s Athens, or Nero’s Rome, or medieval Paris, or Elizabethan London would have nodded at much of that hideous regime.

Washington died… It’s possible that the treatments helped but not enough to overcome the disease that took Washington’s life, or that they didn’t help at all, or that the treatments even hastened Washington’s death.  It’s impossible to know which of these conclusions is true merely by observing that one outcome.  Even with many such observations, the truth can be difficult or impossible to tease out.  There are just too many factors involved, too many possible explanations, too many unknowns.  And if physicians are already inclined to think the treatments work—which they are, or they wouldn’t prescribe them—all that ambiguity is likely to be read in favor of the happy conclusion that the treatments really are effective.

Rigorous experimentation was needed.  But that was never done.  Tetlock brings up the example of Galen, the second-century physician to Roman emperors.  Galen’s writings were “the indisputable source of medical authority for more than a thousand years.”  But Galen never did any experiments.  Galen was a tad overconfident, writing: “All who drink of this treatment recover in a short time, except those whom it does not help, who all die.  It is obvious, therefore, that it fails only in incurable cases.”  Tetlock comments:

Galen was an extreme example but he is the sort of figure who pops up repeatedly in the history of medicine.  They are men (always men) of strong conviction and profound trust in their own judgment.  They embrace treatments, develop bold theories for why they work, denounce rivals as quacks and charlatans, and spread their insights with evangelical passion.

Tetlock again:

Not until the twentieth century did the idea of randomized trial experiments, careful measurement, and statistical power take hold… Randomly assigning people to one group or the other would mean whatever differences there are among them should balance out if enough people participated in the experiment.  Then we can confidently conclude that the treatment caused any differences in observed outcomes.  It isn’t perfect.  There is no perfection in our messy world.  But it beats wise men stroking their chins.

The first serious trials were attempted only after World War II.

But still the physicians and scientists who promoted the modernization of medicine routinely found that the medical establishment wasn’t interested, or was even hostile to their efforts.  “Too much that was being done in the name of health care lacked scientific validation,” Archie Cochrane complained about medicine in the 1950s and 1960s… Physicians and the institutions they controlled didn’t want to let go of the idea that their judgment alone revealed the truth, so they kept doing what they did because they had always done it that way—and they were backed up by respected authority.  They didn’t need scientific validation.  They just knew.  Cochrane despised this attitude.  He called it “the God complex.”

Tetlock describes the two systems we have in our brains: System 1 and System 2.

System 2 is the familiar realm of conscious thought.  It consists of everything we choose to focus on.  By contrast, System 1 is largely a stranger to us.  It is the realm of automatic perceptual and cognitive operations—like those you are running right now to transform the print on this page into a meaningful sentence or to hold the book while reaching for a glass and taking a sip.  We have no awareness of these rapid-fire processes but we could not function without them.  We would shut down.

The numbering of the two systems is not arbitrary.  System 1 comes first.  It is fast and constantly running in the background.  If a question is asked and you instantly know the answer, it sprang from System 1.  System 2 is charged with interrogating that answer.  Does it stand up to scrutiny?  Is it backed by evidence?  This process takes time and effort, which is why the standard routine in decision making is this: System 1 delivers an answer, and only then can System 2 get involved, starting with an examination of what System 1 decided.

Tetlock notes that in the Paleolithic world in which our brains evolved, System 1’s ability to make quick decisions helped us to survive.  A shadow in the grass was immediately assumed to be dangerous.  There was no time for System 2 to second guess.  If System 1 was functioning properly, we would already be running by the time we were consciously aware of the shadow in the grass.

Although System 2 can be trained to think rationally, mathematically, and in terms of statistics, both System 1 and System 2 naturally look first for evidence that confirms a given hypothesis.  The tendency to look for and see only confirming evidence, while disregarding potentially disconfirming evidence, is called confirmation bias.

Confirmation bias is one reason we tend to be overconfident.  Tetlock quotes Daniel Kahneman:

“It is wise to take admissions of uncertainty seriously, but declarations of high confidence mainly tell you that an individual has constructed a coherent story in his mind, not necessarily that the story is true.”

When System 1 faces a question it can’t answer instantly, it substitutes an easier question and answers that instead.  Tetlock calls this bait and switch and gives an example.  The first question is, “Should I worry about the shadow in the long grass?”  System 1 automatically looks for an easier question, like, “Can I easily recall a lion attacking someone from the long grass?”  If the answer to the easier question is “yes,” then the answer to the original question is also “yes.”  Tetlock calls our reliance on the automatic operations of System 1 the tip-of-your-nose perspective.

That said, there are areas in life where a human can develop expertise to the point where System 1 intuition can be trusted.  Examples include fire fighting and chess.  Tetlock:

It’s pattern recognition.  With training or experience, people can encode patterns deep in their memories in vast number and intricate detail—such as the estimated fifty thousand to one hundred thousand chess positions that top players have in their repertoire.  If something doesn’t fit a pattern—like a kitchen fire giving off more heat than a kitchen fire should—a competent expert senses it immediately.

Note that developing such expertise requires working in a world with valid cues.  Valid cues exist in the world of firefighting and chess, but much less so in the world of stocks, for example.  If there’s a lot of randomness and it takes a long time to get valid feedback from decisions—which is the case in the world of stocks—then it can take much longer to develop true expertise.

 

KEEPING SCORE

In order to tell whether someone is able to make forecasts with some accuracy, the forecasts must be specific enough to be measured.  Also, by measuring forecasts, it becomes possible for people to improve.

Photo by Redwall

Tetlock writes:

In 1984, with grants from the Carnegie and MacArthur foundations, the National Research Council—the research arm of the United States National Academy of Sciences—convened a distinguished panel charged with nothing less than “preventing nuclear war.”…

The panel did its due diligence.  It invited a range of experts—intelligence analysts, military officers, government officials, arms control experts, and Sovietologists—to discuss the issues.  They… were an impressive bunch.  Deeply informed, intelligent, articulate.  And pretty confident that they knew what was happening and where we were heading.

Both liberals and conservatives agreed that the next Soviet leader would be a Communist Party man, and both camps were confident of their views.

But then something unexpected happened.  The Politburo appointed Mikhail Gorbachev as the next general secretary of the Communist Party of the Soviet Union.  Tetlock:

Gorbachev changed direction swiftly and abruptly.  His policies of glasnost (openness) and perestroika (restructuring) liberalized the Soviet Union.  Gorbachev also sought to normalize relations with the United States and reverse the arms race…

Few experts saw this coming.  And yet it wasn’t long before most of those who didn’t see it coming grew convinced that they knew exactly why it had happened, and what was coming next.

Tetlock comments:

My inner cynic started to suspect that no matter what had happened the experts would have been just as adept at downplaying their predictive failures and sketching an arc of history that made it appear that they saw it coming all along.  After all, the world had just witnessed a huge surprise involving one of the most consequential matters imaginable.  If this didn’t induce a shiver of doubt, what would?  I was not questioning the intelligence or integrity of these experts, many of whom had won big scientific prizes or held high government offices… But intelligence and integrity are not enough.  The national security elites looked a lot like the renowned physicians from the prescientific era… But tip-of-your-nose delusions can fool anyone, even the best and the brightest—perhaps especially the best and the brightest.

In order for forecasts to be measurable, they have to have a specific time frame and what is expected to happen must be clearly defined.  Then there’s the issue of probability.  It’s easy to judge a forecast that something will definitely happen by some specific time.  Jonathan Schell predicted, in his influential book The Fate of the Earth, that a nuclear war would happen by 1985.  Since that didn’t happen, obviously Schell’s prediction was wrong.  But Tetlock asks, what if Schell had said that a nuclear war was “very likely”?  Tetlock:

The only way to settle this definitively would be to rerun history hundreds of times, and if civilization ends in piles of irradiated rubble in most of those reruns, we would know Schell was right.  But we can’t do that, so we can’t know.

Furthermore, phrases such as “very likely” are a problem.  Consider the experience of Sherman Kent.

In intelligence circles, Sherman Kent is a legend.  With a PhD in history, Kent left a faculty position at Yale to join the Research and Analysis Branch of the newly created Coordinator of Information (COI) in 1941.  The COI became the Office of Strategic Services (OSS).  The OSS became the Central Intelligence Agency (CIA).  By the time Kent retired from the CIA in 1967, he had profoundly shaped how the American intelligence community does what it calls intelligence analysis—the methodological examination of the information collected by spies and surveillance to figure out what it means, and what will happen next.

…forecasting is all about estimating the likelihood of something happening, which Kent and his colleagues did for many years at the Office of National Estimates—an obscure but extraordinarily influential bureau whose job was to draw on all information available to the CIA, synthesize it, and forecast anything and everything that might help top officeholders in the US government decide what to do next.

The stakes were high, and Kent weighed each word carefully.  Nonetheless, there was some confusion because the analysts used phrases like “very likely” and “a serious possibility” instead of assigning precise probabilities.  When Kent asked each team member what specific probability they had in mind, it turned out that each person had a different number in mind even though they had all agreed on “a serious possibility.”  The opinions ranged from 20 percent to 80 percent.  Kent grew concerned.

Kent was right to worry.  In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F. Kennedy turned to the military for an unbiased assessment.  The Joint Chiefs of Staff concluded that the plan had a “fair chance” of success.  The man who wrote the words “fair chance” later said he had in mind odds of 3 to 1 against success.  But Kennedy was never told precisely what “fair chance” meant and, not unreasonably, he took it to be a much more positive assessment.  Of course we can’t be sure that if the Chiefs had said “We feel it’s 3 to 1 the invasion will fail” that Kennedy would have called it off, but it surely would have made him think harder about authorizing what turned out to be an unmitigated disaster.

In order to solve this problem, Kent suggested specific probabilities be associated with specific phrases.  For instance, “almost certain” would mean a 93% chance, plus or minus 6%, while “probable” would mean a 75% chance, plus or minus 12%.
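To see how such a scheme could be made concrete, here is a minimal sketch of a phrase-to-probability mapping in Python.  Only the two bands above come from Kent’s proposal as described here; the dictionary name and the lookup helper are illustrative, not part of his actual chart.

```python
# A sketch of Kent's idea: tie vague phrases to explicit probability bands.
# Only "almost certain" and "probable" are taken from the text above; the
# names and the helper function are illustrative.
KENT_BANDS = {
    "almost certain": (0.87, 0.99),  # 93%, plus or minus 6%
    "probable":       (0.63, 0.87),  # 75%, plus or minus 12%
}

def phrase_to_band(phrase: str) -> tuple[float, float]:
    """Return the (low, high) probability range a phrase is allowed to mean."""
    return KENT_BANDS[phrase]

print(phrase_to_band("probable"))  # (0.63, 0.87)
```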

Unfortunately, Kent’s scheme was never adopted.  Some felt specific probabilities were unnatural.  Others thought it made you sound like a bookie.  (Kent’s famous response: “I’d rather be a bookie than a goddamn poet.”)  Still others objected that specific probabilities seemed too much like objective facts instead of subjective judgments, which they were.  The answer to that objection was simply to make it understood that the estimates were just guesses—opinions—and nothing more.

Tetlock observes that there’s a fundamental obstacle to adopting specific probabilities.  The obstacle has to do with what Tetlock calls the wrong-side-of-maybe fallacy.

If a meteorologist says there’s a 70% chance of rain, is she wrong if it doesn’t rain?  Not necessarily.  To judge how good the estimate was, we would have to rerun the day a hundred times.  If it rained in 70% of those reruns, the meteorologist’s 70% estimate would have been exactly right.  That is not, however, how most people judge it: if it rains, she was right, and if it doesn’t rain, she was wrong.  Tetlock writes:

The prevalence of this elementary error has a terrible consequence.  Consider that if an intelligence agency says there is a 65% chance that an event will happen, it risks being pilloried if it does not—and because the forecast itself says there is a 35% chance it will not happen, that’s a big risk.  So what’s the safe thing to do?  Stick with elastic language… If the event happens, “a fair chance” can retroactively be stretched to mean something considerably bigger than 50%—so the forecaster nailed it.  If it doesn’t happen, it can be shrunk to something much smaller than 50%—and again the forecaster nailed it.  With perverse incentives like these, it’s no wonder people prefer rubbery words over firm numbers.

Tetlock observes that it wasn’t until after the debacle regarding Saddam Hussein’s purported weapons of mass destruction, and the reforms that followed, that expressing probabilities with numbers became more accepted.  But that’s within the intelligence community.  In the broader community, especially in the media, very vague language is still common.  Tetlock:

If we are serious about measuring and improving, this won’t do.  Forecasts must have clearly defined terms and timelines.  They must use numbers.  And one more thing is essential: we must have lots of forecasts.

Consider the meteorologist again.  If she makes a new forecast each day, then over time her track record can be determined.  If she’s perfectly calibrated, it rains on 70% of the days she says there’s a 70% chance of rain, and so forth.  If it rains on only 40% of the days she says there’s a 70% chance of rain, she’s overconfident.  If it rains on 80% of the days she says there’s a 30% chance of rain, she’s underconfident.
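As a rough sketch of how calibration could be checked from a forecast record (illustrative code with made-up data, not GJP’s scoring system), one can group forecasts into bins and compare each bin’s stated probability with the observed frequency:

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group forecasts into bins and compare stated probability with observed frequency.

    forecasts: probabilities (0..1) that the event happens
    outcomes:  1/0 for whether it actually happened
    Overconfidence: observed frequency well below the stated probability.
    Underconfidence: observed frequency well above it.
    """
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[round(p, 1)].append(o)
    return {p: sum(obs) / len(obs) for p, obs in sorted(bins.items())}

# Toy record: five days with a stated 70% chance of rain, on which it rained twice.
print(calibration_table([0.7] * 5, [1, 1, 0, 0, 0]))  # {0.7: 0.4} -- overconfident
```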

Of course, when you consider something like presidential elections, which happen only every four years, it could take a long time to build up enough predictions for testing purposes.  And some events are even rarer than that.

Besides calibration, there is also “resolution.”  If someone assigns very high probabilities—80% to 100%—to things that happen, and very low probabilities—0% to 20%—to things that don’t happen, then they have good resolution.

The math behind this system was developed by Glenn W. Brier in 1950, hence results are called Brier scores.  In effect, Brier scores measure the distance between what you forecast and what actually happened… Perfection is 0.  A hedged fifty-fifty call, or random guessing in the aggregate, will produce a Brier score of 0.5.  A forecast that is wrong to the greatest possible extent—saying there is a 100% chance that something will happen and it doesn’t, every time—scores a disastrous 2.0, as far from The Truth as it is possible to get.
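For a yes/no question, the original two-category Brier score can be computed as in the small sketch below, which reproduces the reference points quoted above (0 for a perfect call, 0.5 for a hedged fifty-fifty call, 2.0 for the maximum possible error).  A forecaster’s overall score is then the average across all of her forecasts.

```python
def brier_score(forecast: float, outcome: int) -> float:
    """Two-category Brier score for a single yes/no forecast.

    forecast: stated probability (0..1) that the event happens
    outcome:  1 if it happened, 0 if it did not
    Ranges from 0 (perfect) to 2 (saying 100% and being wrong).
    """
    return (forecast - outcome) ** 2 + ((1 - forecast) - (1 - outcome)) ** 2

print(brier_score(1.0, 1))  # 0.0 -- perfect
print(brier_score(0.5, 1))  # 0.5 -- the hedged fifty-fifty call
print(brier_score(1.0, 0))  # 2.0 -- as wrong as it is possible to be
```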

Tetlock writes about the 20 years he spent gathering roughly 28,000 predictions by 284 experts.  This was the Expert Political Judgment (EPJ) project.  Tetlock:

If you didn’t know the punch line of EPJ before you read this book, you do now: the average expert was roughly as accurate as a dart-throwing chimpanzee.

Tetlock then notes that there were two statistically distinguishable groups.  One group failed to do better than random guessing.  The second group beat the chimp, but not by much.

So why did one group do better than the other?  It wasn’t whether they had PhDs or access to classified information.  Nor was it what they thought—whether they were liberals or conservatives, optimists or pessimists.  The critical factor was how they thought.

Tetlock explains:

One group tended to organize their thinking around Big Ideas, although they didn’t agree on which Big Ideas were true or false… As ideologically diverse as they were, they were united by the fact that their thinking was so ideological.  They sought to squeeze complex problems into the preferred cause-effect templates and treated what did not fit as irrelevant distractions.  Allergic to wishy-washy answers, they kept pushing their analyses to the limit (and then some), using terms like “furthermore” and “moreover” while piling up reasons why they were right and others wrong.  As a result, they were unusually confident and likelier to declare things “impossible” or “certain.”  Committed to their conclusions, they were reluctant to change their minds even when their predictions clearly failed.  They would tell us, “Just wait.”

The other group consisted of more pragmatic experts who drew on many analytical tools, with the choice of tool hinging on the particular problem they faced.  These experts gathered as much information from as many sources as they could.  When thinking, they often shifted mental gears, sprinkling their speech with transition markers such as “however,” “but,” “although,” and “on the other hand.”  They talked about possibilities and probabilities, not certainties.  And while no one likes to say “I was wrong,” these experts more readily admitted it and changed their minds.

The first group above are called hedgehogs, while the second group are called foxes.  Hedgehogs know one big thing, while foxes know many things.  Foxes beat hedgehogs on both calibration and resolution.  Moreover, hedgehogs actually did slightly worse than random guessing.

Tetlock compares the hedgehog’s Big Idea to a pair of green-tinted glasses that he never takes off.

So the hedgehog’s one Big Idea doesn’t improve his foresight.  It distorts it.  And more information doesn’t help because it’s all seen through the same tinted glasses.  It may increase the hedgehog’s confidence, but not his accuracy.  That’s a bad combination.  The predictable result?  When hedgehogs in the EPJ research made forecasts on the subjects they knew the most about—their own specialties—their accuracy declined.

Perhaps not surprisingly, the more famous an expert in EPJ was, the less accurate that expert’s forecasts were.

That’s not because editors, producers, and the public go looking for bad forecasters.  They go looking for hedgehogs, who just happen to be bad forecasters.  Animated by a Big Idea, hedgehogs tell tight, simple, clear stories that grab and hold audiences… Better still, hedgehogs are confident… The simplicity and confidence of the hedgehog impairs foresight, but it calms nerves—which is good for the careers of hedgehogs.

What about foxes in the media?

Foxes don’t fare so well in the media.  They’re less confident, less likely to say something is “certain” or “impossible,” and are likelier to settle on shades of “maybe.”  And their stories are complex, full of “howevers” and “on the other hands,” because they look at problems one way, then another, and another.  This aggregation of many perspectives is bad TV.  But it’s good forecasting.  Indeed, it’s essential.

 

SUPERFORECASTERS

Tetlock writes:

After invading in 2003, the United States turned Iraq upside down looking for WMDs but found nothing.  It was one of the worst—arguably the worst—intelligence failure in modern history.

The question, however, is not whether the Intelligence Community’s conclusion was correct, but whether it was reasonable on the basis of known information.  Was it reasonable?  Yes.  It sure looked like Saddam was hiding something.  Else why play hide-and-seek with UN arms inspectors that risks triggering an invasion and your own downfall?

It’s difficult to evaluate whether the conclusion was reasonable because, looking back, we know that it was wrong.

This particular bait and switch—replacing “Was it a good decision?” with “Did it have a good outcome?”—is both popular and pernicious.

(Illustration by Alain Lacroix)

Think of it in terms of poker.  A beginner may overestimate his odds, bet big, get lucky and win.  But that doesn’t mean the bet was a good decision.  Similarly, a good poker pro may correctly estimate her odds, bet big, get unlucky and lose.  But that doesn’t mean the bet was a bad decision.

In this case, the evidence seems to show that the conclusion of the Intelligence Community (IC) regarding Iraq’s WMDs was reasonable on the basis of known information, even though it turned out later to be wrong.

However, even though the IC’s conclusion was reasonable, it could have been better because it would have expressed less certainty had all the information been carefully considered.  In other words, the IC would have reached the same conclusion, but it wouldn’t have been associated with a probability so close to 100%.  Tetlock:

The congressional resolution authorizing the use of force might not have passed and the United States might not have invaded.  Stakes rarely get much higher than thousands of lives and trillions of dollars.

The IC didn’t even admit the possibility that they could be wrong about Iraq’s WMDs.  Normally, if there was any doubt at all, you’d have some analysts required to present an opposing view.  But nothing like that happened because the IC was so certain of its conclusion.

In 2006 the Intelligence Advanced Research Projects Activity (IARPA) was created.  Its mission is to fund cutting-edge research with the potential to make the intelligence community smarter and more effective…

In 2008 the Office of the Director of National Intelligence—which sits atop the entire network of sixteen intelligence agencies—asked the National Research Council to form a committee.  The task was to synthesize research on good judgment and help the IC put that research to good use.  By Washington’s standards, it was a bold (or rash) thing to do.  It’s not every day that a bureaucracy pays one of the world’s most respected scientific institutions to produce an objective report that might conclude that the bureaucracy is clueless.

The report was delivered two years later.

“The IC should not rely on analytical methods that violate well-documented behavioral principles or that have no evidence of efficacy beyond their intuitive appeal,” the report noted.  The IC should “rigorously test current and proposed methods under conditions that are as realistic as possible.  Such an evidence-based approach to analysis will promote the continuous learning needed to keep the IC smarter and more agile than the nation’s adversaries.”

The IC does a good job teaching its analysts the correct process for doing research and reaching judgments.  The IC also does a good job holding analysts accountable for following the correct process.  However, the IC doesn’t hold analysts accountable for the accuracy of their judgments.  There’s no systematic tracking of the accuracy of judgments.  That’s the biggest problem.  IARPA decided to do something about this.

IARPA would sponsor a massive tournament to see who could invent the best methods of making the sorts of forecasts that intelligence analysts make every day.  Will the president of Tunisia flee to a cushy exile in the next month?  Will an outbreak of H5N1 in China kill more than ten in the next six months?  Will the euro fall below $1.20 in the next twelve months?

[…]

The research teams would compete against one another and an independent control group.  Teams had to beat the combined forecast—the “wisdom of the crowd”—of the control group, and by margins we all saw as intimidating.  In the first year, IARPA wanted teams to beat that standard by 20%—and it wanted that margin of victory to grow to 50% by the fourth year.

But that was only part of IARPA’s plan.  Within each team, researchers could run Archie Cochrane-style experiments to assess what really works against internal control groups.  Researchers might think, for example, that giving forecasters a basic training exercise would improve their accuracy… Give the training to one randomly chosen group of forecasters but not another.  Keep all else constant.  Compare results.

Tetlock put together a team of 3,200 forecasters.  His team was called the Good Judgment Project (GJP).  Tetlock called IARPA’s tournament “gutsy” because of what it could reveal.

Here’s one possible revelation: Imagine you get a couple of hundred ordinary people to forecast geopolitical events.  You see how often they revise their forecasts and how accurate those forecasts prove to be and use that information to identify the forty or so who are the best.  Then you have everyone make lots more forecasts.  This time, you calculate the average forecast of the whole group—“the wisdom of the crowd”—but with extra weight given to those forty top forecasters.  Then you give the forecast a final tweak: You “extremize” it, meaning you push it closer to 100% or zero.  If the forecast is 70% you might bump it up to, say, 85%.  If it’s 30%, you might reduce it to 15%.

Now imagine that the forecasts you produce this way beat those of every other group and method available, often by large margins.  Your forecasts even beat those of professional intelligence analysts inside the government who have access to classified information—by margins that remain classified.

Think how shocking it would be to the intelligence professionals who have spent their lives forecasting geopolitical events—to be beaten by a few hundred ordinary people and some simple algorithms.

It actually happened.  What I’ve described is the method we used to win IARPA’s tournament.  There is nothing dazzlingly innovative about it.  Even the extremizing tweak is based on a pretty simple insight: When you combine the judgments of a large group of people to calculate the “wisdom of the crowd” you collect all the relevant information that is dispersed among all those people.  But none of those people has access to all that information… What would happen if every one of those people were given all the information?  They would become more confident—raising their forecasts closer to 100% or zero.  If you then calculated the “wisdom of the crowd” it too would be more extreme.  Of course it’s impossible to give every person all the relevant information—so we extremize to simulate what would happen if we could.
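Here is a minimal sketch of that recipe—take a weighted average of the crowd that favors the proven forecasters, then extremize the result.  The particular weights and the extremizing formula and strength used below are assumptions for illustration; the text does not give GJP’s actual parameters.

```python
def aggregate_and_extremize(forecasts, weights, a=2.0):
    """Weighted average of individual forecasts, pushed toward 0 or 1.

    forecasts: individual probabilities (0..1)
    weights:   relative weights (e.g., larger for the identified top forecasters)
    a:         extremizing strength; a = 1 leaves the average unchanged
    """
    avg = sum(w * p for w, p in zip(weights, forecasts)) / sum(weights)
    return avg ** a / (avg ** a + (1 - avg) ** a)  # one common extremizing transform

# Three ordinary forecasters plus one top forecaster given triple weight.
print(round(aggregate_and_extremize([0.60, 0.70, 0.65, 0.75], [1, 1, 1, 3]), 2))
# 0.84 -- a weighted average of 0.70 gets pushed up toward the mid-80s
```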

Tetlock continues:

Thanks to IARPA, we now know a few hundred ordinary people and some simple math can not only compete with professionals supported by a multibillion-dollar apparatus but also beat them.

And that’s just one of the unsettling revelations IARPA’s decision made possible.  What if the tournament discovered ordinary people who could—without the assistance of any algorithmic magic—beat the IC?  Imagine how threatening that would be.

Tetlock introduces one of the best forecasters on GJP: Doug Lorch, a retired computer programmer who “doesn’t look like a threat to anyone.”  Out of intellectual curiosity, Lorch joined GJP.

Note that in the IARPA tournament, a forecaster could update her forecast in real time.  She may have thought there was a 60% chance some event would happen by the six-month deadline, and then read something convincing her to update her forecast to 75%.  For scoring purposes, each update counts as a separate forecast.  Tetlock:

Over four years, nearly five hundred questions about international affairs were asked of thousands of GJP’s forecasters, generating well over one million judgments about the future.  But even at the individual level, the numbers quickly added up.  In year 1 alone, Doug Lorch made roughly one thousand separate forecasts.

Doug’s accuracy was as impressive as his volume.  At the end of the first year, Doug’s overall Brier score was 0.22, putting him in the fifth spot among the 2,800 competitors in the Good Judgment Project…

In year 2, Doug joined a superforecaster team and did even better, with a final Brier score of 0.14, making him the best forecaster of the 2,800 GJP volunteers.  He also beat by 40% a prediction market in which traders bought and sold futures contracts on the outcomes of the same questions.  He was the only person to beat the extremizing algorithm.  And Doug not only beat the control group’s “wisdom of the crowd,” he surpassed it by more than 60%, meaning that he single-handedly exceeded the fourth-year performance target that IARPA set for multimillion-dollar research programs that were free to use every trick in the forecasting textbook for improving accuracy.

Tetlock points out that Doug Lorch was not uniquely gifted:

There were 58 others among the 2,800 volunteers who scored at the top of the charts in year 1.  They were our first class of superforecasters.  At the end of year 1, their collective Brier score was 0.25, compared with 0.37 for all the other forecasters—and that gap grew in later years so that by the end of the four-year tournament, superforecasters had outperformed regulars by 60%.  Another gauge of how good superforecasters were is how much further they could see into the future.  Across all four years of the tournament, superforecasters looking out three hundred days were more accurate than regular forecasters looking out one hundred days.  In other words, regular forecasters needed to triple their foresight to see as far as superforecasters.

What if the superforecasters’ performance was due to luck?  After all, if you start with 2,800 people flipping coins, some of those people by sheer luck will flip a high proportion of heads.  If luck was a significant factor, then we should expect regression to the mean: the superforecasters in year 1 should perform less well, on the whole, in year 2.  But that didn’t happen.  That doesn’t mean luck isn’t a factor, because it is a factor in many of the questions that were asked.  Tetlock writes:

So we have a mystery.  If chance is playing a significant role, why aren’t we observing significant regression of superforecasters as a whole toward the overall mean?  An offsetting process must be pushing up superforecasters’ performance numbers.  And it’s not hard to guess what that was: after year 1, when the first cohort of superforecasters was identified, we congratulated them, anointed them “super,” and put them on teams with fellow superforecasters.  Instead of regressing toward the mean, their scores got even better.  This suggests that being recognized as “super” and placed on teams of intellectually stimulating colleagues improved their performance enough to erase regression to the mean we would otherwise have seen.  In years 3 and 4, we harvested fresh crops of superforecasters and put them to work in elite teams.  That gave us more apples-to-apples comparisons.  The next cohorts continued to do as well or better than they did in the previous year, again contrary to the regression hypothesis.

That’s not to say there was no regression to the mean.  Each year, roughly 30% of the individual superforecasters fell out of the top 2% the following year.  This confirms that luck is a factor.  As Tetlock points out, even superstar athletes occasionally look less than stellar.
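The coin-flipping intuition is easy to check with a toy simulation (illustrative numbers only): if year-1 rankings were driven purely by luck, the year-1 “stars” would fall straight back to chance the next year—which is not what happened to the superforecasters.

```python
import random

def luck_only_stars(n_people=2800, n_flips=100, top_n=58, seed=2):
    """If success were pure luck, year-1 stars should regress to chance in year 2."""
    rng = random.Random(seed)
    year1 = [sum(rng.random() < 0.5 for _ in range(n_flips)) for _ in range(n_people)]
    stars = sorted(range(n_people), key=lambda i: year1[i], reverse=True)[:top_n]
    year1_avg = sum(year1[i] for i in stars) / top_n
    # A fresh, independent "year 2" for the same people:
    year2_avg = sum(sum(rng.random() < 0.5 for _ in range(n_flips)) for _ in stars) / top_n
    return round(year1_avg, 1), round(year2_avg, 1)

print(luck_only_stars())  # roughly (61.0, 50.0): luck-only stars regress all the way back
```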

 

SUPERSMART?

Tetlock introduces Sanford “Sandy” Sillman.  In 2008, he was diagnosed with multiple sclerosis, which was debilitating.  Walking was difficult.  Even typing was a challenge.  Sandy had to retire from his job as an atmospheric scientist.

How smart is Sandy?  He earned a double major in math and physics from Brown University, plus a master of science degree from MIT’s technology and policy program, along with a second master’s degree, in applied mathematics, from Harvard, and finally a PhD in applied physics from Harvard.  Furthermore, Sandy’s intelligence isn’t confined to math and physics.  He is fluent in French, Russian, Italian, and Spanish.

In year 1 of GJP, Sandy finished with an overall Brier score of 0.19, which put him in a tie for overall champion.

There’s an obvious question about whether superforecasters are simply more knowledgeable and intelligent than others.  Tetlock and Barbara Mellers tested the forecasters.  It turns out that although superforecasters have well above average intelligence, they did not score off-the-charts high, and most fall well short of so-called genius territory, which is often defined as the top 1%, or an IQ of 135 and up.

Tetlock concludes that knowledge and intelligence help, but they add little beyond a certain threshold.

But having the requisite knowledge and intelligence is not enough.  Many clever and informed forecasters in the tournament fell far short of superforecaster accuracy.  And history is replete with brilliant people who made forecasts that proved considerably less than prescient.

For someone to become a superforecaster, she must develop the right habits of thinking.   Tetlock gives an example.

DAVOS/SWITZERLAND, 01/28/2001 – President of the Palestinian Authority Yasser Arafat.  (Wikimedia Commons)

On October 12, 2004, Yasser Arafat became severely ill with vomiting and abdominal pain.  On November 11, 2004, Arafat was pronounced dead.  There was speculation that he had been poisoned.  In July 2012, scientists at Switzerland’s Lausanne University Institute of Radiation Physics announced that they had tested some of Arafat’s belongings and found high levels of polonium-210, a radioactive element that can be deadly if ingested.

Two separate agencies, one in France and one in Switzerland, decided to test Arafat’s body.  So IARPA asked forecasters the following question: “Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?”

How would someone answer this question?  Most people would follow their hunch about the matter.  Some people might feel that “Israel would never do that!”, while others might feel that “Of course Israel did it!”  Following your hunch is not the right way to approach the question.  How did the superforecaster Bill Flack answer the question?

…Bill asked himself how Arafat’s remains could have been contaminated with enough polonium to trigger a positive result.  Obviously, “Israel poisoned Arafat” was one way.  But because Bill carefully broke the question down, he realized there were others.  Arafat had many Palestinian enemies.  They could have poisoned him.  It was also possible that there had been “intentional postmortem contamination by some Palestinian faction looking to give the appearance that Israel had done a Litvinenko on Arafat,” Bill told me later.  These alternatives mattered because each additional way Arafat’s body could have been contaminated with polonium increased the probability that it was.

What’s the next step?  Before getting to this, Tetlock describes in detail an American family, the Renzettis, who have one child, and asks how likely it is that they have a pet.  The first step in answering that question is to learn what percentage of American households have a pet.

Statisticians call that the base rate—how common something is within a broader class.  Daniel Kahneman has a much more evocative visual term for it.  He calls it the “outside view”—in contrast to the “inside view,” which is the specifics of the particular case.  A few minutes with Google tells me about 62% of American households own pets.  That’s the outside view here.  Starting with the outside view means I will start by estimating that there is a 62% chance the Renzettis have a pet.  Then I will turn to the inside view—all those details about the Renzettis—and use them to adjust that initial 62% up or down.

Tetlock comments:

It’s natural to be drawn to the inside view.  It’s usually concrete and filled with engaging detail we can use to craft a story about what’s going on.  The outside view is typically abstract, bare, and doesn’t lend itself so readily to storytelling.  So even smart, accomplished people routinely fail to consider the outside view.

Tetlock writes:

Here we have a famous person who is dead.  Major investigative bodies think there is enough reason for suspicion that they are exhuming the body.  Under those circumstances, how often would the investigation turn up evidence of poisoning?  I don’t know and there is no way to find out.  But I do know there is at least a prima facie case that persuades courts and medical investigators that this is worth looking into.  It has to be considerably above zero.  So let’s say it’s at least 20%.  But the probability can’t be 100% because if it were that clear and certain the evidence would have been uncovered before burial.  So let’s say the probability cannot be higher than 80%.  That’s a big range.  The midpoint is 50%.  So that outside view can serve as our starting point.

Someone might wonder why you couldn’t start with the inside view and then consider the outside view.  This wouldn’t work, however, due to anchoring.  When we make estimates, we tend to start with some number and adjust.  If we start with the outside view, we have a reasonable starting point, but if we start with the inside view, we may end up anchoring on a number that is not a reasonable starting point.

Having established the outside view on the Arafat question, we next turn to the inside view.  What are the hypotheses?  Israel could have poisoned Arafat.  Arafat’s Palestinian enemies could have poisoned him.  Arafat’s remains could have been contaminated to make it look like he was poisoned.  Tetlock:

Start with the first hypothesis: Israel poisoned Yasser Arafat with polonium.  What would it take for that to be true?

    • Israel had, or could obtain, polonium.
    • Israel wanted Arafat dead badly enough to take a big risk.
    • Israel had the ability to poison Arafat with polonium.

Each of these elements could then be researched—looking for evidence pro and con—to get a sense of how likely they are to be true, and therefore how likely the hypothesis is to be true.  Then it’s on to the next hypothesis.  And the next.

This sounds like detective work because it is—or to be precise, it is detective work as real investigators do it, not the detectives on TV shows.  It’s methodical, slow, and demanding.

Tetlock concludes:

A brilliant puzzle solver may have the raw material for forecasting, but if he doesn’t also have an appetite for questioning basic, emotionally charged beliefs he will often be at a disadvantage relative to a less intelligent person who has a greater capacity for self-critical thinking.  It’s not the raw crunching power you have that matters most.  It’s what you do with it.

[…]

For superforecasters, beliefs are hypotheses to be tested, not treasures to be guarded.

 

SUPERQUANTS?

Perhaps superforecasters are exceptionally good at using math to make their forecasts?  Although a superforecaster will occasionally consult a mathematical model, the vast majority of the time superforecasters use very little math.

That said, superforecasters tend to be granular in their probability estimates.  Is this justified?  Tetlock:

So how can we know that the granularity we see among superforecasters is meaningful?… The answer lies in the tournament data.  Barbara Mellers has shown that granularity predicts accuracy: the average forecaster who sticks with the tens—20%, 30%, 40%—is less accurate than the finer-grained forecaster who uses fives—20%, 25%, 30%—and still less accurate than the even finer-grained forecaster who uses ones—20%, 21%, 22%.
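The underlying point—that rounding away granularity throws away information—can be illustrated with a toy simulation (not Mellers’s actual analysis): take a forecaster who knows each event’s true probability, round her reports to different grids, and compare average Brier scores.

```python
import random

def mean_brier(grid, n=100_000, seed=0):
    """Average two-category Brier score for a forecaster who knows each event's
    true probability but reports it rounded to the nearest multiple of `grid`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        p = rng.random()                       # true probability of the event
        outcome = 1 if rng.random() < p else 0
        f = round(p / grid) * grid             # report rounded to the grid
        total += (f - outcome) ** 2 + ((1 - f) - (1 - outcome)) ** 2
    return round(total / n, 4)

for grid in (0.10, 0.05, 0.01):                # tens, fives, ones
    print(grid, mean_brier(grid))
# Coarser rounding gives a slightly worse (higher) average score.
```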

 

SUPERNEWSJUNKIES?

Tetlock:

Superforecasting isn’t a paint-by-numbers method but superforecasters often tackle questions in a roughly similar way—one that any of us can follow: Unpack the question into components.  Distinguish as sharply as you can between the known and unknown and leave no assumptions unscrutinized.  Adopt the outside view and put the problem into a comparative perspective that downplays its uniqueness and treats it as a special case of a wider class of phenomena.  Then adopt the inside view that plays up the uniqueness of the problem.  Also explore the similarities and differences between your views and those of others—and pay special attention to prediction markets and other methods of extracting wisdom from crowds.  Synthesize all these different views into a single vision as acute as that of a dragonfly.  Finally, express your judgment as precisely as you can, using a finely grained scale of probability.

This is just the beginning.  The next step, which is typically repeated many times, is to update your predictions as you get new information.

Photo by Marek Uliasz.

Superforecasters update their forecasts much more frequently than regular forecasters.  One might think that superforecasters are super simply because they are news junkies.  However, their initial forecasts—made before any updating—were already at least 50% more accurate than those of regular forecasters, so constant updating alone can’t explain their edge.  Furthermore, properly updating forecasts on the basis of new information requires the same skills used in making the initial forecasts.  Tetlock:

…there are two dangers a forecaster faces after making the initial call.  One is not giving enough weight to new information.  That’s underreaction.  The other danger is overreacting to new information, seeing it as more meaningful than it is, and adjusting a forecast too radically.

Both under- and overreaction can diminish accuracy.  Both can also, in extreme cases, destroy a perfectly good forecast.

One typical reason for underreaction to new information is that people often become committed to their beliefs, especially when they have committed to them publicly and have an ego investment in them.  Superforecasters, by contrast, have no trouble changing their minds on the basis of new information.

While superforecasters update their forecasts much more often than regular forecasters, they do so in small increments.  They tend not to overreact.
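One way to picture measured, incremental updating (an illustration, not a formula Tetlock prescribes in this passage) is Bayes’ rule: multiply the prior odds by a likelihood ratio that reflects how diagnostic the new information really is.  Weak evidence should move the forecast only a little; treating weak evidence as if it were strong is overreaction.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Update a probability given new evidence.

    likelihood_ratio: P(evidence | event will happen) / P(evidence | it won't).
    Values near 1 mean weakly diagnostic evidence and should produce small updates.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

print(round(bayes_update(0.60, 1.3), 2))  # 0.66 -- a small increment on weak evidence
print(round(bayes_update(0.60, 4.0), 2))  # 0.86 -- a big jump, warranted only if the
                                          #         evidence really is that diagnostic
```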

 

PERPETUAL BETA

Superforecasters have a growth mindset.  They believe they can get better with work, and they do.

(Illustration by Tereza Paskova)

Tetlock notes that John Maynard Keynes failed twice as an investor, and was almost wiped out during the first failure, before he settled on a value investing approach.  Keynes came to understand value investing through his own reflection.  He didn’t learn about it from Ben Graham, who is regarded as the father of value investing.  In any case, Keynes turned out to be enormously successful as an investor despite investing during the Great Depression of the 1930s.

Keynes had a growth mindset: try, fail, analyze, adjust, try again.

Improving requires getting good feedback.  Meteorologists get feedback on whether their forecasts were correct on a daily basis.  Bridge players also get fairly immediate feedback on how well they’re playing.  The trouble for forecasters is twofold:  first, the forecasts must be specific enough to be testable; and second, the time lag between the forecast and the result is often long, unlike for meteorologists or bridge players.

Pulling It All Together—Portrait of a Superforecaster

In philosophic outlook, they tend to be:

    • Cautious: Nothing is certain.
    • Humble: Reality is infinitely complex.
    • Nondeterministic: What happens is not meant to be and does not have to happen.

In their abilities and thinking styles, they tend to be:

    • Actively Open-minded: Beliefs are hypotheses to be tested, not treasures to be protected.
    • Intelligent and Knowledgeable, with a “Need for Cognition”: Intellectually curious, enjoy puzzles and mental challenges.
    • Reflective: Introspective and self-critical.
    • Numerate: Comfortable with numbers.

In their methods of forecasting, they tend to be:

    • Pragmatic: Not wedded to any idea or agenda.
    • Analytical: Capable of stepping back from the tip-of-your-nose perspective and considering other views.
    • Dragonfly-eyed: Value diverse views and synthesize them into their own.
    • Probabilistic: Judge using many grades of maybe.
    • Thoughtful Updaters: When facts change, they change their minds.
    • Good Intuitive Psychologists: Aware of the value of checking thinking for cognitive and emotional biases.

In their work ethic, they tend to have:

    • A Growth Mindset: Believe it’s possible to get better.
    • Grit: Determined to keep at it however long it takes.

Tetlock concludes by noting that the strongest predictor of a forecaster becoming a superforecaster is perpetual beta, the degree to which one is committed to belief updating and self-improvement.  Perpetual beta is three times as powerful a predictor as its closest rival, intelligence.

 

SUPERTEAMS

(Photo by Chrisharvey)

Tetlock writes:

[Groups] let people share information and perspectives.  That’s good.  It helps make dragonfly eye work, and aggregation is critical to accuracy.  Of course aggregation can only do its magic when people form judgments independently, like the fairgoers guessing the weight of the ox.  The independence of judgments ensures that errors are more or less random, so they cancel each other out.  When people gather and discuss in a group, independence of thought and expression can be lost.  Maybe one person is a loudmouth who dominates the discussion, or a bully, or a superficially impressive talker, or someone with credentials that cow others into line.  In so many ways, a group can get people to abandon independent judgment and buy into errors.  When that happens, the mistakes will pile up, not cancel out.
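
The “errors cancel out” claim is easy to see in a quick simulation.  The numbers below are mine (only the ox’s weight of 1,198 pounds comes from the Galton story the quote alludes to), and the anchoring model of a dominant voice is a deliberately crude stand-in for lost independence.

```python
# A quick simulation of why independence matters: independent errors average out,
# correlated errors (everyone pulled toward one loudmouth's anchor) do not.

import random
from statistics import mean

random.seed(0)
true_weight = 1198                     # the ox's actual weight in Galton's story
n_guessers = 500

# Independent guesses: each person errs in his or her own direction.
independent = [true_weight + random.gauss(0, 100) for _ in range(n_guessers)]

# Lost independence: every guess is pulled 70% of the way toward one confident anchor.
anchor = true_weight + 150
correlated = [0.7 * anchor + 0.3 * g for g in independent]

print(round(mean(independent)))        # lands very close to 1198
print(round(mean(correlated)))         # biased toward the anchor, roughly 1300
```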

The GJP randomly assigned several hundred forecasters to work alone and several hundred to work in teams.  At the end of the first year, the teams were 23% more accurate than the individuals.  GJP kept experimenting.

The results speak for themselves.  On average, when a forecaster did well enough in year 1 to become a superforecaster, and was put on a superforecaster team in year 2, that person became 50% more accurate.  An analysis in year 3 got the same result.  Given that these were collections of strangers tenuously connected in cyberspace, we found that result startling.

Tetlock adds:

How did superteams do so well?  By avoiding the extremes of groupthink and Internet flame wars.  And by fostering minicultures that encouraged people to challenge each other respectfully, admit ignorance, and request help.

 

THE LEADER’S DILEMMA

Good leaders have to be confident and decisive.  But good leaders also need good forecasts in order to make good decisions, and we’ve seen that effective forecasting requires self-critical questioning and doubt.  How can the same person be both a good forecaster and a decisive leader?  Tetlock:

Fortunately, the contradiction between being a superforecaster and a superleader is more apparent than real.  In fact, the superforecaster model can help make good leaders superb and the organizations they lead smart, adaptable, and effective.  The key is an approach to leadership and organization first articulated by a nineteenth-century Prussian general, perfected by the German army of World War II, made foundational doctrine by the modern American military, and deployed by many successful corporations today.  You might even find it at your neighborhood Walmart.

Helmuth von Moltke the Elder (1800-1891). Photo by Georgios Kollidas.

Helmuth von Moltke was a famous nineteenth-century Prussian general.  One of Moltke’s axioms is: “In war, everything is uncertain.”  While having a plan is important, you can never entirely trust your plan.  Moltke: “No plan of operations extends with certainty beyond the first encounter with the enemy’s main strength.”  Moltke also wrote that, “It is impossible to lay down binding rules” that apply in all circumstances.  “Two cases will never be exactly the same” in war.  Improvisation is essential.  Tetlock:

Moltke trusted that his officers were up to the task… In Germany’s war academies, scenarios were laid out and students were invited to suggest solutions and discuss them collectively.  Disagreement was not only permitted, it was expected, and even the instructor’s views could be challenged… Even the views of generals were subject to scrutiny.

Tetlock continues:

What ties all of this together… is the command principle of Auftragstaktik.  Usually translated today as “mission command,” the basic idea is simple. “War cannot be conducted from the green table,” Moltke wrote, using an expression that referred to top commanders at headquarters.  “Frequent and rapid decisions can be shaped only on the spot according to estimates of local conditions.”  Decision-making power must be pushed down the hierarchy so that those on the ground—the first to encounter surprises on the evolving battlefield—can respond quickly.  Of course those on the ground don’t see the bigger picture.  If they made strategic decisions the army would lose coherence and become a collection of tiny units, each seeking its own ends.  Auftragstaktik blended strategic coherence and decentralized decision making with a simple principle: commanders were to tell subordinates what their goal is but not how to achieve it.

 

ARE THEY REALLY SO SUPER?

Are the superforecasters super primarily due to innate abilities or primarily due to specific habits they’ve developed?   Tetlock answers:

They score higher than average on measures of intelligence and open-mindedness, although they are not off the charts.  What makes them so good is less what they are than what they do—the hard work of research, the careful thought and self-criticism, the gathering and synthesizing of other perspectives, the granular judgments and relentless updating.

Tetlock again:

My sense is that some superforecasters are so well practiced in System 2 corrections—such as stepping back to take the outside view—that these techniques have become habitual.  In effect, they are now part of their System 1… No matter how physically or cognitively demanding a task may be—cooking, sailing, surgery, operatic singing, flying fighter jets—deliberative practice can make it second nature.

Very little is predictable five or more years into the future.  Tetlock confirmed this fact in his EPJ research: Expert predictions declined towards chance five years out.  And yet governments need to make plans that extend five or more years into the future.  Tetlock comments:

Probability judgments should be explicit so we can consider whether they are as accurate as they can be.  And if they are nothing but a guess, because that’s the best we can do, we should say so.  Knowing what we don’t know is better than thinking we know what we don’t.

Tetlock adds:

Kahneman and other pioneers of modern psychology have revealed that our minds crave certainty and when they don’t find it, they impose it.  In forecasting, hindsight bias is the cardinal sin.  Recall how experts stunned by the Gorbachev surprise quickly became convinced it was perfectly explicable, even predictable, although they hadn’t predicted it.

Brushing off surprises makes the past look more predictable than it was—and this encourages the belief that the future is much more predictable than it is.

Tetlock concludes:

Savoring how history could have generated an infinite array of alternative outcomes and could now generate a similar array of alternative futures, is like contemplating the one hundred billion known stars in our galaxy and the one hundred billion known galaxies.  It instills profound humility.

…But I also believe that humility should not obscure the fact that people can, with considerable effort, make accurate forecasts about at least some developments that really do matter.

 

WHAT’S NEXT?

Tetlock hopes that the forecasting lessons discussed in his book will be widely adopted:

Consumers of forecasts will stop being gulled by pundits with good stories and start asking pundits how their past predictions fared—and reject answers that consist of nothing but anecdotes and credentials.  Just as we now expect a pill to have been tested in peer-reviewed experiments before we swallow it, we will expect forecasters to establish the accuracy of their forecasting with rigorous testing before we heed their advice.  And forecasters themselves will realize… that these higher expectations will ultimately benefit them, because it is only with the clear feedback that comes from rigorous testing that they can improve their foresight.  It could be huge—an “evidence-based forecasting” revolution similar to the “evidence-based medicine” revolution, with consequences every bit as significant.

Or nothing may change.  Revolutionaries aren’t supposed to say failure is possible, but let’s think like superforecasters here and acknowledge that things may go either way.

Change

Tetlock writes about a Boston doctor named Ernest Amory Codman who proposed an idea he called the End Result System.

Hospitals should record what ailments incoming patients had, how they were treated, and—most important—the end result of each case.  These records should be compiled and statistics released so consumers could choose hospitals on the basis of good evidence.  Hospitals would respond to consumer pressure by hiring and promoting doctors on the same basis.  Medicine would improve, to the benefit of all.

Today, hospitals do much of what Codman suggested, and physicians have embraced such measurement.  But when Codman first put forth his idea, the medical establishment rejected it.  Hospitals didn’t want to pay for record keepers, and doctors, who were already respected, had little to gain: keeping score could only hurt their reputations.  Codman was fired from Massachusetts General Hospital and lost his teaching post at Harvard.  Eventually, however, his core idea was accepted.

Other areas of society are following the example of evidence-based medicine.  There’s evidence-based government policy.  There’s evidence-based philanthropy, led by the Gates Foundation.  There’s evidence-based management of sports teams.

One possible criticism of superforecasting is that it doesn’t deal with big enough questions.  Tetlock responds that a lot of smaller questions can ultimately shed light on bigger questions.  He calls this Bayesian question clustering.  For instance, in considering a big question like whether there will be another Korean war, you can focus on smaller questions related to missile launches, nuclear tests, cyber attacks, and artillery shelling.  The answers are cumulative.  The more yeses, the more likely the overall situation will end badly.
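
A rough sketch of what “the answers are cumulative” could look like in practice: treat each small question as a signal with a likelihood ratio and let the answers shift the odds on the big question.  This is my own construction under a naive independence assumption, and every number below is invented; it is meant only to show the mechanics of the idea, not Tetlock’s actual procedure.

```python
# Combining small-question answers into odds on a big question (all numbers invented).

def combine(prior, signals):
    """signals: (answered_yes, likelihood_ratio_if_yes, likelihood_ratio_if_no) triples,
    naively assumed to be independent pieces of evidence."""
    odds = prior / (1 - prior)
    for answered_yes, lr_yes, lr_no in signals:
        odds *= lr_yes if answered_yes else lr_no
    return odds / (1 + odds)

# Big question (hypothetical): serious escalation on the Korean peninsula within a year?
prior = 0.05
small_questions = [
    (True,  2.0, 0.7),   # missile launch in the next quarter? yes
    (True,  1.5, 0.8),   # nuclear test? yes
    (False, 3.0, 0.6),   # large-scale cyber attack? no
    (True,  1.8, 0.75),  # artillery shelling? yes
]

print(round(combine(prior, small_questions), 3))   # more yeses push the probability higher
```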

It’s interesting that hedgehog-minded experts are often good at coming up with big questions even though they’re usually not good at making forecasts.  Tetlock writes:

While we may assume that a superforecaster would also be a superquestioner, we don’t actually know that.  Indeed, my best scientific guess is that they often are not.  The psychological recipe for the ideal superforecaster may prove to be quite different from that for the ideal superquestioner, as superb question generation often seems to accompany a hedgehog-like incisiveness and confidence that one has a Big Idea grasp of the deep drivers of an event.  That’s quite a different mindset from the foxy eclecticism and sensitivity to uncertainty that characterizes superb forecasting.

Tetlock continues:

Superforecasters and superquestioners need to acknowledge each other’s complementary strengths, not dwell on each other’s alleged weaknesses.  Friedman poses provocative questions that superforecasters should use to sharpen their foresight; superforecasters generate well-calibrated answers that superquestioners should use to fine-tune and occasionally overhaul their mental models of reality…

…But there’s a much bigger collaboration I’d like to see.  It would be the Holy Grail of my research program: using forecasting tournaments to depolarize unnecessarily polarized policy debates and make us collectively smarter.

 

TEN COMMANDMENTS FOR ASPIRING SUPERFORECASTERS

Triage.

Focus on questions where your hard work is likely to pay off.  Don’t waste time either on easy “clocklike” questions (where simple rules of thumb can get you close to the right answer) or on impenetrable “cloud-like” questions (where even fancy statistical models can’t beat the dart-throwing chimp).  Concentrate on questions in the Goldilocks zone of difficulty, where effort pays off the most.

Break seemingly intractable problems into tractable sub-problems.

Decompose the problem into its knowable and unknowable parts.  Flush ignorance into the open.  Expose and examine your assumptions.  Dare to be wrong by making your best guesses.  Better to discover errors quickly than to hide them behind vague verbiage… The surprise is how often remarkably good probability estimates arise from a remarkably crude series of assumptions and guesstimates.
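
The classic illustration of this commandment is Fermi-style estimation, such as the famous question of how many piano tuners there are in Chicago.  The sketch below is my version of that decomposition; every input is a deliberately crude guesstimate, which is exactly the point, since each sub-estimate flushes an assumption into the open where it can be checked.

```python
# Fermi-style decomposition: how many piano tuners are there in Chicago?
# Every input is a deliberately crude guesstimate.

population_of_chicago = 2_500_000
people_per_household  = 2.5
share_of_households_with_piano = 1 / 20   # guess: one household in twenty owns a piano
tunings_per_piano_per_year = 1            # guess: a piano gets tuned about once a year
tunings_per_tuner_per_day  = 4            # guess: a tuner can service about four pianos a day
working_days_per_year      = 250

pianos = population_of_chicago / people_per_household * share_of_households_with_piano
tunings_needed = pianos * tunings_per_piano_per_year
tunings_supplied_per_tuner = tunings_per_tuner_per_day * working_days_per_year

print(round(tunings_needed / tunings_supplied_per_tuner))   # roughly 50 tuners
```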

Strike the right balance between inside and outside views.

Superforecasters know that there is nothing new under the sun.  Nothing is 100% “unique.”… Superforecasters are in the habit of posing the outside-view question: How often do things of this sort happen in situations of this sort?

Strike the right balance between under- and overreacting to evidence.

The best forecasters tend to be incremental belief updaters, often moving from probabilities of, say, 0.4 to 0.35 or from 0.6 to 0.65, distinctions too subtle to capture with vague verbiage, like “might” or “maybe,” but distinctions that, in the long run, define the difference between good and great forecasters.

Yet superforecasters also know how to jump, to move their probability estimates fast in response to diagnostic signals.  Superforecasters are not perfect Bayesian updaters but they are better than most of us.  And that is largely because they value this skill and work hard at cultivating it.

Look for the clashing causal forces at work in each problem.

For every good policy argument, there is typically a counterargument that is at least worth acknowledging.  For instance, if you are a devout dove who believes that threatening military action never brings peace, be open to the possibility that you might be wrong about Iran.  And the same advice applies if you are a devout hawk who believes that soft “appeasement” policies never pay off.  Each side should list, in advance, the signs that would nudge them toward the other.

Now here comes the really hard part.  In classical dialectics, thesis meets antithesis, producing synthesis.  In dragonfly eye, one view meets another and another and another—all of which must be synthesized into a single image.  There are no paint-by-number rules here.  Synthesis is an art that requires reconciling irreducibly subjective judgments.  If you do it well, engaging in this process of synthesizing should transform you from a cookie-cutter dove or hawk into an odd hybrid creature, a dove-hawk, with a nuanced view of when tougher or softer policies are likelier to work.

Strive to distinguish as many degrees of doubt as the problem permits but no more.

Few things are either certain or impossible.  And “maybe” isn’t all that informative.  So your uncertainty dial needs more than three settings.  Nuance matters.  The more degrees of uncertainty you can distinguish, the better a forecaster you are likely to be.  As in poker, you have an advantage if you are better than your competitors at separating 60/40 bets from 40/60—or 55/45 from 45/55.  Translating vague-verbiage hunches into numeric probabilities feels unnatural at first but it can be done.  It just requires patience and practice.  The superforecasters have shown what is possible.

Strike the right balance between under- and overconfidence, between prudence and decisiveness.

Superforecasters understand the risks both of rushing to judgment and of dawdling too long near “maybe.”…They realize that long-term accuracy requires getting good scores on both calibration and resolution… It is not enough just to avoid the most recent mistake.  They have to find creative ways to tamp down both types of forecasting errors—misses and false alarms—to the degree a fickle world permits such uncontroversial improvements in accuracy.
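
Calibration and resolution can be measured, not just admired.  Calibration asks whether the things you call “70%” happen about 70% of the time; resolution asks whether your forecasts move decisively away from the overall base rate rather than hugging “maybe.”  The sketch below uses the standard binned-forecast definitions with made-up data; it is not a formula quoted from the book.

```python
# Measuring calibration error and resolution from a set of resolved forecasts
# (made-up data; lower calibration error and higher resolution are better).

from collections import defaultdict

def calibration_and_resolution(forecasts, outcomes):
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bins[round(f, 1)].append(o)                                  # group forecasts into 10% buckets
    calibration_error = 0.0
    resolution = 0.0
    for bucket, results in bins.items():
        freq = sum(results) / len(results)
        calibration_error += len(results) * (bucket - freq) ** 2     # forecast vs. what actually happened
        resolution        += len(results) * (freq - base_rate) ** 2  # boldness away from the base rate
    return calibration_error / n, resolution / n

forecasts = [0.9, 0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1]
outcomes  = [1,   1,   1,   0,   0,   0,   1,   0]
print(calibration_and_resolution(forecasts, outcomes))
```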

Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases.

Don’t try to justify or excuse your failures.  Own them!  Conduct unflinching postmortems: Where exactly did I go wrong?  And remember that although the more common error is to learn too little from failure and to overlook flaws in your basic assumptions, it is also possible to learn too much (you may have been basically on the right track but made a minor technical mistake that had big ramifications).  Also don’t forget to do postmortems on your successes too.  Not all successes imply that your reasoning was right.  You may have just lucked out by making offsetting errors.  And if you keep confidently reasoning along the same lines, you are setting yourself up for a nasty surprise.

Bring out the best in others and let others bring out the best in you.

Master the fine arts of team management, especially perspective taking (understanding the arguments of the other side so well that you can reproduce them to the other’s satisfaction), precision questioning (helping others to clarify their arguments so they are not misunderstood), and constructive confrontation (learning to disagree without being disagreeable).  Wise leaders know how fine the line can be between a helpful suggestion and micromanagerial meddling or between a rigid group and a decisive one or between a scatterbrained group and an open-minded one.

Master the error-balancing bicycle.

Implementing each commandment requires balancing opposing errors… Learning requires doing, with good feedback that leaves no ambiguity about whether you are succeeding… or whether you are failing… Also remember that practice is not just going through the motions of making forecasts, or casually reading the news and tossing out probabilities.  Like all other known forms of expertise, superforecasting is the product of deep, deliberative practice.

Don’t treat commandments as commandments.

Guidelines are the best we can do in a world where nothing is certain or exactly repeatable.  Superforecasting requires constant mindfulness, even when—perhaps especially when—you are dutifully trying to follow these commandments.

 

BOOLE MICROCAP FUND

An equal weighted group of micro caps generally far outperforms an equal weighted (or cap-weighted) group of larger stocks over time.  See the historical chart here:  https://boolefund.com/best-performers-microcap-stocks/

This outperformance increases significantly by focusing on cheap micro caps.  Performance can be further boosted by isolating cheap microcap companies that show improving fundamentals.  We rank microcap stocks based on these and similar criteria.

There are roughly 10-20 positions in the portfolio.  The size of each position is determined by its rank.  Typically the largest position is 15-20% (at cost), while the average position is 8-10% (at cost).  Positions are held for 3 to 5 years unless a stock approaches intrinsic value sooner or an error has been discovered.

The mission of the Boole Fund is to outperform the S&P 500 Index by at least 5% per year (net of fees) over 5-year periods.  We also aim to outpace the Russell Microcap Index by at least 2% per year (net).  The Boole Fund has low fees.

 

If you are interested in finding out more, please e-mail me or leave a comment.

My e-mail: jb@boolefund.com

 

 

 

Disclosures: Past performance is not a guarantee or a reliable indicator of future results. All investments contain risk and may lose value. This material is distributed for informational purposes only. Forecasts, estimates, and certain information contained herein should not be considered as investment advice or a recommendation of any particular security, strategy or investment product. Information contained herein has been obtained from sources believed to be reliable, but not guaranteed. No part of this article may be reproduced in any form, or referred to in any other publication, without express written permission of Boole Capital, LLC.
