Prediction and Calibration (Prediction 3)

How heavy is the heaviest blue whale? What is the GDP of Chile? What time is sunrise tomorrow morning? You might say: I don’t know, and anyway, who cares?

In fact, practicing with trivial questions like these is a great way to improve your epistemic rationality. If you can learn to answer questions like these, then you can use that skill set to answer more important questions such as:

  • What are the chances of my company’s product succeeding in the market?
  • What is most likely wrong with my car?
  • Would I be better off working at a different company?

The book Superforecasting: The Art and Science of Prediction by Philip Tetlock introduces a series of tools to improve prediction. Through his systematic study of human judgement, Tetlock shows that practice is more important than intelligence, at least when it comes to making well-calibrated predictions. In fact, practice can even be more important than having more information about the problem! Talented forecasters working with publicly available information often outperformed CIA analysts with access to classified information.

It’s hard to overemphasize how important practice is. Being well-calibrated is synonymous with having strong epistemic rationality.

What does “well-calibrated” mean, technically? Calibration refers to your confidence level being in line with your error rate. If you say, “I’m about 70% sure that it’s going to rain today,” then you’re expressing a 70% confidence level in that prediction. If you make a variety of predictions at the 70% confidence level then you should be wrong 30% of the time.

But if you predict 70% odds and those events actually happen 90% of the time, then you are underconfident. If you predict 70% odds and the events only happen 50% of the time, you are overconfident. The goal is to be precisely confident: when you say something is 70% likely to happen, it should happen 70% of the time. The process is one of learning to turn your sophisticated mental models, and your gut feelings, into a single accurate and useful number.
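A minimal sketch of how you might check your own calibration from a log of past predictions. The function names, the 5-point tolerance, and the sample data are all made up for illustration, and it assumes each prediction is phrased so that the stated probability is 50% or higher.

```python
from collections import defaultdict

def calibration_table(predictions):
    """Group (stated_probability, came_true) pairs by stated probability.

    Returns {stated_probability: (observed_frequency, number_of_predictions)}.
    """
    buckets = defaultdict(list)
    for prob, came_true in predictions:
        buckets[round(prob, 1)].append(came_true)
    table = {}
    for prob in sorted(buckets):
        outcomes = buckets[prob]
        table[prob] = (sum(outcomes) / len(outcomes), len(outcomes))
    return table

def diagnose(stated, observed, tolerance=0.05):
    """Compare stated confidence with how often the events actually happened.

    Assumes stated >= 0.5; the tolerance is an arbitrary illustrative choice.
    """
    if observed > stated + tolerance:
        return "underconfident"
    if observed < stated - tolerance:
        return "overconfident"
    return "well calibrated"

# Made-up log: ten predictions at 70%, nine of which came true.
log = [(0.7, True)] * 9 + [(0.7, False)]
for stated, (observed, count) in calibration_table(log).items():
    print(f"said {stated:.0%}, happened {observed:.0%} of {count}: "
          f"{diagnose(stated, observed)}")
```

With only ten predictions in a bucket the observed frequency is very noisy, so treat a result like this as a hint rather than a verdict.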

Another reason that we should learn to think in terms of numerical probabilities — to turn our gut feelings into numbers like 70% — is that we can then share that number with other people.

You might think that it’s good enough to tell somebody that you’re “reasonably sure” that you didn’t leave the stove on, rather than saying you think the likelihood is 90%. But there is in fact no generally agreed-upon definition for words like "reasonably certain", "probable", "certain", and "likely".

In a survey, respondents gave definitions for "reasonably certain" that varied between 35% and 95%. Even "proved," which I would think of as synonymous with near-total certainty, means "65% likely" to some people, apparently. If somebody told you that they had proved something, would you assume they meant that thing was only 65% likely to be true?

To avoid communication breakdowns, use numbers to communicate probability.

Let’s return to our task of coming up with well-calibrated estimates, and try to apply these ideas to the blue whale problem we mentioned a moment ago. The question was: How heavy is the heaviest blue whale? The following was my actual thought process as I worked through this question. The aim here is not necessarily to show a correct, perfect way to go about making a prediction, but to illustrate the basic principles.

The first order of business in all prediction tasks is to clarify the exact meaning of the question, and what qualifies as an answer. For the sake of simplicity, let’s state that whatever weight Wikipedia gives for the “largest” or “heaviest” blue whale will be the accepted answer.

Feel free to stop here and go make your own estimate before returning to see my approach.

Next I put forth a gut-based approach. My intuition says that the answer falls between 25 and 100 tons, or 50,000 and 200,000lb, with about 80% confidence. I tend to lay down an initial off-the-cuff estimate before diving in to make a more precise guess using better tools.

Note. Remember that an 80% confidence indicates that about one in five predictions should be wrong.

Next I’m going to make a separate and independent estimate based on an analogy to a similar animal, using my favorite website, getguesstimate, to keep track of my probability distributions.

I would guess that healthy adult blue whales are no longer than 120ft and no shorter than 50ft. That doesn’t help much, in and of itself, but it’s a starting point. Here I am going to employ a technique of analogy and comparison to a thing that I feel more confident about. I think that a bottlenose dolphin weighs between 250lb and 400lb, and is about 6-8ft long. I base this on having seen them a few times and knowing roughly how heavy similar-sized animals are. I am, at least, much more confident about the weight of dolphins than I am about blue whales. I have, in other words, picked something that I am familiar with and understand moderately well, to make inferences about something I am very uncertain about.

If we simply scale up a bottlenose dolphin to the length of a blue whale, we would expect the giant dolphin to have about 2800 times as much mass, according to my calculation. This is because the animal’s volume scales with the square of its radius times its length, and since the radius grows in proportion to the length, that is more or less a cubic relationship.
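To make the cubic relationship concrete, here is a tiny sketch. The function name and the sample length ratios are illustrative assumptions, not the inputs used for the 2800 figure, and it assumes both animals have the same shape and density.

```python
def mass_scale_factor(length_ratio):
    """Mass ratio of two same-shaped, same-density bodies.

    Volume goes as radius**2 * length, and the radius grows in
    proportion to the length, so mass goes as length_ratio**3.
    """
    return length_ratio ** 3

# Illustrative length ratios (whale length divided by dolphin length).
for ratio in (10, 12, 14, 16):
    print(f"{ratio}x as long -> {mass_scale_factor(ratio):,}x as heavy")
```

A length ratio of 14 gives a factor of 2,744, which is in the neighborhood of the figure quoted above. Note how sensitive the result is: moving from 12 to 16 times as long more than doubles the mass factor.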

Whales are “thinner” than dolphins in profile, perhaps 40-60% as wide as dolphins relative to their length. So I apply a correction factor of 0.4 to 0.6 that reduces the mass accordingly.

If I multiply these distributions together, Guesstimate suggests that based on these assumptions we should assign a median expectation of about 390,000lb, meaning we expect there’s a 50% chance it’s lower than that and a 50% chance it’s higher than that. In order to give ourselves a scorable prediction — a prediction that can be judged right or wrong — we say that we are 50% confident that it’s higher than 390,000lb.
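If you want to try this kind of multiplication without Guesstimate, a rough Monte Carlo sketch in plain Python looks something like the following. Everything about the setup is an assumption on my part: each range from the text is treated as the 5th-to-95th percentile of a lognormal distribution, and the scaling factor is built as the cube of the length ratio. Because those choices are guesses, the numbers it prints should not be expected to match the Guesstimate figures quoted in the text.

```python
import math
import random

Z90 = 1.6449  # z-score that bounds the central 90% of a normal distribution

def sample_range(rng, low, high):
    """Draw from a lognormal whose 5th/95th percentiles are low/high.

    The lognormal shape is an assumption made for this sketch.
    """
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return rng.lognormvariate(mu, sigma)

def simulate_whale_weights(n=100_000, seed=0):
    """Return a sorted list of n simulated blue whale weights in pounds."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n):
        dolphin_lb = sample_range(rng, 250, 400)   # dolphin weight range
        whale_ft = sample_range(rng, 50, 120)      # whale length range
        dolphin_ft = sample_range(rng, 6, 8)       # dolphin length range
        thinness = sample_range(rng, 0.4, 0.6)     # whales are "thinner"
        scale = (whale_ft / dolphin_ft) ** 3       # cubic scaling with length
        weights.append(dolphin_lb * scale * thinness)
    weights.sort()
    return weights

def percentile(sorted_values, p):
    """Value below which roughly p percent of the sorted samples fall."""
    index = min(len(sorted_values) - 1, int(p / 100 * len(sorted_values)))
    return sorted_values[index]

weights = simulate_whale_weights()
for p in (20, 50, 80):
    print(f"{p}th percentile: {percentile(weights, p):,.0f} lb")
```

The point of the exercise is less the specific median than seeing how wide the resulting distribution is once several uncertain inputs are multiplied together.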

We finally look up the answer and find that the largest blue whale is 199 tons, or about 400,000lb. It turns out our median guess was pretty dang close!

If we make a whole bunch of similarly accurate median predictions, then we will be “right” 50% of the time and “wrong” 50% of the time, which means we are very well calibrated at the 50% point. In other words, we’d be really good at guessing the median value. If we’re wrong half the time on predictions with a similar confidence level, we still “win” because we’re well-calibrated, and being well-calibrated is the goal here.

But we need to hold off before we give ourselves credit for being well-calibrated. You may have wondered why I chose to use probability distributions rather than point-estimates. Distributions allow us to inspect the implications of our uncertainty, and to interrogate our own understanding after we have checked our answer.

The Guesstimate model suggests that, based on our assumptions, we believed there was a 20% chance that the animal would be bigger than 600,000lb, and similarly a 20% chance that it would be less than 243,000lb. If the original question had been “what are the odds that the largest blue whale weighs more than 600,000lb?” the model would have driven you to answer “20%”, which seems very high now that you know the correct answer is only ⅔ of that. If you made a number of similar predictions about animal sizes using similar assumptions, does it seem reasonable that you would be wrong only 20% of the time? So we need to care about both the median value and the shape of the probability distribution, which means thinking harder about the minima and maxima of our estimates, not just our final median result.

One thing that we could do to sharpen up not just the accuracy of our median estimate but also the robustness of our distribution is to make several estimates based on several different methodologies and average them together. You can then weight the averaging of these various distinct answers by your level of confidence in each approach. For example, if we decided to factor in our initial guess (that the whale was unlikely to weigh more than 200,000lb), we would actually make our estimate worse. But we know that was a low quality, off-the-cuff guess, so we (implicitly) didn’t give it very much weight. Or maybe you happened to remember some other fact, such as that a blue whale can weigh as much as 15 school buses. Then you can try to figure out how much a school bus might weigh, or even look up how much a school bus weighs if it’s actually important.
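Here is one way the confidence-weighted averaging could be sketched. The weights (1 for the gut guess, 9 for the model) are made up for illustration, and collapsing the gut range of 50,000 to 200,000lb to its midpoint is my own simplification.

```python
def weighted_average(estimates):
    """Combine (value, weight) pairs; weights need not sum to 1."""
    total_weight = sum(weight for _, weight in estimates)
    if total_weight <= 0:
        raise ValueError("at least one estimate needs a positive weight")
    return sum(value * weight for value, weight in estimates) / total_weight

estimates_lb = [
    (125_000, 1),  # midpoint of the gut range; low weight (hypothetical)
    (390_000, 9),  # median of the scaling model; high weight (hypothetical)
]
print(f"{weighted_average(estimates_lb):,.0f} lb")  # prints 363,500 lb
```

Even a small weight on the gut guess pulls the combined estimate down from 390,000lb, away from the true answer, which is the sense in which weighting in a poor estimate makes things worse.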

Another nice thing about building models such as this one is that we can check the validity of our assumptions in detail. For example, it turns out the largest measured blue whale was about 108ft long, placing its length higher than our median guess, but still falling within our range. On the other hand, we were badly wrong about how big bottlenose dolphins are (they can grow up to 12 feet!). If we had thought harder about the high and low ends of our ranges, our ultimate guess might have been better calibrated.

But, of course, this was only one prediction. We can’t really say whether we’re well-calibrated from one prediction; we can only say whether we were right or wrong this time. Thus, you need to collect a fairly large number of predictions across a wide variety of topics to learn your own calibration accuracy.

A final note: Remember, our aim is to master practical decision-making. The minutiae of decision theory are only valuable insofar as they help you. In your day-to-day life, estimating the true weight of a blue whale is not likely to be important. If it comes up in conversation, go with your off-the-cuff guess.

Furthermore, if it turns out to be important... you could — and should — look it up. A sophisticated forecasting procedure is only called for if the question is both important and uncertain. Thus, we practice our skills on these simpler, safer problems, so that they’re sharp when the harder problems, which we can't look up, present themselves.
