|This article is a work in progress. The author is working on it and it is not yet ready for review.|
Calibration describes the statistical consistency between a probabilistic forecast and the observed values. One can distinguish different forms of calibration, most importantly probabilistic calibration, marginal calibration and exceedance calibration. Among these, probabilistic calibration is by far the most widely used.
Probabilistic calibration refers to how closely the frequency with which forecast events occur matches the probabilities the forecaster assigned to them. For example, a forecaster who assigns 40% to each of 10 events, 4 of which ultimately occur, exhibits good calibration. If 3 or 5 of these events occur, the forecaster may still be reasonably well calibrated and may merely have been slightly unlucky. Greater deviations indicate that the forecaster is less well calibrated. Accurately judging a forecaster's calibration requires many resolved forecasts across the spectrum of probabilities.
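To make the "3 or 5 out of 10" intuition concrete, the following minimal sketch computes how likely each count is for a perfectly calibrated forecaster. It assumes the 10 events are independent, which the example above does not state, so the count of occurrences follows a binomial distribution. The function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability that exactly k of n independent events occur,
    when each event has probability p (independence is an assumption)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.4  # 10 forecasts at 40% each, as in the example above

for k in (3, 4, 5):
    # Roughly 0.215, 0.251 and 0.201 respectively
    print(f"P({k} of {n} occur) = {binomial_pmf(k, n, p):.3f}")

# Chance that a perfectly calibrated forecaster sees 3, 4 or 5 occurrences
print(f"P(3 to 5 occur) = {sum(binomial_pmf(k, n, p) for k in (3, 4, 5)):.3f}")
```

Under these assumptions, even a perfectly calibrated forecaster sees exactly 4 occurrences only about a quarter of the time, and 3, 4 or 5 occurrences about two thirds of the time. This is why a small number of forecasts says little about calibration.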
One common way to assess this visually is the calibration plot. Calibration plots are, roughly, vertical box-and-whisker diagrams showing the distribution of resolution frequencies across a given forecaster's track record. For example:
Here we can see a clear correlation between Metaculus' predictions and the resolutions. Moreover, we can see that the error is not systematic (i.e., the boxes are not consistently above or below the dotted "perfect calibration" line), meaning we can't use a simple linear correction to improve upon Metaculus' forecasts.
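The numbers behind a calibration plot can be sketched as follows. This is a simplified illustration, not the method Metaculus uses: it assumes binary outcomes coded as 0 or 1 and equal-width probability bins, and it computes only the observed frequency per bin, not the uncertainty ranges that the boxes and whiskers represent. The function name `calibration_table` and the sample data are made up for this example.

```python
def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and return, per
    non-empty bin: (bin lower edge, bin upper edge, number of forecasts,
    mean forecast probability, observed frequency of occurrence)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(forecasts, outcomes):
        # A forecast of exactly 1.0 is placed in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip bins with no forecasts
        mean_forecast = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append((idx / n_bins, (idx + 1) / n_bins, len(items),
                     mean_forecast, observed))
    return rows


# Made-up track record: forecast probabilities and outcomes (1 = occurred)
sample_forecasts = [0.15, 0.12, 0.85, 0.88, 0.81]
sample_outcomes = [0, 1, 1, 1, 0]

for low, high, count, mean_forecast, observed in calibration_table(
        sample_forecasts, sample_outcomes):
    print(f"[{low:.1f}, {high:.1f}): n={count}, "
          f"mean forecast={mean_forecast:.2f}, observed={observed:.2f}")
```

For a well-calibrated forecaster, the observed frequency in each bin should be close to the mean forecast in that bin, which corresponds to points lying near the "perfect calibration" diagonal.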
- Gneiting, T., Balabdaoui, F. and Raftery, A.E. (2007), Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69: 243-268. https://doi.org/10.1111/j.1467-9868.2007.00587.x