In a classic Hollywood “wire dilemma,” two heroes faced with a ticking time bomb must make a decision: which wire do they cut to disarm the bomb, red or blue? A debate ensues, a last-second decision is made, and the day is saved.
Making decisions under pressure with limited information is a skill valued by action movie fans and employers alike. But in the real world, the counterfactual isn’t always as clear as an exploding bomb. So how do we know whether a split-second decision is a bold act of brilliance, a disastrous lapse of judgement, or simple luck?
In statistics, that determination comes down to a measure of inter-rater reliability, symbolized by the Greek letter Kappa. The Kappa statistic has been used to determine whether the split-second actions of a pilot were justified, or whether a physician's treatment of a patient amounted to criminal negligence. In this post, I'll discuss how Kappa is used to determine whether a decision was a reliably decisive act or little more than a random guess.
The Kappa statistic is expressed as a number between 0 and 1 (although it can dip below zero). Anything over 0.60 is generally accepted as a reliably “non-random” level of agreement. A great free resource for calculating inter-rater reliability can be found at http://dfreelon.org/utils/recalfront/. On the site you’ll see the kind of data needed to calculate Kappa, and a few methods for calculating it.
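If you'd rather compute Kappa in code than upload a spreadsheet, here is a minimal sketch using scikit-learn (assumed installed); the codes below are invented for illustration. Each rater contributes one list of category labels, one entry per item, in the same order:

```python
# Minimal two-rater Kappa check: one list of category codes per rater,
# aligned so that position i in each list refers to the same item.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # hypothetical codes from rater A
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # hypothetical codes from rater B

print(f"Cohen's Kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```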
For exactly two raters, the usual choice is Cohen's Kappa, which can also incorporate weighting to account for the relative strength of disagreement between ordered categories. Fleiss' Kappa extends the idea to assess the reliability of three or more decision makers. When calculating Kappa, it is often helpful to cross-reference a second statistic: Alpha is a similar measure, but it is framed in terms of observed and expected disagreement rather than agreement. If we determine that both Kappa and Alpha are within an acceptable range, we can say that the raters are decidedly reliable in their decision-making.
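As a rough sketch of how those calculations might look in Python (assuming statsmodels and the krippendorff package are installed; the ratings below are invented), Fleiss' Kappa works from a table of category counts per item, while Krippendorff's Alpha takes one row of ratings per rater:

```python
import numpy as np
import krippendorff
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: 6 items (rows) rated by 3 raters (columns) into categories 0/1/2.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
    [1, 1, 1],
    [2, 0, 2],
])

# Fleiss' Kappa expects item-by-category counts, not raw labels.
counts, _categories = aggregate_raters(ratings)
print("Fleiss' Kappa:", fleiss_kappa(counts))

# Krippendorff's Alpha expects one row per rater (items as columns).
print("Krippendorff's Alpha:",
      krippendorff.alpha(reliability_data=ratings.T,
                         level_of_measurement="nominal"))
```

For the weighted two-rater case, scikit-learn's cohen_kappa_score accepts a weights='linear' or weights='quadratic' argument.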
The details of the calculation help explain some of the limitations of Kappa. Kappa is the difference between the observed rate of agreement and the expected rate of agreement, divided by one minus the expected agreement: Kappa = (observed agreement - expected agreement) / (1 - expected agreement). In other words, it is the agreement achieved beyond random chance, expressed as a share of the agreement available beyond chance. The expected agreement is derived from each rater's marginal totals in a contingency table, the same kind of table used to conduct a chi-squared test. This formula calculates Kappa for two raters (Cohen's Kappa); Fleiss' Kappa is similar but includes additional steps to account for multiple raters.
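To make that concrete, here is a hand-rolled sketch of the two-rater calculation on an invented 2x2 contingency table: observed agreement comes straight off the diagonal, and expected agreement comes from the row and column marginals, just like the expected cell counts in a chi-squared test.

```python
import numpy as np

# Hypothetical contingency table: rows are rater A's calls, columns are rater B's.
#                 B: yes  B: no
table = np.array([[20,     5],    # A: yes
                  [10,    65]])   # A: no

n = table.sum()

# Observed agreement: proportion of cases on the diagonal (both raters agree).
p_o = np.trace(table) / n

# Expected agreement: the agreement chance alone would produce, from the marginals.
row_marginals = table.sum(axis=1) / n   # rater A's overall yes/no rates
col_marginals = table.sum(axis=0) / n   # rater B's overall yes/no rates
p_e = np.sum(row_marginals * col_marginals)

# Kappa: agreement beyond chance, as a share of the agreement available beyond chance.
kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, Kappa = {kappa:.3f}")
```

With these made-up counts, the observed agreement is 0.85 and the expected agreement is 0.60, giving a Kappa of about 0.63, just past the usual 0.60 bar.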
A couple of notes of caution. Very rare observations can mess with Kappa, especially Fleiss'. For example, if we wanted to see how reliably radiologists positively identify a rare disease, the rate of agreement is typically very high as a percentage of total observations, because the vast majority of the time they do not see anything of concern. Since expected agreement is calculated from each rater's marginals, heavily lopsided ratings push the expected agreement % up toward the observed agreement, and even a handful of disagreements can cause Kappa to plummet.
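Here is that failure mode in a quick sketch with invented screening results: two radiologists agree on 98 of 100 scans, almost all of them negative, yet Kappa drops well below the 0.60 threshold because the expected agreement is already sky-high.

```python
import numpy as np

# Hypothetical rare-disease screening: rows are radiologist A, columns radiologist B.
#                 B: positive  B: negative
table = np.array([[ 1,           1],    # A: positive
                  [ 1,          97]])   # A: negative

n = table.sum()
p_o = np.trace(table) / n                                        # 0.98 raw agreement
p_e = np.sum((table.sum(axis=1) / n) * (table.sum(axis=0) / n))  # ~0.96 expected
kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, Kappa = {kappa:.3f}")
```

Despite 98% raw agreement, the expected agreement is about 0.96, so Kappa comes out near 0.49.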
There are some great examples of this from when IBM's Watson AI attempted to diagnose rare diseases. In this situation, it is helpful to check the expected agreement %: if it is very high, it might make sense to use a different measure. Alpha is a good alternative because it is built on expected and observed disagreement; for inter-rater assessments, Krippendorff's Alpha would be the one to use.
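Continuing the hypothetical screening example, one way to run that sanity check is to print the expected agreement alongside Krippendorff's Alpha (again assuming the krippendorff package), so a sky-high expected agreement % is flagged before anyone reads too much into the Kappa value:

```python
import numpy as np
import krippendorff

# Same invented screening data as raw labels: one row per radiologist,
# one column per scan (1 = positive, 0 = negative).
rater_a = np.array([1, 1, 0] + [0] * 97)   # radiologist A flags scans 1 and 2
rater_b = np.array([1, 0, 1] + [0] * 97)   # radiologist B flags scans 1 and 3

# Expected agreement from each rater's marginal rates of 0s and 1s.
p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in (0, 1))
print(f"Expected agreement: {p_e:.3f}")   # very high here, so Kappa is fragile

# Krippendorff's Alpha on the same data, for comparison.
alpha = krippendorff.alpha(reliability_data=[rater_a, rater_b],
                           level_of_measurement="nominal")
print(f"Krippendorff's Alpha: {alpha:.3f}")
```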