One of the most popular metrics for evaluating a binary classifier is the F1 score, along with its variants. Technically, the F1 score is defined as the harmonic mean of precision and recall. However, I often wondered what that really means. The definition fails to explain:
- Why only precision and recall?
- Why harmonic mean?
- Why is it called F1?
Below is an attempt to answer the first two questions.
1. Why only precision and recall?
Often, when dealing with a classification problem, we are interested in the positive cases and in our trained model’s ability to detect them. In other words, we are interested in true positives (TP). However, the number of true positives by itself is meaningless. For instance, if I say there are 50 TP, that tells us nothing on its own. We need a reference point to make the count meaningful. For instance, we can say that the model detects 50 of 200 positive cases, or 25% of the actual positive cases.
However, there is a challenge: what should the normalizing constant be? There are two alternatives: one is the number of actual positives in our dataset, and the other is the number of predicted positives. Using the number of actual positives as the normalizing constant gives the proportion of true positives from the data’s perspective. This is called recall (also known as the true positive rate). In confusion-matrix notation, where FN is the number of false negatives, it can be written as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
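In contrast, using the number of predicted positives as the normalizing constant gives the proportion of true positives from the model’s perspective. This is termed precision. With FP denoting the number of false positives, it can be written as:
$$\text{Precision} = \frac{TP}{TP + FP}$$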
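To make the two normalizations concrete, here is a minimal Python sketch that computes both ratios from confusion-matrix counts. The function name `precision_recall` and the count of 30 false positives are assumptions made for illustration; only the 50-of-200 figure comes from the example above.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    predicted_positives = tp + fp  # everything the model flagged as positive
    actual_positives = tp + fn     # everything that is truly positive
    # Returning 0.0 for an empty denominator is a convention, not a rule.
    precision = tp / predicted_positives if predicted_positives else 0.0
    recall = tp / actual_positives if actual_positives else 0.0
    return precision, recall


# 50 true positives out of 200 actual positives, so 150 false negatives.
# The 30 false positives are a made-up number for illustration.
precision, recall = precision_recall(tp=50, fp=30, fn=150)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints: precision = 0.625, recall = 0.250
```

The same 50 true positives look quite different depending on the reference point: 62.5% of the model's positive predictions are correct, but only 25% of the actual positives are found.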
2. Why harmonic mean?
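A good metric usually has three qualities: it should be real-valued, bounded, and monotonically increasing with the quality of the model. However, above we have two different metrics for evaluating a model’s ability to detect positive cases. The challenge is then how to combine the two into a single numerical metric. One solution is to use an average. However, there are several kinds of averages, such as the arithmetic, geometric, and harmonic means. When dealing with ratios, as in our case, the harmonic mean tends to make more sense: precision and recall share the same numerator (TP) and differ only in their denominators, and the harmonic mean is pulled toward the smaller of the two values, so a model cannot score well by being good at only one of them. Hence, the F1 score is defined as the harmonic mean of precision and recall and can be written as:
$$F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$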
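A small Python sketch can show how the three averages behave on the same precision/recall pairs. The helper names and the example pairs are assumptions chosen for illustration, not values from any real model.

```python
import math


def arithmetic_mean(p, r):
    return (p + r) / 2


def geometric_mean(p, r):
    return math.sqrt(p * r)


def harmonic_mean(p, r):
    # Defined as 0.0 when both inputs are 0 to avoid division by zero
    # (a convention assumed here).
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical (precision, recall) pairs: balanced, lopsided, very lopsided.
for p, r in [(0.5, 0.5), (0.9, 0.1), (1.0, 0.01)]:
    print(
        f"P={p:.2f} R={r:.2f}  "
        f"arithmetic={arithmetic_mean(p, r):.3f}  "
        f"geometric={geometric_mean(p, r):.3f}  "
        f"harmonic={harmonic_mean(p, r):.3f}"
    )
```

When precision and recall are equal, all three averages agree. As the pair becomes more lopsided, the arithmetic mean stays close to 0.5 while the harmonic mean drops toward the smaller value, which is the behavior we want from a metric that should reward doing well on both.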
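3. Why is the metric known as F1?
The name consists of two parts: the letter “F” and the number 1. I am not sure why the letter “F” is used for the metric. I also don’t know exactly how or why the 1 is used, but it makes sense if we look at the general form of the F score:
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
With β = 1, this reduces to the harmonic mean of precision and recall, that is, the F1 score.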
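As a rough illustration, the sketch below assumes the commonly used weighted definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), and shows that beta = 1 gives the plain harmonic mean. The function name `f_beta` and the input values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """General F score, assuming the common weighted harmonic-mean form."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    # Returning 0.0 when the denominator is 0 is a convention assumed here.
    return (1 + b2) * precision * recall / denominator if denominator else 0.0


# Hypothetical values: precision is higher than recall.
p, r = 0.625, 0.25
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # weights precision more
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")  # plain harmonic mean
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}")  # weights recall more
```

With beta below 1 the score leans toward precision, and with beta above 1 it leans toward recall, so here F0.5 comes out highest and F2 lowest.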
I still don’t understand how the general form of the F score was derived. That is something still to explore.