A confusion matrix is one of many ways to analyze the accuracy of a classification model. As shown in the table below, it is a two-dimensional table: one axis holds the actual (target) categories and the other holds the predicted categories. Diagonal cells count the test cases that the model predicted correctly. For instance, in the table below the model correctly predicted 2 out of 11 (or 18%) actual A’s as A. Off-diagonal cells count misclassifications, i.e. test cases that the model incorrectly predicted to belong to a different category. From the point of view of a single category, these are its false negatives (along its actual axis) and its false positives (along its predicted axis).
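To make the diagonal reading concrete, here is a minimal sketch in base R. The labels are made-up toy data, not the table from this post, and the variable names are only illustrative. Rows are actual categories and columns are predicted ones.

```r
# Toy labels, invented purely for illustration
actual    <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"), levels = c("A", "B", "C"))
predicted <- factor(c("A", "B", "C", "B", "B", "C", "A", "C"), levels = c("A", "B", "C"))

# Rows are actual categories, columns are predicted categories
cm <- table(Actual = actual, Predicted = predicted)

# Diagonal cells: test cases whose predicted category matches the actual one
correct <- diag(cm)

# Share of each actual category that was predicted correctly, in percent
per_class <- correct / rowSums(cm) * 100

# Overall accuracy: correct predictions over all test cases, in percent
overall <- sum(correct) / sum(cm) * 100

print(cm)
print(round(per_class, 1))
print(round(overall, 1))
```

In this toy example only 1 of the 3 actual A's lands on the diagonal, so the per-category figure for A is about 33%, while both B's are predicted correctly.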
While the above confusion matrix is insightful, it only works when you have a few categories. While working on a problem I had more than 20 categories, and scanning a table full of numbers and making sense of them was an arduous task. So I started looking for a way to visualize the confusion matrix. After exploring possible visualization techniques, I came up with the idea of using a heatmap. Luckily, it was easy to produce a heatmap in R using the excellent ggplot2 package. (If you haven’t played with ggplot, try it right now. It’s great!)
Below is the final output. In the figures below, color indicates the percentage of test cases in each cell, computed relative to the number of test cases in that cell's actual category. The diagonal cells are highlighted with a heavier black border. Additionally, the two variants of the image display the actual percentage value as text. If you like, you can replace the percentage text with the actual frequency.
library(ggplot2)

# generate random data
data = data.frame(sample(LETTERS[1:20], 100, replace=TRUE), sample(LETTERS[1:20], 100, replace=TRUE))
names(data) = c("Actual", "Predicted")

# compute frequency of actual categories
actual = as.data.frame(table(data$Actual))
names(actual) = c("Actual", "ActualFreq")

# build confusion matrix
confusion = as.data.frame(table(data$Actual, data$Predicted))
names(confusion) = c("Actual", "Predicted", "Freq")

# calculate percentage of test cases based on actual frequency
confusion = merge(confusion, actual, by=c("Actual"))
confusion$Percent = confusion$Freq/confusion$ActualFreq*100

# render plot
# we use three different layers
# first we draw tiles and fill color based on percentage of test cases
# (note: ggplot2 3.4.0 and later prefer linewidth over size for tile borders)
tile <- ggplot() +
  geom_tile(aes(x=Actual, y=Predicted, fill=Percent), data=confusion, color="black", size=0.1) +
  labs(x="Actual", y="Predicted")

# next we print the percentage as text in each tile and set the color scale
tile = tile +
  geom_text(aes(x=Actual, y=Predicted, label=sprintf("%.1f", Percent)), data=confusion, size=3, colour="black") +
  scale_fill_gradient(low="grey", high="red")

# lastly we draw diagonal tiles. We use alpha = 0 so as not to hide previous layers but use size=0.3 to highlight border
tile = tile +
  geom_tile(aes(x=Actual, y=Predicted), data=subset(confusion, as.character(Actual)==as.character(Predicted)), color="black", size=0.3, fill="black", alpha=0)

# render
tile
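The frequency variant mentioned earlier can be sketched as follows. This is only one way to do it, under a few assumptions of mine: the data are random toy labels over five categories, the seed is arbitrary, and `freq_plot` is a name I made up. The fill color still follows the percentage, while the text layer prints the raw count.

```r
library(ggplot2)

# Toy data generated as in the post, but with fewer categories (arbitrary seed)
set.seed(1)
data <- data.frame(Actual = sample(LETTERS[1:5], 100, replace = TRUE),
                   Predicted = sample(LETTERS[1:5], 100, replace = TRUE))

# Long-format confusion matrix with per-cell counts
confusion <- as.data.frame(table(data$Actual, data$Predicted))
names(confusion) <- c("Actual", "Predicted", "Freq")

# Percentage within each actual category, still used for the fill color
confusion$Percent <- confusion$Freq / ave(confusion$Freq, confusion$Actual, FUN = sum) * 100

# Same tile layer as before, but the text layer shows the count instead of the percentage
freq_plot <- ggplot(confusion, aes(x = Actual, y = Predicted)) +
  geom_tile(aes(fill = Percent), color = "black") +
  geom_text(aes(label = Freq), size = 3, colour = "black") +
  scale_fill_gradient(low = "grey", high = "red")

print(freq_plot)
```

Using `ave()` here replaces the separate frequency table and `merge()` step; either approach gives the same per-category percentages.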