Machine learning is a complex, iterative design and development process aimed at creating a learned model that generalizes to new data inputs. Model evaluation, which involves testing and analyzing a model's performance on held-out test datasets with known labels, is an important step in this process. Given the breadth of machine learning applications today, interactive data visualization has proven to be a useful tool for helping humans understand model performance.
The confusion matrix is a tabular layout that contrasts the predicted class label with the actual class label for every instance in a dataset. It is a common visualization for model evaluation, especially for classification models. In the typical layout, the rows of the confusion matrix indicate the actual class labels, while the columns represent the predicted class labels (equivalently, these can be reversed by transposing the matrix).
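The conventional layout described above can be sketched in a few lines. This is a minimal illustration using NumPy with made-up labels; NEO itself is a web system and does not use this code:

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Count (actual, predicted) label pairs: rows are actual
    classes, columns are predicted classes."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        m[a, p] += 1
    return m

# Toy 3-class example: class 2 is sometimes confused with class 0.
actual    = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 1, 2, 0, 0]
cm = confusion_matrix(actual, predicted, n_classes=3)

# The transposed layout (rows = predicted, columns = actual) is cm.T.
```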
Many machine learning courses teach these visualizations, and practitioners use them to show which pairs of classes a model confuses. In short, confusion matrices are the de facto standard visualization for evaluating classification models.
Despite their widespread use, traditional confusion matrices have a number of usability issues. They convey only a visual approximation of accuracy (i.e., the entries on the diagonal of the matrix), which is insufficient for many evaluations. Moreover, the diagonal of a confusion matrix often contains many more instances (by orders of magnitude) than the off-diagonal entries, masking critical confusions (the off-diagonal entries).
As practitioners improve a model, the net effect is to shift instances from off-diagonal cells onto the diagonal, worsening this concealment of confusions. Counterintuitively, the better a model is optimized, the harder its remaining errors are to discover. Confusion matrices also suffer from scaling issues when a dataset has many classes, high class imbalance, hierarchical label structure, or multiple outputs.
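A small sketch of why this matters: with imbalanced classes, raw counts make a model look near-perfect, while row normalization (dividing each row by its sum) exposes the hidden confusion. The counts here are illustrative, not from the paper:

```python
import numpy as np

# Imbalanced 2-class example: the diagonal dominates by orders of
# magnitude, so the raw matrix looks clean at a glance.
cm = np.array([[9900, 100],   # actual class 0
               [  40,  60]])  # actual class 1

accuracy = cm.trace() / cm.sum()  # ~0.986, looks excellent

# Row normalization conditions on the actual class, revealing that
# 40% of class-1 instances are misclassified as class 0.
row_norm = cm / cm.sum(axis=1, keepdims=True)
```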
Apple researchers conducted a formative research study with machine learning practitioners to better understand the limitations of confusion matrices. The team found that, in many machine learning applications, confusion matrices are difficult to use at scale, do not provide other metrics that practitioners need (e.g., precision, recall), and are difficult to share. Moreover, confusion matrices support only flat, single-label data; hierarchical and multi-output labels, which are more sophisticated but still common, are not supported.
Based on the results of this formative research and a literature review, the team developed a confusion matrix algebra that represents confusion matrices as probability distributions and offers a unified solution to the shortcomings of conventional confusion matrices. Building on this algebra, they designed and built NEO, a visual analytics system that allows practitioners to create and interact with confusion matrices in a variety of configurations, with increasingly complex label structures.
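The core idea, viewing a confusion matrix as a probability distribution so that operations like collapsing a label hierarchy become marginalization, can be roughly sketched as follows. The toy classes and counts are hypothetical, and the paper defines the algebra formally:

```python
import numpy as np

# Treat the confusion matrix as an (unnormalized) joint distribution
# P(actual, predicted). Hypothetical classes: 0=cat, 1=dog, 2=car,
# 3=truck, with a two-level hierarchy: animal={cat,dog},
# vehicle={car,truck}.
cm = np.array([[50,  8,  1,  1],
               [ 6, 60,  0,  2],
               [ 0,  1, 70,  9],
               [ 1,  0, 12, 55]])
joint = cm / cm.sum()  # normalize counts into probabilities

# Collapsing sub-classes into their parents is marginalization:
# sum the rows and columns belonging to each parent class.
groups = [[0, 1], [2, 3]]  # animal, vehicle
parent = np.array([[joint[np.ix_(r, c)].sum() for c in groups]
                   for r in groups])
# parent is now a 2x2 confusion matrix over {animal, vehicle}.
```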
NEO’s design extends the confusion matrix by allowing users to see additional metrics for analytical context, interactively analyze model confusions through alternative normalization methods, visualize hierarchical and multi-output labels, and easily share confusion matrix configurations with others. NEO uses a traditional confusion matrix as the basis for its display, keeping the recognizable format of confusion matrices intact.
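Metrics such as per-class precision and recall can be derived directly from the matrix itself. A minimal sketch with illustrative counts (not NEO's implementation):

```python
import numpy as np

cm = np.array([[85, 10,  5],
               [ 9, 78,  3],
               [ 4,  6, 90]])

# Per-class recall: diagonal over row sums
# (conditioning on the actual class).
recall = cm.diagonal() / cm.sum(axis=1)

# Per-class precision: diagonal over column sums
# (conditioning on the predicted class).
precision = cm.diagonal() / cm.sum(axis=0)
```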
The main contributions of this article are:
• Formative research from a survey of machine learning practitioners at Apple on how confusion matrices and model evaluation visualizations are used in practice, including common problems and analysis tasks.
• A confusion matrix algebra, based on probability distributions, that generalizes and models confusion matrices.
• NEO, a visual analytics system for building and interacting with confusion matrices that supports hierarchical and multi-output labels. NEO also offers a declarative specification (or “spec”) that lets users share specific visualizations with others. The two stay in sync: publishing a spec updates the visualization, and interacting with the visualization updates the spec.
• Three model evaluation scenarios, covering object detection, large-scale image classification, and multi-output online toxicity detection, that show how NEO helps practitioners evaluate machine learning models across different modeling domains and tasks.
Unlike many machine learning visualizations, confusion matrices have a standardized, widely expected representation. Rather than redesigning the confusion matrix visualization, NEO’s key design goal was to leverage this familiarity and enhance its functionality with additional views and interactivity. In the simplest case, where a practitioner has a classification model and a dataset whose instances have a single, non-hierarchical class label, NEO displays a standard confusion matrix. Even under these circumstances, however, there is still room to improve model evaluation through interaction.
NEO is a modern web system built with Svelte, TypeScript, and D3. Its specification is implemented in a portable JSON format so confusion matrix configurations can easily be shared with other stakeholders. In terms of system scalability, NEO is limited by standard SVG rendering limits in the browser (e.g., displaying tens of thousands of SVG elements). Engineering efforts such as rendering with Canvas or WebGL would remove this constraint, although the researchers believe that better interactions for configuring confusion matrices to compare relevant classes and sub-matrices are more useful to practitioners than rendering the largest matrix possible.
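The article does not show the actual spec format, so purely as a hypothetical illustration (all field names invented here), a shareable configuration might resemble:

```json
{
  "classes": ["animal", "animal/cat", "animal/dog", "vehicle"],
  "normalization": "row",
  "measures": ["recall", "precision"],
  "collapsed": ["animal"]
}
```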
This article generalizes the capabilities of confusion matrices while preserving their familiar representation. Through formative research, the team developed an algebra that represents confusion matrices as probability distributions and expresses more variations of confusion matrices, such as datasets with hierarchical and multi-output labels. The team used this algebra to create NEO, a visual analytics system that allows practitioners to build, interact with, and share confusion matrices in a variety of ways. Finally, the researchers demonstrate NEO’s utility through three model evaluation scenarios, helping humans better understand model performance and uncover hidden confusions.