What is the best way to measure the performance of a classifier on multi-class data? I've already tried several approaches in my research. For example, I've used macro-averaged and micro-averaged ROC curve analysis for balanced and imbalanced multi-class data, respectively. Displaying the confusion matrix is another simple way to show performance, but it can get messy: in one study we had 109 classes, which gives a 109 × 109 confusion matrix, far too large to read. I've also computed class-wise metrics (precision, recall, specificity, false-positive rate, F-score, etc.) and then taken the plain or support-weighted average of each metric across classes. Are these methods enough to measure the performance of a classifier on multi-class data, or are there more informative ways to do it?
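To make the class-wise approach concrete, here is a minimal sketch of what I mean, assuming a square confusion matrix whose rows are true classes and whose columns are predicted classes. The function names `per_class_metrics` and `average` and the 3-class toy matrix are illustrative only and are not taken from my actual experiments.

```python
def per_class_metrics(cm):
    """Compute one-vs-rest metrics for each class.

    Assumption: cm[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true k, predicted as something else
        fp = sum(cm[i][k] for i in range(n)) - tp    # predicted k, truly something else
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({
            "support": tp + fn,          # number of true samples of class k
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "fpr": 1.0 - specificity,    # false-positive rate
            "f1": f1,
        })
    return rows


def average(rows, key, weighted=False):
    """Macro average by default; support-weighted average if weighted=True."""
    if weighted:
        total_support = sum(r["support"] for r in rows)
        return sum(r[key] * r["support"] for r in rows) / total_support
    return sum(r[key] for r in rows) / len(rows)


# Hypothetical 3-class confusion matrix (toy data for illustration).
cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 4],
]
rows = per_class_metrics(cm)

for key in ("precision", "recall", "specificity", "f1"):
    print(key,
          "macro:", round(average(rows, key), 4),
          "weighted:", round(average(rows, key, weighted=True), 4))
```

One thing this makes visible: the support-weighted average of recall is the same number as overall accuracy, so it is dominated by the large classes, while the macro average treats every class equally regardless of size.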