Generally, if there aren't too many categories, I would recommend constructing an N x N contingency table. Compressing performance into a single metric isn't easy; however, some applicable metrics can be found in the following article written by Pierre Baldi in 2000 (see the section "More than two classes"):