Roughly and succinctly speaking (depending, of course, on the test, training, and validation scheme you employ), something like the std. dev. can serve as a rudimentary measure of classifier stability. If you are using stratified k-fold cross-validation (depending on how it is implemented), the mean *could* represent the mean of the aggregated correct-classification results over the k models (there are other ways to interpret it, of course), and the std. dev. would then indicate how those results varied across the different models. If you are not interested in outright classification performance but merely want to compare the stability of two different learners (again, in a rather rudimentary fashion), you could use something like the relative std. dev. = std. dev. / correct-classification percentage.
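As a minimal sketch of what I mean (assuming scikit-learn, with a logistic-regression classifier and the built-in breast-cancer dataset purely as placeholders), you could compute the per-fold accuracies under stratified k-fold CV and then summarise them with the mean, std. dev., and relative std. dev.:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data and learner -- swap in your own.
X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# One accuracy score per fold (k = 10 models trained on different splits).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

mean_acc = scores.mean()        # mean correct-classification rate over the k folds
std_acc = scores.std(ddof=1)    # spread of that rate across the k models
rel_std = std_acc / mean_acc    # relative std. dev. (coefficient of variation)

print(f"mean accuracy  = {mean_acc:.3f}")
print(f"std. dev.      = {std_acc:.3f}")
print(f"relative std.  = {rel_std:.3f}")
```

Comparing `rel_std` between two learners run through the same CV scheme gives the rudimentary stability comparison described above; a smaller value means the per-fold accuracies varied less relative to their mean.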
How meaningful the mean, std. dev., and any derived metric actually are in this context, however, will depend largely on the type of test, training, and validation scheme you employ.