I am working on an event detection task for speech. We evaluated our method in terms of recall, precision, and F-score. When we then calculated the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, we got a surprise: for an F-score of 0.49 (precision: 0.39, recall: 0.66), the AUC was 0.88. For another method, which gave an F-score of 0.6, the AUC was 0.94.
Such high AUC values seem suspicious given the relatively low F-scores. We checked our code, but everything looks correct. Does anybody have experience with these measures? Is this a reasonable relationship between the F-score and the AUC?
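
To make the numbers concrete, here is a minimal sketch of how the reported F-score follows from precision and recall, and of what false positive rate that operating point would imply on the ROC curve. The function names and the prevalence values (the fraction of frames or segments that are true events) are hypothetical and not measured from our data; they are only there to show how strongly the picture depends on class balance.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


def implied_fpr(precision, recall, prevalence):
    """False positive rate implied by a precision/recall pair.

    prevalence is the fraction of positive examples; it is an
    assumed value here, not something reported above.
    """
    tp = recall * prevalence               # true positives per example
    fp = tp * (1 - precision) / precision  # false positives per example
    return fp / (1 - prevalence)           # divide by the negative fraction


print("F-score:", round(f_score(0.39, 0.66), 3))

# Hypothetical prevalences, chosen only for illustration.
for prevalence in (0.01, 0.05, 0.10, 0.20):
    fpr = implied_fpr(0.39, 0.66, prevalence)
    print(f"prevalence={prevalence:.2f} -> TPR=0.66, FPR={fpr:.3f}")
```

If events are rare, the same precision and recall correspond to a small false positive rate, which is the quantity the ROC curve actually plots against recall.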