Precision-based ranked retrieval evaluation metrics from information retrieval (IR), such as Precision@k (P@k), Average Precision@k (AP@k), and Mean Average Precision@k (MAP@k), use only oracle- or user-assessed relevance judgments and completely discard the system-generated relevancy scores. However, it is the system-generated scores that rank the retrieval output in the first place.
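For illustration, here is a minimal Python sketch of P@k and one common formulation of AP@k (normalised by the number of relevant items in the top k); the only input is a list of binary relevance judgments, so the system-generated scores never enter the computation:

```python
# Minimal sketch, assuming `rels` is the list of oracle-assessed binary relevance
# labels (1 = relevant, 0 = irrelevant) in the order ranked by the system.

def precision_at_k(rels, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(rels[:k]) / k

def average_precision_at_k(rels, k):
    """Mean of P@i over the relevant positions i <= k (0.0 if none are relevant)."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels[:k], start=1):
        if r:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# MAP@k would simply average AP@k over a set of queries;
# nowhere do the system-generated similarity scores appear.
```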

Is it right to discard the system-generated scores (known as similarity scores in case-based reasoning (CBR)) in evaluation metrics?

Let's consider two variants (A and B) of a CBR system with identical case representation, case base, and retrieved cases. The two variants differ only in the system-generated scores on which the retrieval ranking is based. Say the oracle-assessed relevancy scores and system-generated relevancy scores for the top 3 ranks of A and B are:

  • A: oracle-assessed relevancy (0.9, 0.8, 0.7) and system-generated relevancy (0.9, 0.8, 0.7)
  • B: oracle-assessed relevancy (0.9, 0.8, 0.7) and system-generated relevancy (0.5, 0.3, 0.2)

Note: By design, the metrics (P@k, AP@k, and MAP@k) operate only on binary relevance, so the oracle-assessed relevancies for both A and B reduce to (1, 1, 1) in this example, where 1 means relevant and 0 irrelevant.
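Plugging those binary labels into a quick sketch (hypothetical code, same assumptions as above) shows that the metrics cannot separate A from B:

```python
rels_A = [1, 1, 1]   # A: system scores 0.9, 0.8, 0.7 are never seen by the metric
rels_B = [1, 1, 1]   # B: system scores 0.5, 0.3, 0.2 are never seen either

def p_at_k(rels, k):
    return sum(rels[:k]) / k

def ap_at_k(rels, k):
    rel_ranks = [i for i in range(1, k + 1) if rels[i - 1]]
    return sum(p_at_k(rels, i) for i in rel_ranks) / len(rel_ranks) if rel_ranks else 0.0

for name, rels in (("A", rels_A), ("B", rels_B)):
    print(name, p_at_k(rels, 3), ap_at_k(rels, 3))
# Both systems get P@3 = 1.0 and AP@3 = 1.0: indistinguishable to these metrics.
```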

Question:

  • Now, which CBR system is fairer or more reliable?
  • Should we choose system A or B?
