In 2024, several new benchmarks have been introduced to evaluate Retrieval-Augmented Generation (RAG) systems across various domains and tasks:
RAGBench: This benchmark offers a comprehensive dataset of 100,000 examples spanning five industry-specific domains. It introduces the TRACe evaluation framework, which provides explainable and actionable metrics for assessing RAG systems (arXiv); a sketch of TRACe-style metrics appears after this list.
MultiHop-RAG: Designed to assess a RAG system's ability to handle multi-hop queries, which require retrieving and reasoning over multiple pieces of evidence. The dataset includes a knowledge base and a large collection of multi-hop queries with their corresponding answers and supporting evidence (arXiv); see the evidence-recall sketch after this list.
LegalBench-RAG: Focused on the legal domain, this benchmark evaluates how precisely retrieval mechanisms can locate the exact legal references that support an answer, offering a finer-grained measure of retrieval quality in legal contexts (arXiv); a span-level scoring sketch follows the list.
CRUD-RAG: A comprehensive Chinese-language benchmark that categorizes RAG applications into four types (Create, Read, Update, and Delete) and provides datasets for each category, allowing RAG systems to be evaluated across diverse application scenarios (arXiv).
Ragnarök: A reusable RAG framework that provides baselines for the TREC 2024 Retrieval-Augmented Generation Track. It aims to standardize the evaluation of RAG systems and includes a web-based interface for interactive benchmarking (arXiv).
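
To make the TRACe metrics concrete, here is a minimal sketch of how per-example scores in that spirit could be computed. It assumes sentence-level annotations marking which context sentences are relevant and which the answer actually used; the data structures and function names are illustrative assumptions, not the official RAGBench/TRACe implementation.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedExample:
    """Illustrative stand-in for a RAGBench-style annotated example (hypothetical schema)."""
    context_sentences: list[str]
    relevant: set[int]              # indices of context sentences judged relevant to the query
    utilized: set[int]              # indices of context sentences the response actually drew on
    response_supported: list[bool]  # per response sentence: is it grounded in the context?

def trace_style_scores(ex: AnnotatedExample) -> dict[str, float]:
    """Compute TRACe-inspired metrics for one example (an approximation, not the official code)."""
    n_ctx = len(ex.context_sentences)
    relevance = len(ex.relevant) / n_ctx if n_ctx else 0.0    # how much of the context is relevant
    utilization = len(ex.utilized) / n_ctx if n_ctx else 0.0  # how much of the context the answer used
    completeness = (len(ex.relevant & ex.utilized) / len(ex.relevant)
                    if ex.relevant else 1.0)                   # relevant context reflected in the answer
    adherence = 1.0 if all(ex.response_supported) else 0.0     # every answer sentence is grounded
    return {"relevance": relevance, "utilization": utilization,
            "completeness": completeness, "adherence": adherence}
```

For instance, an answer that draws on only one of three relevant context sentences would score completeness of roughly 0.33, even if everything it does say is grounded (adherence 1.0).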
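For multi-hop evaluation of the kind MultiHop-RAG targets, a common retrieval-side check is whether all gold evidence pieces appear among the top-k retrieved chunks. The sketch below shows that check under an assumed data layout (gold evidence chunk IDs per query); it is not the benchmark's official harness.

```python
def evidence_recall_at_k(
    retrieved_ids: dict[str, list[str]],   # query id -> ranked list of retrieved chunk ids
    gold_evidence: dict[str, set[str]],    # query id -> chunk ids required to answer the query
    k: int = 10,
) -> float:
    """Fraction of queries whose full evidence set is covered by the top-k retrieved chunks."""
    hits = 0
    for qid, gold in gold_evidence.items():
        top_k = set(retrieved_ids.get(qid, [])[:k])
        if gold and gold <= top_k:  # multi-hop: *all* evidence pieces must be present
            hits += 1
    return hits / len(gold_evidence) if gold_evidence else 0.0
```

Requiring the full evidence set, rather than any single supporting chunk, is what separates a multi-hop retrieval check from a standard single-hop recall metric.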
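LegalBench-RAG scores retrieval at the level of precise text spans rather than whole documents. The sketch below computes character-level precision and recall between retrieved snippets and gold reference spans; the span representation is an assumption for illustration, not the benchmark's released code.

```python
def span_precision_recall(
    retrieved_spans: list[tuple[int, int]],  # (start, end) character offsets of retrieved snippets
    gold_spans: list[tuple[int, int]],       # (start, end) offsets of the gold legal references
) -> tuple[float, float]:
    """Character-level precision/recall of retrieved text against gold reference spans."""
    retrieved_chars: set[int] = set()
    for start, end in retrieved_spans:
        retrieved_chars.update(range(start, end))
    gold_chars: set[int] = set()
    for start, end in gold_spans:
        gold_chars.update(range(start, end))
    overlap = len(retrieved_chars & gold_chars)
    precision = overlap / len(retrieved_chars) if retrieved_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    return precision, recall
```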