My RAG's faithfulness was 0.67. 1 in 3 answers were still wrong.
A developer evaluating an on-premise Retrieval-Augmented Generation (RAG) system found that a faithfulness score of 0.67 masked a significant issue: one-third of the answers were factually incorrect even though they were grounded in the retrieved context. Faithfulness only checks whether an answer's claims are supported by the retrieved context, not whether that context is complete or the answer is right. Adding a reranker improved precision but did not fix low context recall, which the developer identified as the primary bottleneck. The developer concluded that faithfulness alone is an insufficient metric and advocated evaluating answer correctness and context recall together to verify system accuracy.
IMPACT Highlights the limitations of standard RAG evaluation metrics, suggesting a need for more robust correctness checks to prevent deploying inaccurate AI systems.
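
To make the gap between the three metrics concrete, here is a minimal Python sketch, not the developer's actual evaluation pipeline. It assumes each answer, retrieved context, and reference answer has already been broken into atomic claims (in practice an LLM judge usually does this), and it uses simplified set-overlap definitions. The `Sample` class, the function names, and the toy claims are all hypothetical and carry no data from the original post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One evaluated question, reduced to sets of atomic claims (toy data)."""
    answer_claims: frozenset[str]     # claims made by the generated answer
    context_claims: frozenset[str]    # claims present in the retrieved context
    reference_claims: frozenset[str]  # claims in the ground-truth answer


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def faithfulness(s: Sample) -> float:
    """Share of answer claims that the retrieved context supports."""
    return _ratio(len(s.answer_claims & s.context_claims), len(s.answer_claims))


def context_recall(s: Sample) -> float:
    """Share of reference claims that retrieval actually surfaced."""
    return _ratio(len(s.reference_claims & s.context_claims), len(s.reference_claims))


def answer_correctness(s: Sample) -> float:
    """Simplified correctness: F1 overlap between answer and reference claims."""
    true_positives = len(s.answer_claims & s.reference_claims)
    precision = _ratio(true_positives, len(s.answer_claims))
    recall = _ratio(true_positives, len(s.reference_claims))
    return _ratio(2 * precision * recall, precision + recall)


# Toy case: the answer repeats only what was retrieved, but retrieval
# missed two of the three facts the reference answer needs.
sample = Sample(
    answer_claims=frozenset({"claim_a", "claim_b"}),
    context_claims=frozenset({"claim_a", "claim_b", "claim_x"}),
    reference_claims=frozenset({"claim_a", "claim_c", "claim_d"}),
)

print(f"faithfulness:       {faithfulness(sample):.2f}")        # 1.00
print(f"context recall:     {context_recall(sample):.2f}")      # 0.33
print(f"answer correctness: {answer_correctness(sample):.2f}")  # 0.40
```

In this toy case the answer is perfectly faithful yet mostly wrong, because the missing facts were never retrieved. That is the failure mode the summary describes: a reranker can reorder what was retrieved, but it cannot recover passages that retrieval never returned.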