Methodology for evaluating the quality of RAG system responses using an LLM as a judge
- Project
- 21016 DAIsy
- Type
- Enhancement
- Description
The methodology assesses RAG answer quality using LLM-as-a-judge prompts, focusing on groundedness and completeness as core reliability criteria. Answers are evaluated strictly against the retrieved context used for generation. Groundedness is checked at the sentence level and aggregated conservatively using a worst-sentence approach, including detection of incorrect cross-chunk combinations. Completeness measures whether all relevant context sentences are covered. A human-annotated meta-evaluation dataset validates alignment between LLM and human judgments.
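The scoring logic can be sketched as follows. This is a minimal illustration of the worst-sentence aggregation and coverage-based completeness described above, assuming the judge calls, function names, and binary scoring scale; it is not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sentence-level judge: returns True if `claim` is supported by / covered in `text`.
JudgeFn = Callable[[str, str], bool]

@dataclass
class AnswerEvaluation:
    groundedness: float  # 1.0 only if every answer sentence is supported (worst-sentence aggregation)
    completeness: float  # fraction of relevant context sentences covered by the answer

def evaluate_answer(answer_sentences: list[str],
                    relevant_context_sentences: list[str],
                    is_grounded: JudgeFn,
                    is_covered: JudgeFn) -> AnswerEvaluation:
    context = " ".join(relevant_context_sentences)
    answer = " ".join(answer_sentences)

    # Groundedness: judge each answer sentence against the retrieved context and
    # aggregate conservatively; a single unsupported sentence fails the answer.
    groundedness = 1.0 if all(is_grounded(s, context) for s in answer_sentences) else 0.0

    # Completeness: how many of the relevant context sentences (the relevance
    # selection step is not shown here) are reflected in the answer.
    covered = [is_covered(c, answer) for c in relevant_context_sentences]
    completeness = sum(covered) / len(covered) if covered else 1.0

    return AnswerEvaluation(groundedness, completeness)
```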
- Contact
- Martijn Krans
- martijn.krans@philips.com
- Research area(s)
- Generative AI and LLMs
- Technical features
The evaluation framework relies entirely on prompt-based LLM judges for groundedness and completeness with respect to the retrieved context. Specialized prompts detect unsupported statements, missing essential information, and contextually invalid combinations of instructions across retrieved chunks. The approach favors interpretability, flexibility, and domain adaptability, allowing rapid prompt refinement as chatbot behavior evolves.
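As an illustration of prompt-based judging, a minimal sketch is given below. The prompt wording, the JSON reply format, and the llm_complete callable are assumptions for illustration only, not the specialized prompts used in the framework.

```python
import json

# Illustrative groundedness-judge prompt; wording and reply format are placeholders.
GROUNDEDNESS_PROMPT = """You are checking a chatbot answer sentence against the retrieved context.

Context chunks:
{context}

Answer sentence:
{sentence}

Answer two questions:
1. Is the sentence fully supported by the context?
2. Does it combine instructions from different chunks in a way the context does not allow?

Reply as JSON: {{"supported": true/false, "invalid_combination": true/false}}"""

def judge_groundedness(sentence: str, chunks: list[str], llm_complete) -> dict:
    """Judge one answer sentence; llm_complete is any prompt-in, text-out LLM call."""
    prompt = GROUNDEDNESS_PROMPT.format(
        context="\n\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(chunks)),
        sentence=sentence,
    )
    return json.loads(llm_complete(prompt))
```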
- Integration constraints
Solutions that use LLMs
- Targeted customer(s)
Philips and any industry that develops chatbots
- Conditions for reuse
Originally intended for internal use; licensing to be considered
- Confidentiality
- Public
- Publication date
- 27-01-2026
- Involved partners
- Philips Electronics Nederland BV (NLD)