Methodology for evaluating the quality of RAG system responses using an LLM as a judge
- Project
- 21016 DAIsy
- Type
- Enhancement
- Description
The methodology assesses RAG answer quality using LLM-as-a-judge prompts, focusing on groundedness and completeness as core reliability criteria. Answers are evaluated strictly against the retrieved context used for generation. Groundedness is checked at the sentence level and aggregated conservatively using a worst-sentence approach, including detection of incorrect cross-chunk combinations. Completeness measures whether all relevant context sentences are covered. A human-annotated meta-evaluation dataset validates alignment between LLM and human judgments.
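The scoring logic can be sketched as follows. This is a minimal illustration of the worst-sentence aggregation and coverage-based completeness described above, assuming the judge calls, function names, and binary scoring scale; it is not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sentence-level judge: returns True if `claim` is supported by / covered in `text`.
JudgeFn = Callable[[str, str], bool]

@dataclass
class AnswerEvaluation:
    groundedness: float  # 1.0 only if every answer sentence is supported (worst-sentence aggregation)
    completeness: float  # fraction of relevant context sentences covered by the answer

def evaluate_answer(answer_sentences: list[str],
                    relevant_context_sentences: list[str],
                    is_grounded: JudgeFn,
                    is_covered: JudgeFn) -> AnswerEvaluation:
    context = " ".join(relevant_context_sentences)
    answer = " ".join(answer_sentences)

    # Groundedness: judge each answer sentence against the retrieved context and
    # aggregate conservatively; a single unsupported sentence fails the answer.
    groundedness = 1.0 if all(is_grounded(s, context) for s in answer_sentences) else 0.0

    # Completeness: how many of the relevant context sentences (the relevance
    # selection step is not shown here) are reflected in the answer.
    covered = [is_covered(c, answer) for c in relevant_context_sentences]
    completeness = sum(covered) / len(covered) if covered else 1.0

    return AnswerEvaluation(groundedness, completeness)
```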
- Contact
- Martijn Krans
- martijn.krans@philips.com
- Research area(s)
- Generative AI and LLMs
- Technical features
The evaluation framework relies entirely on prompt-based LLM judges for groundedness and completeness with respect to the retrieved context. Specialized prompts detect unsupported statements, missing essential information, and contextually invalid combinations of instructions across retrieved chunks. The approach favors interpretability, flexibility, and domain adaptability, allowing rapid prompt refinement as chatbot behavior evolves.
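As an illustration of prompt-based judging, a minimal sketch is given below. The prompt wording, the JSON reply format, and the llm_complete callable are assumptions for illustration only, not the specialized prompts used in the framework.

```python
import json

# Illustrative groundedness-judge prompt; wording and reply format are placeholders.
GROUNDEDNESS_PROMPT = """You are checking a chatbot answer sentence against the retrieved context.

Context chunks:
{context}

Answer sentence:
{sentence}

Answer two questions:
1. Is the sentence fully supported by the context?
2. Does it combine instructions from different chunks in a way the context does not allow?

Reply as JSON: {{"supported": true/false, "invalid_combination": true/false}}"""

def judge_groundedness(sentence: str, chunks: list[str], llm_complete) -> dict:
    """Judge one answer sentence; llm_complete is any prompt-in, text-out LLM call."""
    prompt = GROUNDEDNESS_PROMPT.format(
        context="\n\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(chunks)),
        sentence=sentence,
    )
    return json.loads(llm_complete(prompt))
```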
- Integration constraints
Solutions that use LLMs
- Targeted customer(s)
Philips and any industry that develops chatbots
- Conditions for reuse
Originally intended for internal use; licensing to be considered
- Confidentiality
- Public
- Publication date
- 27-01-2026
- Involved partners
- Philips Electronics Nederland BV (NLD)