
Methodology for evaluating the quality of RAG-system responses using an LLM as a judge

Project
21016 DAIsy
Type
Enhancement
Description

The methodology assesses RAG answer quality using LLM-as-a-judge prompts, focusing on groundedness and completeness as the core reliability criteria. Answers are evaluated strictly against the retrieved context used for generation. Groundedness is checked at the sentence level and aggregated conservatively with a worst-sentence approach; the check also detects incorrect combinations of information across retrieved chunks. Completeness measures whether the answer covers all relevant context sentences. A human-annotated meta-evaluation dataset validates the alignment between LLM and human judgments.
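A minimal sketch of this scoring logic is given below. The `judge` callable, prompt wording, and function names are illustrative assumptions, not the project's actual prompts or implementation.

```python
# Sketch only: `judge` is assumed to send a prompt to an LLM and
# return a score in [0, 1]. Prompt texts are placeholders.

def groundedness(answer_sentences: list[str], context: str, judge) -> float:
    """Sentence-level groundedness, aggregated conservatively:
    the worst-supported sentence determines the overall score."""
    if not answer_sentences:
        return 0.0
    scores = [
        judge(
            "Is the sentence fully supported by the context? "
            "Answer with a score between 0 and 1.\n"
            f"Context:\n{context}\n\nSentence:\n{sentence}"
        )
        for sentence in answer_sentences
    ]
    return min(scores)  # worst-sentence aggregation


def completeness(relevant_sentences: list[str], answer: str, judge) -> float:
    """Fraction of relevant context sentences covered by the answer."""
    if not relevant_sentences:
        return 1.0
    covered = [
        judge(
            "Is this piece of context information covered by the answer? "
            "Answer with a score between 0 and 1.\n"
            f"Answer:\n{answer}\n\nContext sentence:\n{sentence}"
        )
        for sentence in relevant_sentences
    ]
    return sum(covered) / len(covered)
```

Taking the minimum makes a single unsupported sentence dominate the groundedness score, which is the conservative behavior described above.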

Contact
Martijn Krans
Email
martijn.krans@philips.com
Research area(s)
Generative AI and LLMs
Technical features

The evaluation framework relies entirely on prompt-based LLM judges for groundedness and completeness with respect to the retrieved context. Specialized prompts detect unsupported statements, missing essential information, and contextually invalid combinations of instructions across retrieved chunks. The approach favors interpretability, flexibility, and domain adaptability, allowing rapid prompt refinement as chatbot behavior evolves.
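As an illustration of such a specialized prompt, the sketch below flags contextually invalid combinations of instructions across retrieved chunks. The prompt text and the `llm` callable (any prompt-in/text-out function) are assumptions for illustration only.

```python
# Hedged sketch of one specialized judge: detecting answers that merge
# instructions from different chunks in a way no single source supports.

CROSS_CHUNK_PROMPT = """You are a strict evaluator of chatbot answers.
Each chunk below was retrieved from a separate part of the documentation.

Chunks:
{chunks}

Answer:
{answer}

Does the answer merge or chain instructions from different chunks into a
combined procedure that none of the chunks supports on its own?
Reply with YES or NO, followed by a one-sentence justification."""


def combines_chunks_invalidly(llm, chunks: list[str], answer: str) -> bool:
    # Number the chunks so the judge can refer to them individually.
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    verdict = llm(CROSS_CHUNK_PROMPT.format(chunks=numbered, answer=answer))
    return verdict.strip().upper().startswith("YES")
```

Because the judges are purely prompt-based, such checks can be refined quickly as chatbot behavior evolves, without retraining any model.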

Integration constraints

Solutions that use LLMs

Targeted customer(s)

Philips, and more broadly any industry developing chatbots

Conditions for reuse

Originally intended for internal use; licensing to be considered

Confidentiality
Public
Publication date
27-01-2026
Involved partners
Philips Electronics Nederland BV (NLD)