MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
Researchers have introduced MORTAR, an approach for testing Large Language Model (LLM)-based dialogue systems that targets multi-turn interactions. Whereas previous methods focused on single-turn testing, MORTAR addresses the oracle problem in multi-turn conversations by automatically generating dialogue test cases using various perturbations and metamorphic relations. The approach does not rely on LLM judges and reveals over 150% more bugs per test case than single-turn testing baselines. The bugs it finds are also of higher quality in terms of diversity, precision, and uniqueness, giving a more comprehensive evaluation of dialogue systems.
AI IMPACT: Enhances the quality assurance process for conversational AI, potentially leading to more robust and reliable dialogue systems.
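
To make the idea of a multi-turn metamorphic test concrete, here is a minimal Python sketch. It is an illustration only, not MORTAR's actual implementation: the distractor-insertion perturbation, the "final reply must not change" relation, and every name in it (`insert_distractor`, `violates_relation`, `toy_system`) are assumptions chosen for the example. The toy system is deliberately buggy so that the check has something to find, and the verdict comes from comparing two outputs directly, with no LLM judge involved.

```python
from typing import Callable, List

Dialogue = List[str]                # user turns, in order
System = Callable[[Dialogue], str]  # maps a dialogue to the reply for its last turn


def insert_distractor(dialogue: Dialogue, position: int, distractor: str) -> Dialogue:
    """Hypothetical perturbation: add a turn that should not affect the final answer."""
    return dialogue[:position] + [distractor] + dialogue[position:]


def violates_relation(system: System, source: Dialogue, follow_up: Dialogue) -> bool:
    """Hypothetical metamorphic relation: both dialogues must get the same final reply."""
    return system(source) != system(follow_up)


def toy_system(dialogue: Dialogue) -> str:
    """Deliberately buggy stand-in for a dialogue system: it only sees the last two turns."""
    context = dialogue[-2:]
    for turn in context:
        if turn.startswith("My name is "):
            return "Your name is " + turn[len("My name is "):].rstrip(".")
    return "I don't know your name."


# Source dialogue and a perturbed follow-up with one irrelevant turn inserted.
source = ["My name is Ada.", "What is my name?"]
follow_up = insert_distractor(source, 1, "Nice weather today.")

# The relation is violated because the distractor pushes the name out of context.
bug_found = violates_relation(toy_system, source, follow_up)
print(bug_found)
```

No ground-truth answer is needed here: the source dialogue's own reply serves as the reference for the perturbed dialogue, which is how metamorphic testing sidesteps the oracle problem.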