MORTAR system automates multi-turn testing for LLM dialogue systems

By PulseAugur Editorial · [1 source] · 2026-06-18 04:00

Researchers have introduced MORTAR, an approach for testing Large Language Model (LLM)-based dialogue systems that targets multi-turn interactions. Earlier methods focused on single-turn testing. MORTAR addresses the oracle problem in multi-turn conversations, where no reference answer exists against which to check each response, by automatically generating dialogue test cases using various perturbations and metamorphic relations. The system does not rely on LLM judges, and it reveals over 150% more bugs per test case than single-turn testing baselines. The bugs it finds are also reported to be of higher quality in terms of diversity, precision, and uniqueness, giving a more comprehensive evaluation of dialogue systems.
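
To make the idea of metamorphic testing over multiple turns concrete, here is a minimal Python sketch. It is not MORTAR's implementation: the perturbation (inserting an unrelated turn), the relation (answers to the original questions should stay the same), and every name in it (`run_dialogue`, `insert_distractor`, `check_insertion_relation`, `toy_system`) are assumptions made for illustration. It only shows how a check can be made without reference answers or an LLM judge, by comparing a system's outputs on an original dialogue and a perturbed one.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a dialogue system maps (context, history, question)
# to an answer. History is the list of earlier (question, answer) turns.
History = List[Tuple[str, str]]
DialogueSystem = Callable[[str, History, str], str]


def run_dialogue(system: DialogueSystem, context: str, questions: List[str]) -> List[str]:
    """Ask the questions one turn at a time, carrying the history forward."""
    history: History = []
    answers: List[str] = []
    for question in questions:
        answer = system(context, history, question)
        history.append((question, answer))
        answers.append(answer)
    return answers


def insert_distractor(questions: List[str], position: int, distractor: str) -> List[str]:
    """Illustrative perturbation: add an unrelated turn at the given position."""
    return questions[:position] + [distractor] + questions[position:]


def check_insertion_relation(
    system: DialogueSystem,
    context: str,
    questions: List[str],
    position: int,
    distractor: str,
) -> List[int]:
    """Assumed metamorphic relation: an unrelated extra turn should not change
    the answers to the original questions. Returns the indices of the original
    questions whose answers differ between the two runs."""
    source = run_dialogue(system, context, questions)
    perturbed = insert_distractor(questions, position, distractor)
    follow_up = run_dialogue(system, context, perturbed)
    # Drop the distractor's answer so both answer lists line up again.
    aligned = follow_up[:position] + follow_up[position + 1:]
    return [
        i
        for i, (s, f) in enumerate(zip(source, aligned))
        if s.strip().lower() != f.strip().lower()
    ]


def toy_system(context: str, history: History, question: str) -> str:
    """Hypothetical system under test with a deliberate multi-turn bug:
    it loses track once the history holds two or more turns."""
    facts = {"who wrote the report?": "Ada", "when was it published?": "2020"}
    if len(history) >= 2:
        return "unknown"
    return facts.get(question.lower(), "I don't know")


questions = ["Who wrote the report?", "When was it published?"]
violations = check_insertion_relation(
    toy_system, "", questions, position=1, distractor="What is the weather?"
)
print(violations)
```

In this toy setup the second question is answered correctly in the original dialogue but not after the extra turn, so its index is reported as a violation. Asking either question on its own would not expose the bug, which is why the check has to run over the whole dialogue.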

IMPACT Enhances the quality assurance process for conversational AI, potentially leading to more robust and reliable dialogue systems.

RANK_REASON Research paper detailing a new methodology for testing LLM dialogue systems.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen · 2026-06-18 04:00

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

arXiv:2412.15557v4 Announce Type: replace-cross Abstract: With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in si…
