PulseAugur

LLM dialogue agents show low accuracy in NFR assessment, impacting user satisfaction

A new research paper examines how well LLM-based dialogue assistants assess Non-Functional Requirements (NFRs) in software development, specifically in the context of HIPAA compliance. In the study, 49 programmers interacted with GitHub Copilot to evaluate 148 HIPAA-derived NFRs against the iTrust codebase. The findings indicate that although developers often agree with the LLM's assessments, the accuracy of those assessments against expert ground truth is low. User satisfaction falls with longer system responses and more information-providing turns, while proactive interactions tend to improve it.

IMPACT Highlights limitations in current LLM dialogue agents for critical NFR assessment, suggesting a need for improved interaction design to boost accuracy and user satisfaction.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Ali Pourghasemi Fatideh, Wilder Baldwin, Maria Dhakal, Collin McMillan, Sepideh Ghanavati

    Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

    arXiv:2606.24834v1. Abstract: LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of t…

  2. arXiv cs.AI TIER_1 English(EN) · Sepideh Ghanavati

    Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

    LLM-based dialogue assistants have become mainstream tools for software developers, yet current evaluation benchmarks focus exclusively on functional correctness. This leaves a critical gap in assessing the quality and accuracy of these conversations when handling Non-Functional …