PulseAugur

New benchmark reveals LLMs struggle with hidden social norms in planning

Researchers have introduced NormAct, a benchmark that evaluates how well multimodal large language models (MLLMs) comply with hidden social norms in embodied planning tasks. In experiments with GPT-5.4, Claude Opus 4.7, and Gemini 3 Pro, the models could achieve explicit goals but struggled with implicit social compliance, succeeding only 26.4% of the time. To address this, the authors propose NormPerceptor, a system that helps models infer and apply relevant norms and raises overall task success from 24.2% to 46.7%.

IMPACT Highlights a gap in how MLLMs acting as embodied agents reason about unstated social norms, with implications for building safer and more socially aware AI systems.

RANK_REASON The cluster describes a new academic benchmark and proposed method for evaluating LLM behavior, published on arXiv.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Shiyun Zhao, Xinwei Song, Tianyu Guo, Xiaomeng Gao, Mingyuan Liu, Xu Han, Yuanyuan Zhang, Zhenliang Zhang, Xue Feng, Bo Dai

    NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning

    arXiv:2606.27826v1. Abstract: Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task success requires not only achieving instructed goals but also acting in socially appropriate ways. While …

  2. arXiv cs.AI TIER_1 English(EN) · Bo Dai

    NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning

    Multimodal large language models (MLLMs) are increasingly deployed as embodied planners in egocentric environments, where task success requires not only achieving instructed goals but also acting in socially appropriate ways. While explicit goals may render certain actions optima…