A new benchmark evaluates large language models on their ability to answer real-world consumer device repair questions. The study found that while LLMs can offer some assistance, they are unreliable for high-risk tasks, particularly phone repair, because of errors in diagnosis and safety procedures. GPT-5.4 performed best among the six evaluated models, though performance in Bangla was consistently worse than in English.
AI IMPACT Highlights the need for safeguards and specialized evaluation of LLMs in high-risk, real-world applications.
RANK_REASON The cluster contains an academic paper introducing a new benchmark and evaluation of LLMs on a specific task.
AI-generated summary · Google Gemini · from 3 sources.