A new study analyzed 14,727 security and privacy prompts from the WildChat dataset, revealing that users frequently seek advice on protecting themselves online. Commercial large language models, such as GPT 5.5, performed best, providing adequate responses for 98% of prompts, while open-weight models like Llama 4 did so for only 47%. Despite their high average response quality, commercial models sometimes gave contradictory advice across different runs, potentially misleading users.
IMPACT Commercial LLMs show higher reliability in security advice, but consistency issues remain a concern for user safety.
RANK_REASON Research paper published on arXiv detailing analysis of LLM prompts and responses.
AI-generated summary · Google Gemini · from 2 sources.