Two new research papers highlight significant issues with current methods for evaluating AI-generated text. One paper reveals widespread under-reporting of human evaluation protocols in NLP conferences, hindering reproducibility and clarity. The second paper critiques the common use of generative perplexity for non-autoregressive models, arguing it can be 'hacked' to produce incoherent text while appearing to perform well. Both studies call for more robust and transparent evaluation metrics and methodologies.
AI IMPACT Highlights critical flaws in current AI text evaluation, potentially leading to more reliable benchmarks and model development.
RANK_REASON Two academic papers published on arXiv discussing fundamental issues with AI text evaluation metrics and protocols.
- Continuous flow-based language models
- Diffusion models
- generative perplexity
- gpt2-large
- LM1B
- non-autoregressive language models
- OpenWebText
- arXiv
- CL conference publications
- long-form text generation
AI-generated summary · Google Gemini · from 4 sources.