BLEU
PulseAugur coverage of BLEU: every cluster mentioning BLEU across labs, papers, and developer communities, ranked by signal.
-
New DiffCap-Bench benchmark evaluates multimodal LLMs on image difference captioning
Researchers have introduced DiffCap-Bench, a benchmark for evaluating how well multimodal large language models caption the differences between images. This benchmark addresses limitations in existing datasets by …
-
RAG+prompt system boosts Japanese-Chinese translation accuracy with linguistic analysis
Researchers have developed a retrieval-augmented generation (RAG) system combined with prompting techniques to improve Japanese-Chinese machine translation, particularly for sentences with noun-modifying clause construc…
-
VLMs over-correct math OCR, hiding student errors; new metric PINK improves evaluation
Researchers have identified a significant issue in evaluating handwritten math OCR systems, particularly with Vision-Language Models (VLMs). These models often over-correct student errors instead of accurately transcrib…
-
New study compares pose estimators for sign language translation systems
A new paper evaluates how effective different pose estimation systems are for sign language translation (SLT). Researchers compared common tools like MediaPipe Holistic and OpenPose against newer models such as SDPos…
-
LLM code translation evaluation moves beyond BLEU to semantic correctness
A new paper analyzes cross-lingual text simplification (CLTS) strategies for English and French using large language models. The study compared five prompting systems, including direct, composition, and decomposition ap…