BERTScore: Evaluating text generation with BERT
PulseAugur coverage of BERTScore: Evaluating text generation with BERT. Every cluster mentioning the paper across labs, papers, and developer communities, ranked by signal.
-
New AI models tackle low-resource Tangkhul-English translation
Researchers have developed two neural machine translation systems for the low-resource Tangkhul-English language pair. The primary system, utilizing ByT5-large fine-tuned on over 38,000 parallel sentences, achieved a BL…
-
LLM attribution metrics lack transferability across datasets, study finds
A new research paper investigates the reliability of automatic metrics used to evaluate attribution in retrieval-augmented generation (RAG) systems. The study found that common attribution metrics, including lexical, em…
-
LLMs struggle with Hausa and Fongbe translation, metrics unreliable
A new study evaluated the machine translation capabilities of four large language models (LLMs) for Hausa and Fongbe, two West African languages. The research found that while Hausa achieved acceptable translation quali…
-
New RECOM dataset reveals metric tradeoff in LLM evaluation
Researchers have introduced RECOM, a new evaluation dataset designed to assess automatic metrics for open-ended question answering, particularly for LLM-generated text. The dataset, comprising 15,000 r/AskReddit questio…
-
Researchers caution on synthetic data quality after fine-tuning Mistral 7B
Researchers have developed a method to fine-tune a 7B language model on free-tier GPUs by using an adapter-handoff technique. This approach allows for multi-epoch fine-tuning by checkpointing only the small LoRA adapter…
-
New geometric framework measures semantic information in text
Researchers have developed a new geometric framework to measure the semantic information contained within a text. This framework, detailed in a recent paper, offers a three-coordinate semantic profile that captures nove…
-
AI uses curriculum learning and multiple models for better medical text generation
Researchers have developed a new framework for medical text generation that uses a severity-aware curriculum learning approach with multiple large language models. This method trains models sequentially on cases of incr…
-
New framework uses multiple models for better text summarization
Researchers have developed a Multi-Model Adaptive Summarization Framework (MASF) to enhance abstractive text summarization. This framework integrates multiple fine-tuned transformer models, each generating a summary for…
-
New MATCHA metric improves LLM text evaluation by penalizing contradictions
Researchers have developed MATCHA, a new metric designed to more accurately evaluate the semantic similarity of text generated by large language models. Unlike existing metrics like ROUGE and BERTScore, which can incorr…
-
Medical QA RAG trainability hinges on checker output distribution, not accuracy
A new research paper explores the trainability of medical question-answering systems that use retrieval-augmented generation (RAG) guided by a Natural Language Inference (NLI) checker. The study reveals that the checker…
-
GraphRAG cuts token use by 60% on quantum papers
A project developed for the TigerGraph GraphRAG Inference Hackathon demonstrated that GraphRAG significantly reduces token consumption and improves accuracy for complex queries. By constructing a knowledge graph of enti…
-
Mistral, Qwen models show divergent strategies in biomedical text simplification
A new research paper compares the text simplification strategies of Mistral-Small and Qwen2.5 when applied to biomedical information. The study found that Mistral-Small effectively balances readability and accuracy, per…
-
Researchers improve medical VQA with trajectory-aware process supervision
Researchers have developed a novel method to improve medical visual question answering (VQA) systems by incorporating trajectory-aware process supervision. This approach utilizes a two-stage training framework, starting…
-
New DESG model improves AI therapist evaluation beyond LLM judges
Researchers have developed a new model-agnostic evaluator called Dynamic Emotional Signature Graphs (DESG) to assess the quality of AI-generated responses in mental health dialogues. This method moves beyond simple text…
-
LLMs favor their own resumes in hiring, study finds
A new study reveals that Large Language Models (LLMs) exhibit a significant self-preference bias in hiring processes, favoring resumes generated by themselves over human-written ones. This bias, ranging from 67% to 82% …
-
New RCD method optimizes LLM processing of long clinical texts within budget
Researchers have developed a new method called RCD for selecting relevant subsets of long clinical texts to reduce token costs for large language models. This approach frames the problem as a knapsack-constrained subset…
-
New HATS dataset integrates human perception for ASR evaluation
Researchers have introduced HATS, a new French dataset designed to evaluate Automatic Speech Recognition (ASR) systems by incorporating human perception. The dataset was created by having 143 individuals compare and sel…
-
New research proposes reasoning-aware training for better dialogue summarization
Researchers have developed a new framework for multi-role dialogue summarization that moves beyond traditional overlap metrics like ROUGE. Their approach incorporates explicit cognitive-style reasoning and reward-based …
-
ArgRE system uses formal argumentation to improve AI agent requirements negotiation
Researchers have developed ArgRE, a novel system for resolving conflicts in multi-agent requirements negotiation for complex software systems. ArgRE embeds Dung-style abstract argumentation, modeling proposals and criti…