PulseAugur
English(EN) Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

Recursive chunking performs best for RAG over Khmer agricultural documents

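Researchers evaluated four text chunking strategies within a retrieval-augmented generation (RAG) framework applied to Khmer agricultural documents. Recursive character-based chunking with a chunk size of 300 characters performed best: it achieved the lowest L2 distance and the highest answer relevance and Khmer Intersection over Union (IoU) scores, a significant improvement over the sentence-based approach.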
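To make the winning strategy concrete, here is a minimal Python sketch of recursive character-based chunking with a 300-character limit. It is an illustration under stated assumptions, not the paper's implementation: the function name `recursive_chunk`, the separator order in `SEPARATORS`, and the use of Unicode code points as "characters" are all hypothetical choices, since the summary does not specify them.

```python
from typing import List

# Hypothetical separator order, coarsest first (an assumption, not from the paper).
# "។" is the Khmer full stop (khan); "" means "fall back to a hard cut".
SEPARATORS = ["\n\n", "\n", "។", " ", ""]


def recursive_chunk(
    text: str, chunk_size: int = 300, separators: List[str] = SEPARATORS
) -> List[str]:
    """Split text into chunks of at most chunk_size code points.

    Tries the coarsest separator first and only recurses to finer
    separators for pieces that are still too long.
    """
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]

    # No usable separator left: cut at fixed offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_chunk(text, chunk_size, finer)

    # Keep the separator attached to the piece before it so no text is lost.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_chunk(piece, chunk_size, finer))
        elif len(current) + len(piece) <= chunk_size:
            # Greedily merge small pieces up to the size limit.
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `recursive_chunk(document_text, 300)` returns a list of strings that concatenate back to the original text. One caveat of this sketch: Python's `len` counts code points, and Khmer uses combining marks, so the hard-cut fallback can split a visible character cluster; a production splitter would likely cut on grapheme boundaries instead.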

Impact: Improves RAG performance for low-resource languages and may enable better information access in specialized domains.

Ranking rationale: An academic paper detailing an evaluation of text chunking strategies for a specific language and domain.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

Sources [2]

  1. arXiv cs.CL TIER_1 English(EN) · Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn ·

    Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

    arXiv:2605.22203v1. Abstract: In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The d…

  2. arXiv cs.CL TIER_1 English(EN) · Saksonita Khoeurn ·

    Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

    In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 mult…