PulseAugur
Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

New TOTEN framework improves tokenization of technical notation

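Researchers have developed TOTEN, a knowledge-based ontological tokenization framework intended to improve the semantic handling of technical notation in Brazilian Portuguese. Unlike conventional Byte-Pair Encoding, TOTEN uses a formal ontology of engineering entities to classify and represent physical quantities, units, and expressions. According to the evaluation, TOTEN significantly outperforms state-of-the-art baselines on ontological atomicity and numerical reconstruction, indicating robustness and accuracy.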
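To make the contrast with subword splitting concrete, here is a minimal Python sketch of the general idea of keeping a physical quantity as one classified token. It is not the TOTEN implementation: the `UNIT_CLASSES` table, the `Token` type, and the `tokenize` function are hypothetical names introduced for illustration, and the unit list and regular expression are assumptions. It assumes the Brazilian Portuguese decimal comma (for example `13,8 kV`).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mini-ontology: unit symbol -> physical quantity class.
# Illustrative entries only; not taken from the TOTEN paper.
UNIT_CLASSES = {
    "kV": "voltage",
    "V": "voltage",
    "A": "current",
    "kW": "power",
    "Hz": "frequency",
    "mm": "length",
}

# Try longer unit symbols first so "kV" is not read as "k" + "V".
_UNITS = "|".join(sorted(map(re.escape, UNIT_CLASSES), key=len, reverse=True))
# A number with an optional decimal comma, optional spaces, then a known unit.
QUANTITY_RE = re.compile(r"(?<!\w)(\d+(?:,\d+)?)\s*(" + _UNITS + r")(?!\w)")


@dataclass(frozen=True)
class Token:
    text: str
    kind: str  # "quantity" or "word"
    value: Optional[float] = None
    unit: Optional[str] = None
    quantity_class: Optional[str] = None


def tokenize(text: str) -> List[Token]:
    """Keep each number+unit pair as one token; split the rest on whitespace."""
    tokens: List[Token] = []
    pos = 0
    for match in QUANTITY_RE.finditer(text):
        # Plain words that precede this quantity.
        tokens.extend(Token(w, "word") for w in text[pos:match.start()].split())
        number, unit = match.group(1), match.group(2)
        tokens.append(
            Token(
                text=match.group(0),
                kind="quantity",
                value=float(number.replace(",", ".")),  # decimal comma -> point
                unit=unit,
                quantity_class=UNIT_CLASSES[unit],
            )
        )
        pos = match.end()
    # Plain words after the last quantity.
    tokens.extend(Token(w, "word") for w in text[pos:].split())
    return tokens


if __name__ == "__main__":
    for tok in tokenize("A tensão nominal é 13,8 kV em 60 Hz"):
        print(tok)
```

Because each quantity token carries its parsed value and unit, the original number can be recovered directly from the token, which is the property the summary refers to as numerical reconstruction. A real system would need a far larger unit inventory and handling for thousands separators, exponents, and symbolic expressions.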

Impact: This research could lead to more accurate, semantically aware processing of technical documentation and scientific literature.

Ranking rationale: This cluster contains a research paper detailing a new tokenization framework.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa ·

    Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

    arXiv:2606.19626v1. Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically…

  2. arXiv cs.CL TIER_1 English(EN) · Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa ·

    Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

    Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowled…