PulseAugur

New framework TOTEN improves tokenization of technical notation

Researchers have developed TOTEN, a knowledge-based ontological tokenization framework for technical notation in Brazilian Portuguese. Byte-pair encoding compresses vocabulary efficiently but fragments physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords; TOTEN instead uses a formal ontology of engineering entities to classify and represent them. In the reported evaluations, TOTEN significantly outperforms state-of-the-art baselines on ontological atomicity and numerical reconstruction.
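To make the contrast with subword tokenization concrete, here is a minimal Python sketch of ontology-guided tokenization. It is not TOTEN's implementation: the `UNIT_ONTOLOGY` table, the `Token` type, and the regular expressions are hypothetical stand-ins chosen for illustration. It only shows the general idea of keeping a number and its unit together as one typed token, and of recovering the numeric value from Brazilian Portuguese notation (decimal comma, thousands dot).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mini-ontology mapping unit symbols to quantity classes.
# Illustrative only; this is NOT the ontology used by TOTEN.
UNIT_ONTOLOGY = {
    "kV": "ElectricPotential",
    "V": "ElectricPotential",
    "MW": "Power",
    "kW": "Power",
    "Hz": "Frequency",
    "kg": "Mass",
    "m": "Length",
}


@dataclass(frozen=True)
class Token:
    kind: str                       # "QUANTITY" or "WORD"
    text: str                       # surface form as written
    value: Optional[float] = None   # parsed number (quantities only)
    unit: Optional[str] = None      # unit symbol (quantities only)
    quantity: Optional[str] = None  # ontology class (quantities only)


# Longest unit symbols first, so "kV" is tried before "V".
_UNITS = "|".join(sorted(map(re.escape, UNIT_ONTOLOGY), key=len, reverse=True))
# Either dot-grouped thousands ("13.800") or plain digits, with optional ",dd".
_NUMBER = r"\d{1,3}(?:\.\d{3})+(?:,\d+)?|\d+(?:,\d+)?"
_QUANTITY = re.compile(rf"(?P<num>{_NUMBER})\s*(?P<unit>{_UNITS})\b")
_WORD = re.compile(r"\w+|[^\w\s]")


def parse_br_number(text: str) -> float:
    """Convert '13.800,5' style notation to a float (13800.5)."""
    return float(text.replace(".", "").replace(",", "."))


def tokenize(sentence: str) -> List[Token]:
    """Emit each number+unit pair as one typed token; split the rest into words."""
    tokens: List[Token] = []
    pos = 0
    for match in _QUANTITY.finditer(sentence):
        # Plain words and punctuation that precede this quantity.
        tokens.extend(Token("WORD", w) for w in _WORD.findall(sentence[pos:match.start()]))
        unit = match.group("unit")
        tokens.append(
            Token(
                kind="QUANTITY",
                text=match.group(0),
                value=parse_br_number(match.group("num")),
                unit=unit,
                quantity=UNIT_ONTOLOGY[unit],
            )
        )
        pos = match.end()
    # Whatever follows the last quantity.
    tokens.extend(Token("WORD", w) for w in _WORD.findall(sentence[pos:]))
    return tokens


if __name__ == "__main__":
    for token in tokenize("A linha opera em 13,8 kV e 60 Hz."):
        print(token)
```

In this sketch, "13,8 kV" stays a single token carrying its value, unit, and quantity class, whereas a subword tokenizer may split the same span at arbitrary points. A real system would need a far richer ontology (prefixes, compound units, symbolic expressions) than this lookup table.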

IMPACT This research could lead to more accurate and semantically aware processing of technical documents and scientific literature.

RANK_REASON The cluster contains a research paper detailing a new framework for tokenization.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa ·

    Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

    arXiv:2606.19626v1 · Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically…

  2. arXiv cs.CL TIER_1 English(EN) · Antonio de Sousa Leitão Filho; Allan Kardec Duailibe Barros Filho; Fabrício Saul Lima; Selby Mykael Lima dos Santos; Rejani Bandeira Vieira Sousa ·

    Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

    Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowled…