Researchers have developed TOTEN, a knowledge-based ontological tokenization framework designed to improve the semantic understanding of technical notation in Brazilian Portuguese. Unlike traditional byte-pair encoding, TOTEN uses a formal ontology of engineering entities to classify and represent physical quantities, units, and expressions. In the reported evaluations, TOTEN significantly outperforms state-of-the-art baselines on ontological atomicity and numerical reconstruction.
AI IMPACT This research could lead to more accurate and semantically aware processing of technical documents and scientific literature.
RANK_REASON The cluster contains a research paper detailing a new framework for tokenization.
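To illustrate the general idea of ontology-aware tokenization described in the summary, here is a minimal Python sketch. It is not TOTEN's actual implementation, which the summary does not describe. The names `UNIT_ONTOLOGY`, `Token`, and `tokenize` are hypothetical, the unit table is a toy stand-in for a formal ontology, and the number pattern assumes Brazilian Portuguese formatting (dot as thousands separator, comma as decimal separator).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mini-ontology: unit symbol -> physical quantity class.
# A real ontology of engineering entities would be far richer.
UNIT_ONTOLOGY = {
    "kPa": "Pressure",
    "MPa": "Pressure",
    "m/s": "Velocity",
    "kg": "Mass",
    "mm": "Length",
    "m": "Length",
}


@dataclass(frozen=True)
class Token:
    kind: str                       # "QUANTITY" or "WORD"
    text: str                       # surface form as it appears in the input
    value: Optional[float] = None   # parsed magnitude (QUANTITY only)
    unit: Optional[str] = None      # unit symbol (QUANTITY only)
    quantity: Optional[str] = None  # ontology class (QUANTITY only)


# Longest unit symbols first, so "m/s" is tried before "m".
_UNITS = "|".join(
    re.escape(u) for u in sorted(UNIT_ONTOLOGY, key=len, reverse=True)
)
_PATTERN = re.compile(
    r"(?<!\w)(?P<num>\d+(?:\.\d{3})*(?:,\d+)?)\s*"
    r"(?P<unit>" + _UNITS + r")(?![A-Za-z])"
    r"|(?P<word>\w+)"
)


def tokenize(text: str) -> List[Token]:
    """Split text into tokens, keeping each number+unit pair atomic."""
    tokens = []
    for match in _PATTERN.finditer(text):
        if match.group("num") is not None:
            # Convert pt-BR number formatting to a Python float.
            raw = match.group("num").replace(".", "").replace(",", ".")
            unit = match.group("unit")
            tokens.append(
                Token("QUANTITY", match.group(0), float(raw), unit,
                      UNIT_ONTOLOGY[unit])
            )
        else:
            tokens.append(Token("WORD", match.group("word")))
    return tokens


if __name__ == "__main__":
    for tok in tokenize("A pressão é de 1.250,5 kPa e a velocidade 3,2 m/s."):
        print(tok)
```

In this sketch, `1.250,5 kPa` stays a single `QUANTITY` token whose numeric value (1250.5) and quantity class (`Pressure`) can be recovered directly, whereas a subword tokenizer such as byte-pair encoding may split the same span into several pieces with no explicit link between the number and its unit.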
- Antonio Leitao Filho
- Brazilian Portuguese
- byte-pair encoding
- EngQuant
- physical quantities
- quantulum3
- TOTEN
- Unicode Character Database
- Ontology of Engineering Entities
- Pint
AI-generated summary · Google Gemini · from 2 sources.