Brief · PulseAugur

TOOL · arXiv cs.AI English(EN) · 8h

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

Researchers have developed PACUTE, a diagnostic benchmark of 4,600 tasks designed to assess how well large language models (LLMs) understand Filipino morphology. Filipino is challenging because of its complex morphology, including infixation and reduplication, which standard tokenizers often fail to capture. Evaluations of both open-weight and frontier commercial LLMs found that frontier models are better at identifying morphemes but still struggle with productive morphological composition and syllabification, which remain a significant bottleneck for their linguistic capabilities.
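To make the two processes named above concrete, here is a minimal Python sketch of Filipino infixation and reduplication. It is an illustration under simplifying assumptions, not PACUTE's task format or code: the function names `infix_um` and `reduplicate_cv` are hypothetical, and the rules ignore stress, loanword consonant clusters, and other irregularities.

```python
VOWELS = "aeiou"


def infix_um(root: str) -> str:
    """Insert -um- before the first vowel of the root (simplified rule)."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[:i] + "um" + root[i:]
    return root  # no vowel found: leave the root unchanged


def reduplicate_cv(root: str) -> str:
    """Copy the root's onset plus first vowel and prepend it (simplified rule)."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[: i + 1] + root
    return root  # no vowel found: leave the root unchanged


if __name__ == "__main__":
    for root in ("sulat", "alis"):
        print(root, infix_um(root), reduplicate_cv(root))
```

With the root "sulat" (write), the infix lands inside the stem ("sumulat") and reduplication repeats its first syllable ("susulat"), so the root is no longer a contiguous, unmodified substring in the way a subword tokenizer might expect.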

IMPACT Identifies morphological composition as a persistent bottleneck for LLMs, guiding future research in linguistic understanding.

Filipino
PACUTE
Jann Railey Montalan