Tokens: Why ChatGPT Can't Count the R's in 'Strawberry'
Language models do not read text letter by letter. They first split it into tokens, which are typically chunks of a few characters, and operate on those. Subword tokenization is a compromise: a vocabulary of whole words would be unmanageably large, while a vocabulary of individual characters would make every input much longer and force the model to relearn basic spelling. The number of tokens determines API cost and how much text fits in the context window, so concise prompts save both money and space. Tokenization also explains why models struggle with tasks that need precise character-level analysis, such as counting a specific letter in a word: the model sees subword units, not individual characters.
AI IMPACT: Understanding tokenization is key for optimizing LLM prompts and managing costs.
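
The effect described above can be illustrated with a toy tokenizer. This is a minimal sketch, not how any production model tokenizes: the vocabulary `TOY_VOCAB`, the greedy longest-match rule, the resulting split of "strawberry", and the price passed to `estimate_cost` are all assumptions made for illustration. Real tokenizers learn much larger vocabularies from data and may split the word differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries from data.
TOY_VOCAB = {"str", "aw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

# Assign each vocabulary entry an integer ID, as a real tokenizer would.
TOKEN_IDS = {tok: idx for idx, tok in enumerate(sorted(TOY_VOCAB))}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into subword tokens using greedy longest-match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


def estimate_cost(num_tokens, usd_per_million_tokens):
    """Estimate cost for a token count at an assumed (hypothetical) price."""
    return num_tokens * usd_per_million_tokens / 1_000_000


word = "strawberry"
tokens = tokenize(word)
ids = [TOKEN_IDS[t] for t in tokens]

print("characters:", list(word))             # what a human counts over
print("tokens:    ", tokens)                 # what the tokenizer produces
print("model sees:", ids)                    # opaque IDs, with no letters inside
print("r's in the string:", word.count("r"))
print("tokens that are exactly 'r':", tokens.count("r"))
```

Counting over the string is trivial, but the model never receives the string. It receives a short list of IDs, and nothing in an ID such as the one for "berry" says that the chunk contains two r's; the model has to have learned that association separately. The same token count that hides the letters is also what billing and context limits are measured in, which is what `estimate_cost` sketches.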