Tokenization
Models & Architectures
Splitting text into tokens (processing units).
Tokenization divides text into units (e.g., words, subwords, or characters) that a model can map to integer IDs and process.
- Methods: BPE (byte-pair encoding), WordPiece, and Unigram, plus language- and domain-specific variants (a minimal BPE-style sketch follows this list).
- Impact: The choice of tokenizer affects effective context length, out-of-vocabulary (OOV) handling, and compute/token efficiency.
- Practice: Use consistent preprocessing pipelines and version the vocabulary together with model checkpoints.
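As a rough illustration of how a BPE-style method learns its vocabulary, the sketch below repeatedly merges the most frequent adjacent symbol pair in a tiny corpus. It is a minimal, self-contained assumption of the general algorithm: the function names (train_bpe, merge_pair) and the toy corpus are illustrative, not any library's API, and production tokenizers (e.g., Hugging Face tokenizers or SentencePiece) additionally handle normalization, byte fallback, and special tokens.

```python
from collections import Counter


def get_pair_counts(tokens_by_word):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in tokens_by_word:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(tokens_by_word, pair):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    merged = pair[0] + pair[1]
    out = []
    for symbols, freq in tokens_by_word:
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(merged)
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        out.append((new_symbols, freq))
    return out


def train_bpe(corpus, num_merges):
    """Learn a merge list from a whitespace-split corpus (illustrative only)."""
    word_freqs = Counter(corpus.split())
    # Start from characters, with an end-of-word marker so merges respect word boundaries.
    tokens_by_word = [(list(w) + ["</w>"], f) for w, f in word_freqs.items()]
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(tokens_by_word)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        tokens_by_word = merge_pair(tokens_by_word, best)
        merges.append(best)
    return merges


if __name__ == "__main__":
    # Frequent character pairs such as ("l", "o") and ("lo", "w") are merged first.
    print(train_bpe("low low low lower lowest newer newer wider", num_merges=10))
```

The learned merge list, applied in order to new text, is what determines how unseen words are split into subwords, which is why the same merges (and the same preprocessing) must be used at training and inference time.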