Tokenization

Models & Architectures

Splitting text into tokens (processing units).


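Tokenization divides text into units called tokens (typically subwords, but also whole words, characters, or bytes) and maps each one to an integer ID from a fixed vocabulary, which is the form a model can process.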

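  • Methods: Byte-pair encoding (BPE), WordPiece, Unigram; language- and domain-specific variants.
  • Impact: Determines how many tokens a given text consumes (and so how much fits in the context window), how out-of-vocabulary (OOV) words are handled, and overall compute efficiency.
  • Practice: Use the same preprocessing pipeline at training and inference time, and version the vocabulary together with the model.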
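
To make the BPE idea concrete, here is a minimal sketch in Python. It is a toy illustration, not how any particular library implements it: the function names (`train_bpe`, `merge_pair`, `encode`) and the tiny corpus are made up for this example, and real tokenizers add byte-level fallback, pre-tokenization rules, and special tokens. The sketch repeatedly merges the most frequent adjacent symbol pair, then applies the learned merges in order to new words.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the newly learned merge.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus, chosen only for illustration.
merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(encode("lowly", merges))   # ['low', 'l', 'y']
print(encode("newest", merges))  # unseen word falls back to single characters
```

The second call shows the OOV behavior mentioned above: a word with no learned merges is still representable, just as more (and shorter) tokens, which is also why tokenization affects how much text fits in a context window.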