Transformer

Models & Architectures

Sequence model using attention instead of recurrence or convolution.


The Transformer relies on self-attention to capture long-range dependencies efficiently: every position can attend directly to every other position, so information does not have to pass step by step through a recurrent chain.
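
As a rough illustration, below is a minimal single-head, unbatched sketch of scaled dot-product self-attention; the NumPy implementation and the projection matrices w_q, w_k, w_v are assumptions made for the example, not details from this entry.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over x of shape (seq_len, d_model).

    w_q, w_k, w_v are (d_model, d_k) projection matrices. Every position
    attends to every other position, which is how long-range dependencies
    are captured within a single layer.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # attention-weighted sum of values
```

The division by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.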

  • Advantages: Parallel processing across sequence positions, scalability to large models and datasets, and state-of-the-art performance on NLP and multimodal tasks.
  • Components: Multi-head attention, positional embeddings, feedforward blocks, residual connections, and layer normalization (see the sketch after this list).
  • Considerations: Limited context window, compute and memory costs that grow quickly with sequence length, and the optimization and memory-saving techniques needed to train and serve large models.
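
To show how these components fit together, the sketch below assembles one encoder block in PyTorch; the pre-norm layout, the default sizes (d_model=512, n_heads=8, d_ff=2048), and the omission of positional embeddings (which would be added to the token embeddings before the first block) are assumptions for the example, not details from this entry.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal pre-norm Transformer encoder block: multi-head self-attention,
    a position-wise feedforward network, residual connections, and layer norm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)               # queries, keys, values all come from x
        x = x + self.drop(attn_out)                    # residual connection around attention
        x = x + self.drop(self.ff(self.norm2(x)))      # residual connection around feedforward
        return x

x = torch.randn(2, 16, 512)                            # batch of 2 sequences, 16 tokens each
print(EncoderBlock()(x).shape)                         # torch.Size([2, 16, 512])
```

Stacking such blocks and processing all positions at once is what gives the architecture its parallelism; the trade-off is that attention cost grows quadratically with sequence length, which is where the context-window and memory considerations above come from.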