Transformer
Models & Architectures
Sequence model using attention instead of recurrence or convolution.
The Transformer relies on self-attention, in which every position attends to every other position in the sequence, to capture long-range dependencies without the sequential bottleneck of recurrence.
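A minimal NumPy sketch of the scaled dot-product attention at the heart of self-attention; the function name and toy shapes are illustrative, not a reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, values (toy shapes for illustration)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # each output mixes information from all positions
```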
- Advantages: Parallel processing, scalability, state-of-the-art performance in NLP and multimodal tasks.
- Components: Multi-head attention, positional embeddings, feedforward blocks, residuals, normalization (combined in the sketch after this list).
- Considerations: Context window, compute requirements, optimization/memory tricks.
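The block below is a minimal PyTorch sketch of how these components fit together in a single encoder layer; the dimensions match the original paper's base configuration but are otherwise illustrative, and the pre-norm layout shown is one common variant rather than the canonical formulation:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal pre-norm encoder block: self-attention and feedforward, each wrapped in a residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(              # position-wise feedforward block
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Feedforward sub-layer with residual connection
        x = x + self.ff(self.norm2(x))
        return x

# Usage: a batch of 2 sequences, 16 tokens each, already embedded into 512 dimensions
x = torch.randn(2, 16, 512)
out = TransformerBlock()(x)   # output keeps the same shape: (2, 16, 512)
```

Positional embeddings are added to the token embeddings before the first block, since attention itself is order-agnostic.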