This tutorial demonstrates how to build memory-efficient Transformer models on GPUs with the xFormers library. It implements memory-efficient attention and compares it with standard attention, then examines causal masking, packed sequences, grouped-query attention (GQA), and ALiBi positional biases. Finally, it combines these techniques into a trainable GPT-style model that uses xFormers attention and SwiGLU feed-forward layers with automatic mixed-precision training.
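To make the attention variants named above concrete, here is a minimal reference sketch of causal attention with grouped-query heads and an ALiBi bias. It is not the tutorial's code and does not call xFormers: it is plain standard-library Python meant only to show the arithmetic that a fused kernel (such as `xformers.ops.memory_efficient_attention`) would compute more efficiently. The function names `alibi_slopes` and `attention_reference` are hypothetical, and the slope formula assumes a power-of-two head count.

```python
import math


def alibi_slopes(n_heads):
    """Geometric ALiBi slopes, assuming n_heads is a power of two.

    Head i (0-based) gets slope 2^(-8 * (i + 1) / n_heads).
    """
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]


def attention_reference(q, k, v, n_kv_heads, use_alibi=True):
    """Naive causal attention with grouped-query heads (illustrative only).

    q is a nested list shaped [n_q_heads][seq][dim];
    k and v are nested lists shaped [n_kv_heads][seq][dim].
    """
    n_q_heads, seq, dim = len(q), len(q[0]), len(q[0][0])
    group = n_q_heads // n_kv_heads  # query heads that share one K/V head
    slopes = alibi_slopes(n_q_heads) if use_alibi else [0.0] * n_q_heads
    out = []
    for h in range(n_q_heads):
        kh, vh = k[h // group], v[h // group]  # GQA: look up the shared K/V head
        head_out = []
        for i in range(seq):
            # Causal mask: query position i only attends to keys j <= i.
            scores = []
            for j in range(i + 1):
                dot = sum(a * b for a, b in zip(q[h][i], kh[j]))
                # ALiBi: subtract a per-head linear penalty on the distance i - j.
                scores.append(dot / math.sqrt(dim) - slopes[h] * (i - j))
            # Numerically stable softmax over the visible keys.
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            head_out.append(
                [sum(w[j] * vh[j][d] for j in range(i + 1)) / z for d in range(dim)]
            )
        out.append(head_out)
    return out
```

With four query heads and `n_kv_heads=2`, heads 0 and 1 read the first key/value head and heads 2 and 3 read the second, which is the sharing that lets GQA shrink the key/value tensors. The sketch only fixes the expected outputs; it says nothing about the memory savings a fused implementation provides.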
IMPACT Provides practical guidance for optimizing Transformer models, potentially reducing memory use and computational cost and improving inference speed.
RANK_REASON The item is a tutorial demonstrating implementation of existing techniques for optimizing transformer models, rather than a novel research paper or a new model release.
- ALiBi
- Causal attention
- CUDA
- generative pre-trained transformer
- GQA
- GPU
- PyTorch
- SwiGLU
- transformers
- xFormers
AI-generated summary · Google Gemini · from 1 source.