Next Generation of FlashAttention | NVIDIA Technical Blog

NVIDIA is excited to collaborate with Colfax, Together.ai, Meta, and Princeton University on their recent achievement: exploiting the Hopper GPU architecture and Tensor Cores to accelerate key fused attention kernels using CUTLASS 3.

FlashAttention-3 incorporates key techniques to achieve 1.5–2.0x faster performance than FlashAttention-2 with FP16, up to 740 TFLOPS. With FP8, FlashAttention-3 reaches up to 1.2 PFLOPS, with 2.6x smaller errors than baseline FP8 attention.
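As a reminder of what these kernels compute, attention is softmax(QK^T / sqrt(d))V. The following NumPy sketch shows the unfused reference computation that FlashAttention kernels replace with a single fused kernel; it is an illustration only, not the FlashAttention-3 implementation, and the array shapes are chosen arbitrarily for the example.

```python
import numpy as np

def reference_attention(Q, K, V):
    """Unfused scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    FlashAttention fuses these steps into one kernel so the full
    N x N score matrix never has to be materialized in HBM.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N) attention scores
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over keys
    return weights @ V                             # (N, d) output

# Tiny example: 4 queries/keys/values of head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = reference_attention(Q, K, V)
```

The fused kernel computes the same result tile by tile, which is what makes the Tensor Core and asynchrony optimizations described in the linked post possible.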

CUTLASS is an open-source CUDA library that enables deep learning and HPC practitioners to achieve speed-of-light performance on NVIDIA Tensor Core GPUs for custom algorithms, in research and production workloads alike.

For more information about the collaboration, see the FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision post and research paper.