Feb 07, 2026 How Thread Block Swizzling boosts L2 Cache Hit Rate in Matrix Multiplication
Jan 30, 2026 Implementing Flash Attention: Backward Pass in Triton