-
Understanding Activation Memory Dynamics in Pipeline Parallelism Variants
How the 1F1B schedule in PipeDream reduces the activation memory held on each GPU compared to naive pipeline-parallel schedules such as GPipe.
-
How Thread Block Swizzling boosts L2 Cache Hit Rate in Matrix Multiplication
A visual companion to Triton's official matmul tutorial, explaining how thread-block-level swizzling increases the L2 cache hit rate.
-
Implementing Flash Attention: Backward Pass in Triton
In this follow-up to Nathan Chen's Triton Flash Attention Kernel Walkthrough: The Forward Pass, we dive into the gradient computation for queries, keys, and values in Flash Attention's backward pass.
-
Viewing Transformer Layers from an Online Optimization Perspective
In this blog post co-authored with Wenhao Chai, we revisit the landscape of efficient Transformer variants from a unified view of fast weight programming.