-
Understanding Activation Memory Dynamics in Pipeline Parallelism Variants
How the 1F1B schedule in PipeDream reduces the activation memory held on each GPU compared to naive pipeline-parallel schedules such as GPipe.
-
How Thread Block Swizzling boosts L2 Cache Hit Rate in Matrix Multiplication
A visual companion to Triton's official matmul tutorial, explaining how thread-block-level swizzling increases the L2 cache hit rate.
-
Implementing Flash Attention: Backward Pass in Triton
In this follow-up to Nathan Chen's Triton Flash Attention Kernel Walkthrough: The Forward Pass, we dive into the gradient computation for queries, keys, and values in Flash Attention's backward pass.
-
Viewing Transformer Layers from an Online Optimization Perspective
In this blog post co-authored with Wenhao Chai, we revisit the landscape of efficient Transformer variants from a unified view of fast weight programming.