In this blog post co-authored with Wenhao Chai, we revisit the landscape of efficient Transformer variants from a unified view of fast weight programming.
The core idea is to view the hidden state of linear attention models not just as a memory buffer, but as a set of Fast Weights that are dynamically updated during the forward pass. This contrasts with the “Slow Weights” (the model parameters) learned during training.
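To make this concrete, here is a minimal sketch of this fast-weight reading of (unnormalized) linear attention; the function and variable names are our own illustration, not taken from any particular library, and normalization and the learned q/k/v projections (the slow weights) are omitted:

```python
import torch

def linear_attention_step(S, q_t, k_t, v_t):
    """One recurrent step of (unnormalized) linear attention.

    S   : (d_v, d_k) fast-weight matrix, i.e. the hidden state
    q_t : (d_k,) query for the current token
    k_t : (d_k,) key for the current token
    v_t : (d_v,) value for the current token
    """
    # Write: add the outer product v_t k_t^T into the fast weights.
    S = S + torch.outer(v_t, k_t)
    # Read: the output is the fast weights applied to the query.
    o_t = S @ q_t
    return S, o_t

# Unrolled over a toy sequence: S changes with every token at inference time,
# while the projections that would produce q, k, v are the learned slow weights.
d_k, d_v, T = 16, 16, 8
S = torch.zeros(d_v, d_k)
qs, ks, vs = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_v)
outputs = []
for t in range(T):
    S, o_t = linear_attention_step(S, qs[t], ks[t], vs[t])
    outputs.append(o_t)
```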
From this online optimization perspective, the state update rule in architectures like Linear Attention, DeltaNet, and TTT (Test-Time Training) can be interpreted as a step of gradient descent on a specific objective function. For example, standard linear attention can be read as a gradient step on a simple inner-product objective that pushes the state's prediction for the current key toward the corresponding target value.
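As a rough illustration of this reading (our own sketch, not code from any of these papers), the same fast-weight update can be written as one step of online gradient descent, where the choice of per-token loss recovers different architectures: an inner-product loss gives the plain additive update of linear attention, while a squared-error loss gives the delta-rule update used in DeltaNet. TTT generalizes the idea further by taking the gradient step on a small learnable module rather than a single matrix.

```python
import torch

def gd_step(S, k_t, v_t, loss, lr=1.0):
    """One online gradient-descent step on the fast weights S."""
    S = S.detach().requires_grad_(True)
    loss(S, k_t, v_t).backward()
    with torch.no_grad():
        return S - lr * S.grad

# Linear attention: L(S) = -<S k_t, v_t>, whose gradient is -v_t k_t^T,
# so the GD step recovers the additive update S <- S + lr * v_t k_t^T.
def linear_attention_loss(S, k_t, v_t):
    return -torch.dot(S @ k_t, v_t)

# DeltaNet: L(S) = 1/2 ||S k_t - v_t||^2, whose gradient is (S k_t - v_t) k_t^T,
# so the GD step recovers the delta rule S <- S - lr * (S k_t - v_t) k_t^T.
def delta_rule_loss(S, k_t, v_t):
    return 0.5 * torch.sum((S @ k_t - v_t) ** 2)
```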
We provide a framework that connects several recently proposed architectures under this common view.
By viewing these models as a family of parametric systems that update an internal representation from the input stream, we can better characterize what they can express and how this update mechanism relates to their continual learning behavior.