An efficient sequence parallelism method tailored for linear attention-based language models
Researchers at the Shanghai Artificial Intelligence Laboratory and TapTap have proposed Linear Attention Sequence Parallelism (LASP), a technique that optimizes sequence parallelism for linear-attention transformers. It uses point-to-point (P2P) communication to exchange intermediate states between GPUs, within or across nodes, and exploits the right-product kernel trick of linear attention. Importantly, it does not rely on partitioning along attention heads, making it compatible with multi-head, multi-query, and grouped-query attention.
LASP uses a tiling scheme that divides the input sequence into subsequence chunks distributed across GPUs, and splits the attention computation into intra-chunk and inter-chunk parts to exploit the right-product advantage of linear attention: conventional (left-product) attention is computed within each chunk, while the kernel trick is applied across chunks through an accumulated key-value state. The method also specifies data distribution, forward-pass, and backward-pass mechanisms to improve parallel processing efficiency.
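The intra-/inter-chunk split described above can be sketched in a few lines. The following is a minimal single-process NumPy illustration, not the paper's implementation: the function name, shapes, and identity feature map are assumptions for clarity. The running state `S` is the quantity that, in LASP, would be handed from GPU to GPU via P2P communication.

```python
import numpy as np

def blockwise_linear_attention(Q, K, V, block_size):
    """Causal linear attention computed chunk by chunk (illustrative sketch).

    Intra-chunk: masked (Q K^T) V within the chunk (left product).
    Inter-chunk: Q_i @ S, where S accumulates K^T V from earlier chunks
    (right-product kernel trick). In LASP, S is the state exchanged
    between GPUs; here all chunks run sequentially in one process.
    """
    n, d = Q.shape
    S = np.zeros((d, d))                     # running KV state from previous chunks
    out = np.zeros_like(V)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        q, k, v = Q[start:end], K[start:end], V[start:end]
        b = end - start
        mask = np.tril(np.ones((b, b)))      # causal mask inside the chunk
        intra = (q @ k.T * mask) @ v         # left product, within chunk
        inter = q @ S                        # right product, across chunks
        out[start:end] = intra + inter
        S += k.T @ v                         # fold this chunk into the state
    return out
```

Because `S` is a fixed `d x d` matrix regardless of sequence length, each chunk only needs the state from its predecessor, which is why a lightweight P2P hand-off between devices suffices.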
Paper: https://arxiv.org/abs/2404.02882
GitHub: https://github.com/OpenNLPLab/LASP