A curated list of efficient attention modules (last update: Wed, 10 Mar 2021 23:52:22 +0000)
Paper (citations) | Implementation | AutoRegressive | Main Idea

Generating Wikipedia by Summarizing Long Sequences (282) | memory-compressed-attention | ✔️ | compresses keys and values + blocked attention

CBAM: Convolutional Block Attention Module (999+) | attention-module | ❌ | combines SE (channel) attention with a per-pixel (local) spatial weight

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks (16) | set_transformer | ❌ | uses K relay nodes (inducing points)

CCNet: Criss-Cross Attention for Semantic Segmentation (296) | CCNet | ❌ | each pixel attends to its row and column simultaneously

Efficient Attention: Attention with Linear Complexities (16) | efficient-attention | ❌ | Softmax(Q) * (Softmax(K^T) * V)
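The factorization in the entry above is the whole trick: matrix multiplication is associative, so Softmax(K)^T V can be formed first as a small d x d summary. A minimal NumPy sketch (illustrative only, not code from the linked repo):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(Q, K, V):
    # Normalize queries over the feature axis and keys over the sequence
    # axis, then multiply right-to-left: K^T V is only (d, d).
    return softmax(Q, axis=-1) @ (softmax(K, axis=0).T @ V)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = efficient_attention(Q, K, V)  # shape (n, d), computed in O(n * d^2)
```

Because both softmaxes are applied before the product, the result is identical (up to floating point) to first building the full n x n map, but the cost drops from O(n^2 * d) to O(n * d^2).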

Star-Transformer (40) | fastNLP | ❌ | uses a relay (global) node; every token attends to/from that node

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond (199) | GCNet | ❌ | squeeze-and-excitation with attention pooling (instead of global average pooling)

Generating Long Sequences with Sparse Transformers (257) | DeepSpeed | ✔️ | sparse block-based attention

SCRAM: Spatially Coherent Randomized Attention Maps (1) | - | ✔️ | uses PatchMatch to find close keys

Interlaced Sparse Self-Attention for Semantic Segmentation (24) | IN_PAPER | ✔️ | combination of short-range and then long-range (dilated) attention

Permutohedral Attention Module for Efficient Non-Local Neural Networks (3) | Permutohedral_attention_module | ❌ | uses a permutohedral-lattice approximation algorithm to approximate the attention output

Large Memory Layers with Product Keys (43) | XLM | ✔️ | searches for nearest-neighbor keys

Expectation-Maximization Attention Networks for Semantic Segmentation (79) | EMANet | ❌ | applies expectation-maximization to cluster keys into k clusters

BP-Transformer: Modelling Long-Range Context via Binary Partitioning (15) | BPT | ✔️ | attends to distant tokens coarsely and to close tokens in a more fine-grained manner

Compressive Transformers for Long-Range Sequence Modelling (48) | compressive-transformer-pytorch | ✔️ | compresses distant tokens instead of just stop_grad()-ing them; a more efficient version of Transformer-XL

Axial Attention in Multidimensional Transformers (36) | axial-attention | ✔️ | applies attention along each axis separately
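Axial attention factorizes 2-D attention into two 1-D passes. A rough single-head NumPy sketch for an (H, W, d) feature map (shapes and names are illustrative, not the axial-attention package's API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    # Plain dot-product self-attention along the second-to-last axis.
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def axial_attention(x):
    # x: (H, W, d). Attend within each row, then within each column:
    # two small batched maps instead of one (H*W) x (H*W) attention map.
    x = attend(x)                                         # rows
    x = np.swapaxes(attend(np.swapaxes(x, 0, 1)), 0, 1)   # columns
    return x

x = np.random.default_rng(0).standard_normal((8, 8, 16))
out = axial_attention(x)  # shape (8, 8, 16)
```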

Reformer: The Efficient Transformer (216) | trax | ✔️ | uses LSH to find close keys
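The LSH idea can be sketched as angular hashing with random projections: vectors pointing in similar directions tend to land in the same bucket, and attention is then computed only within buckets. An illustrative NumPy sketch (Reformer's actual scheme adds multiple hash rounds, sorting/chunking, and causal masking):

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    # Project onto random directions; the argmax over [xR, -xR] is the
    # bucket id. Vectors with high cosine similarity tend to collide.
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 16))
buckets = lsh_buckets(keys, n_buckets=8, rng=rng)  # one bucket id per key
```

Identical vectors always hash to the same bucket, so exact matches are never missed; near-matches are found with high probability across rounds.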

Sparse Sinkhorn Attention (16) | sinkhorn-transformer | ✔️ | uses a cost matrix to limit attention between buckets

Transformer on a Diet (2) | transformer-on-diet | ✔️ | dilated transformer, WaveNet-style

Time-aware Large Kernel Convolutions (9) | TaLKConvolutions | ✔️ | calculates the mean over a dynamic subsequence around each token with the help of a summed-area table
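The summed-area-table idea reduces each windowed mean to two lookups into a prefix-sum array. A 1-D NumPy sketch with a fixed window (in TaLK the left/right offsets are predicted per token; constants here are purely for illustration):

```python
import numpy as np

def local_mean(x, left, right):
    # 1-D summed-area table: prefix[i] = sum of x[:i]. Any window sum is
    # then two lookups, so each output position costs O(1).
    n = len(x)
    prefix = np.concatenate([[0.0], np.cumsum(x)])
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - left), min(n - 1, i + right)
        out[i] = (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1)
    return out

x = np.arange(8, dtype=float)
means = local_mean(x, left=1, right=1)  # mean of each centred 3-wide window
```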

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection (2) | - | ✔️ | learns the q-k connections, i.e. dynamically creates a sparse attention matrix

Efficient Content-Based Sparse Attention with Routing Transformers (38) | routing-transformer | ✔️ | computes attention among same-cluster tokens (clusters computed by online k-means)

Neural Architecture Search for Lightweight Non-Local Networks (11) | AutoNL | ❌ | computes Q(KV) and also downsamples q, k, v in both the spatial and channel dimensions

Longformer: The Long-Document Transformer (159) | longformer | ✔️ | global + blocked attention

ETC: Encoding Long and Structured Inputs in Transformers (16) | - | ❌ | combines global attention (Star-Transformer with multiple global tokens) with local attention

Multi-scale Transformer Language Models (2) | IN_PAPER | ✔️ | UNet-like architecture + retina attention, close to BP-Transformer

Synthesizer: Rethinking Self-Attention in Transformer Models (26) | Synthesizer-Rethinking-Self-Attention-Transformer-Models | ✔️ | does not compute pairwise interactions

Jukebox: A Generative Model for Music (45) | jukebox | ✔️ | better attention patterns, derived from Sparse Transformer

Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers (0) | - | ✔️ | does not compute pairwise interactions and uses fixed mask patterns

GMAT: Global Memory Augmentation for Transformers (2) | gmat | ❌ | adds global tokens

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (45) | fast-transformers | ✔️ | uses phi(q) * (phi(k) * v) and also improves the sequential sampling step
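With a positive feature map phi, attention becomes phi(Q)(phi(K)^T V) normalized by phi(Q) . sum(phi(K)), which is linear in sequence length. A NumPy sketch using the elu(x) + 1 feature map from the paper (non-causal case; the repo's CUDA kernels and RNN-style sampling are not reproduced here):

```python
import numpy as np

def phi(x):
    # elu(x) + 1: positive feature map used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Build the (d, d) summary phi(K)^T V once, then apply it to every
    # query; the normalizer phi(Q) . sum(phi(K)) replaces the softmax.
    Qf, Kf = phi(Q), phi(K)
    context = Kf.T @ V              # (d, d_v)
    norm = Qf @ Kf.sum(axis=0)      # (n,)
    return (Qf @ context) / norm[:, None]

n, d = 256, 32
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)  # shape (n, d), cost linear in n
```

Each output row is still a convex combination of value rows, exactly as if the n x n map phi(Q) phi(K)^T had been built and row-normalized explicitly.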

Linformer: Self-Attention with Linear Complexity (47) | linformer-pytorch | ❌ | projects keys and values from (n x d) down to (k x d)
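Linformer's projection amounts to two extra matrices E, F of shape (k, n) that shrink the sequence axis of K and V before attention, so the attention map is n x k instead of n x n. An illustrative NumPy sketch (random projections stand in for the learned ones):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    # E, F: (k, n) projections of the sequence axis (learned in the
    # paper, random here). The attention map shrinks to n x k.
    Kp, Vp = E @ K, F @ V                            # (k, d)
    A = softmax(Q @ Kp.T / np.sqrt(Q.shape[-1]))     # (n, k)
    return A @ Vp

n, k, d = 512, 64, 32
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
E, F = rng.standard_normal((2, k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)  # shape (n, d)
```

The fixed sequence-axis projection is also why the method is marked non-autoregressive above: future positions get mixed into the compressed keys and values.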

Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (8) | google-research | ✔️ | calculates an unbiased stochastic approximation of the attention matrix

Kronecker Attention Networks (1) | kronecker-attention-pytorch | ❌ | uses horizontal and lateral average matrices

Real-time Semantic Segmentation with Fast Attention (5) | - | ❌ | l2_norm(q) * (l2_norm(k) * v)

Fast Transformers with Clustered Attention (6) | fast-transformers | ❌ | groups queries together with LSH

Big Bird: Transformers for Longer Sequences (60) | DeepSpeed | ❌ | ETC with random connections

Tensor Low-Rank Reconstruction for Semantic Segmentation (3) | - | ❌ | decomposes the full attention tensor into rank-one tensors (CP decomposition)

Looking for Change? Roll the Dice and Demand Attention (0) | IN_PAPER | ❌ | uses the fractal Tanimoto similarity to compare queries with keys inside the attention module

Rethinking Attention with Performers (30) | google-research | ✔️ | unbiased approximation of the attention matrix with a softmax kernel
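The positive random features behind the Performer estimator approximate the softmax kernel exp(q . k) without bias. A small NumPy sketch of the kernel estimator alone (the full FAVOR+ mechanism also orthogonalizes the projections, normalizes, and batches this over all queries and keys):

```python
import numpy as np

def positive_random_features(x, W):
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m); for rows of W drawn
    # from N(0, I), E[phi(q) . phi(k)] = exp(q . k), the softmax kernel.
    m = W.shape[0]
    return np.exp(x @ W.T - 0.5 * (x * x).sum(axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 20000
W = rng.standard_normal((m, d))
q, k = 0.1 * rng.standard_normal((2, d))
est = (positive_random_features(q[None], W) @ positive_random_features(k[None], W).T)[0, 0]
exact = np.exp(q @ k)  # est approximates exact; error shrinks as 1/sqrt(m)
```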

Memformer: The Memory-Augmented Transformer (0) | memformer | ✔️ | attends to memory slots + Memory-Replay Back-Propagation

SMYRF: Efficient Attention using Asymmetric Clustering (1) | smyrf | ❌ | LSH with balanced clusters

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting (0) | Informer2020 | ✔️ | sparse attention + funnel-like encoder

Sub-Linear Memory: How to Make Performers SLiM (0) | google-research | ✔️ | Performer, but with sub-linear memory usage

Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (0) | Nystromformer | ❌ | uses the Nyström method to approximate the attention matrix

Linear Transformers Are Secretly Fast Weight Memory Systems (0) | fast-weight-transformers | ✔️ | shows that linear transformers are basically fast-weight networks + proposes a new kernel function to linearise attention, balancing simplicity and effectiveness

LambdaNetworks: Modeling Long-Range Interactions Without Attention (6) | lambda-networks | ✔️ | generates a linear layer based on context + decouples position and content

Random Feature Attention (2) | - | ✔️ | kernel approximation; also develops the transformers-as-RNNs view