Transformer models have achieved great success in NLP and have also gained significant interest and driven breakthroughs in other domains such as computer vision/image, speech, audio, protein sequences, organic molecules, tabular data, and multimodal learning, including Tesla's adoption of the Transformer in its vision architecture.

As the vanilla Transformer has evolved, a variety of optimizations have been introduced by both academia and industry to address the bottlenecks of the original architecture.
This article will go over several of the Transformer's siblings (Longformer, Reformer, Linformer, Performer, …) and focus on exploring the different optimization strategies and inspiring projects that make the Transformer architecture more efficient and able to handle longer input sequences.
Later, we will discuss a couple of industry implementations that optimize the Transformer's inference performance.
If you are not familiar with the Transformer and the attention mechanism, you can refer to the original paper and this illustration to pick them up.
First, let's recap the Transformer architecture:

The problems of the Transformer:
- It scales poorly with the length of the input sequence (the self-attention layer becomes the bottleneck in the Transformer encoder and decoder blocks as the input sequence grows longer).
- It requires quadratic computation time and space to produce all the similarity scores in each self-attention layer.
Before diving deep into the optimizations and mainstream solutions, let's first take a look at how self-attention and the full attention matrix are calculated, since the majority of this article covers how to optimize this attention matrix calculation.
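To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function name, toy shapes, and random inputs are illustrative assumptions, not code from any of the papers discussed; the point is simply that the score matrix has shape (seq_len, seq_len), which is the source of the quadratic time and memory cost.

```python
import numpy as np

def full_self_attention(Q, K, V):
    """Naive full (dense) scaled dot-product attention for one head.

    Q, K: (seq_len, d_k), V: (seq_len, d_v).
    The score matrix Q @ K.T is (seq_len, seq_len), so time and memory
    grow quadratically with the sequence length.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) -- the O(n^2) bottleneck
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over all positions
    return weights @ V                              # (seq_len, d_v)

# Toy usage: even a 512-token sequence already produces a 512 x 512 score matrix per head.
seq_len, d_model = 512, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_model))
K = rng.standard_normal((seq_len, d_model))
V = rng.standard_normal((seq_len, d_model))
out = full_self_attention(Q, K, V)
print(out.shape)  # (512, 64)
```

Doubling the sequence length quadruples the number of entries in that score matrix, which is exactly the scaling behavior the efficient-Transformer variants below try to avoid.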