The Transformer architecture has achieved state-of-the-art results in many NLP (Natural Language Processing) tasks. One of the main breakthroughs with the Transformer model is the powerful GPT-3, released in mid-2020, whose paper won a Best Paper award at NeurIPS 2020.
In computer vision, CNNs have been the dominant models for vision tasks since 2012. With the Transformer, computer vision and NLP are increasingly converging on a single, more efficient class of architectures. Using Transformers for vision tasks has become a new research direction, aimed at reducing architectural complexity and exploring scalability and training efficiency.
The following are several well-known projects in this line of work:
- DETR (End-to-End Object Detection with Transformers) uses a Transformer for object detection and segmentation.
- Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale) uses a Transformer for image classification.
- Image GPT (Generative Pretraining from Pixels) uses a Transformer for pixel-level image completion, just as other GPT models do for text completion.
- End-to-end Lane Shape Prediction with Transformers uses a Transformer for lane-marking detection in autonomous driving.
Architecture
Overall, there are two major model architectures in the work adopting Transformers in CV. One is a pure Transformer architecture; the other is a hybrid architecture that combines CNNs/backbones with the Transformer.
- Pure Transformer
- Hybrid (CNNs + Transformer)
The Vision Transformer is a pure self-attention-based Transformer architecture without CNNs and can be used out of the box, while DETR is an example of the hybrid architecture, which combines a convolutional neural network (CNN) backbone with the Transformer.
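To make the pure-Transformer idea concrete, here is a minimal sketch of the Vision Transformer's tokenization step: the image is cut into non-overlapping 16x16 patches, and each patch is flattened into a vector that serves as one input "word" for the Transformer. The `patchify` helper is a hypothetical illustration; the real model additionally applies a learned linear projection, a class token, and position embeddings, all omitted here.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. one token vector per patch, mirroring ViT's input tokenization.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly into patches"
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes together
             .reshape(-1, patch_size * patch_size * c)  # flatten each patch into one vector
    )
    return patches

# A 224x224 RGB image yields 14 * 14 = 196 tokens, each of dimension 16 * 16 * 3 = 768.
image = np.zeros((224, 224, 3))
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

This is exactly where "an image is worth 16x16 words" comes from: after this step, the 196 patch vectors are treated the same way a Transformer treats a sequence of word embeddings.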