INTRODUCTION
For decades, convolutional neural networks (CNNs) have dominated computer vision, performing admirably in image classification, object detection, and other visual tasks. Their capacity to learn hierarchical features and exploit spatial correlations within images has established them as the de facto standard. However, recent advances in transformer architectures have produced a new contender: vision transformers (ViTs). This article examines the fundamental principles of CNNs and ViTs, their respective strengths and drawbacks, and the ongoing shift in the computer vision field.
CONVOLUTIONAL NEURAL NETWORKS: THE PRINCIPAL DRIVERS OF SPATIAL REASONING
CNNs are a class of deep learning model built specifically to analyse visual input. Their architecture is loosely modelled on the biological structure of the visual cortex, using convolutional layers that exploit the spatial locality of natural images. Convolutional kernels, commonly known as filters, slide across an image to extract local features such as edges, textures, and simple shapes. These features are then combined and transformed in subsequent layers, producing progressively richer representations of the image.
CNNs are notable for their innate grasp of spatial relationships. By choosing filter sizes and strides, they can learn to detect features at various scales and positions within an image, which is critical for applications such as object detection and image segmentation that demand precise localization and boundary determination. CNNs also benefit from parameter sharing: the same filter is applied across every region of the image, which reduces the number of parameters to be learned and improves both generalisation and computational efficiency. The sketch below illustrates these ideas.
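To make the stride and parameter-sharing ideas concrete, here is a minimal sketch of a small CNN classifier in PyTorch. The layer widths, kernel sizes, and the TinyCNN name are illustrative choices, not a reference architecture.

```python
# A minimal sketch of a small CNN, showing how stacked convolutions with
# chosen kernel sizes and strides build up hierarchical features.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Each Conv2d layer learns one bank of filters that is reused at
            # every spatial position (parameter sharing).
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),   # edges, textures
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # downsample, simple shapes
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # more abstract parts
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)                  # (B, 64, 1, 1)
        return self.classifier(h.flatten(1))  # (B, num_classes)

# Usage: a batch of two 224x224 RGB images.
logits = TinyCNN()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```

Because each filter is shared across the whole image, the parameter count depends only on the filter sizes and channel widths, not on the image resolution.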
However, CNNs do have limitations. Their reliance on local convolutions makes them less effective at capturing long-range dependencies within an image, so they can struggle with tasks that require global context, such as reasoning about relationships between parts of a scene or recognising objects defined by complex interactions. In addition, designing CNN architectures often requires careful hand-engineering of filter sizes and strides, which can be time-consuming.
VISION TRANSFORMERS: ATTENTION IS ALL YOU NEED IN COMPUTER VISION
In 2017, researchers at Google led by Ashish Vaswani published the paper ‘Attention Is All You Need’, which introduced the transformer architecture and went on to revolutionise AI at large. The original transformer, however, was designed for text-based sequence tasks rather than images. Vision transformers have since adapted the idea to vision and, although not yet as ubiquitous as convolutional neural networks, they have the potential to play a significant role in demanding image tasks, especially image classification.
Unlike CNNs, ViTs do not rely on convolutions. Instead, they divide an image into smaller patches and process them with a stack of transformer encoder layers. The essential building block of a transformer encoder is the self-attention mechanism, which lets each patch interact with every other patch in the image, so the model can represent long-range dependencies and global context. By attending to relevant patches, a ViT can learn correlations between distant parts of the image, resulting in a more complete understanding. A minimal sketch of this pipeline follows.
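The following PyTorch sketch walks through the patch-embedding and self-attention steps just described. The patch size of 16, embedding width of 192, and three attention heads are assumptions chosen for illustration, roughly in line with small ViT configurations.

```python
# A minimal sketch of ViT-style patch embedding followed by self-attention.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # one RGB image
patch, dim = 16, 192

# 1. Split the image into 16x16 patches and project each patch to a vector.
#    A strided convolution does both at once; it is equivalent to flattening
#    each patch and applying the same linear projection.
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(image).flatten(2).transpose(1, 2)   # (1, 196, 192)

# 2. Add learnable positional embeddings so patch order is not lost.
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
tokens = tokens + pos

# 3. Self-attention: every patch attends to every other patch, which is
#    what gives the model its global receptive field.
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=3, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)              # (1, 196, 192) (1, 196, 196)
```

The 196 x 196 attention weights show directly that each of the 196 patches is compared against every other patch in a single layer.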
One of the primary benefits of ViTs is their inherent scalability. Transformer designs are highly modular, so capacity can be increased simply by adding encoder layers, widening the embedding dimension, or using smaller patches. This flexibility can be critical for reaching state-of-the-art performance on large-scale datasets. ViTs have also shown some robustness to adversarial perturbations, the kind of slight input changes that can fool typical CNNs, most likely because the global attention mechanism lets them rely on more than purely local features.
Despite their advantages, ViTs have drawbacks. Compared to CNNs, they often require more training data to reach comparable performance, largely because they lack the built-in spatial inductive biases (locality and translation equivariance) that convolutions provide. Self-attention is also computationally costly, since its cost grows quadratically with the number of patches, which becomes significant for high-resolution images. Finally, the weaker spatial bias can be problematic for tasks that rely heavily on precise localization. The rough calculation below illustrates the scaling issue.
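As a back-of-the-envelope illustration of the quadratic cost, the snippet below counts the entries in the N x N attention matrix as image resolution grows, assuming 16 x 16 patches.

```python
# Attention compares every patch with every other patch, so the attention
# matrix grows with the square of the number of patches.
def attention_matrix_size(image_side: int, patch: int = 16) -> int:
    num_patches = (image_side // patch) ** 2
    return num_patches ** 2          # entries in one N x N attention map

for side in (224, 384, 1024):
    print(side, attention_matrix_size(side))
# 224  ->     38416 entries (196 patches)
# 384  ->    331776 entries (576 patches)
# 1024 ->  16777216 entries (4096 patches)
```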
THE EVOLVING LANDSCAPE: TOWARDS HYBRID ARCHITECTURES
The advent of ViTs has created a thriving research area that probes their capabilities and limitations. Researchers are actively studying methods for improving their efficiency and making better use of spatial structure. One promising avenue is hybrid architectures, which combine the strengths of CNNs and ViTs: such models may use a CNN for initial feature extraction, followed by transformer layers for global context reasoning, potentially achieving high accuracy while remaining computationally efficient (see the sketch below). Research is also underway into adding spatially aware inductive biases to ViTs, for example through learnable positional encodings or transformer designs that capture spatial structure natively.
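A minimal sketch of such a hybrid, with a small convolutional stem feeding a standard PyTorch transformer encoder, might look as follows. All layer sizes and the HybridBackbone name are illustrative assumptions rather than any published design.

```python
# A hybrid sketch: a CNN stem extracts local features cheaply, and a
# transformer encoder then reasons globally over the feature-map tokens.
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    def __init__(self, dim: int = 128, num_classes: int = 10):
        super().__init__()
        # CNN stem: local feature extraction plus aggressive downsampling.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder: global context over the remaining tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.stem(x)                          # (B, dim, H/8, W/8)
        tokens = f.flatten(2).transpose(1, 2)     # (B, N, dim)
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))      # mean-pool tokens, then classify

logits = HybridBackbone()(torch.randn(2, 3, 224, 224))   # -> (2, 10)
```

Because the stem shrinks a 224x224 image to a 28x28 feature map before attention is applied, the transformer operates on far fewer tokens than a pure ViT at the same resolution.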
Another interesting development is the study of convolutional transformers. These models integrate convolutional layers and self-attention mechanisms within the same blocks, aiming to reap the benefits of both approaches. Early results are promising and could lead to a new generation of vision models that excel at both local feature extraction and global context understanding; a sketch of such a block follows.
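One possible shape of such a block, assuming a design that interleaves a depthwise convolution (local mixing) with self-attention (global mixing), is sketched below; published convolutional-transformer variants differ in the details.

```python
# A sketch of a block that combines a local convolutional branch with a
# global self-attention branch over the same feature map.
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, dim: int = 96, heads: int = 3):
        super().__init__()
        # Depthwise 3x3 convolution mixes information locally, per channel.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, dim, H, W)
        b, c, h, w = x.shape
        x = x + self.local(x)                             # local residual branch
        t = x.flatten(2).transpose(1, 2)                  # (B, H*W, dim) tokens
        t = t + self.attn(self.norm(t), self.norm(t), self.norm(t))[0]  # global branch
        return t.transpose(1, 2).reshape(b, c, h, w)

out = ConvAttentionBlock()(torch.randn(1, 96, 14, 14))    # -> (1, 96, 14, 14)
```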
CONCLUSION
The introduction of ViTs represents a fundamental change in the computer vision field. While CNNs will almost certainly remain a cornerstone for specialised tasks, ViTs provide a powerful option for scenarios that require global context and long-range dependencies. Ongoing research on hybrid architectures, spatial biases in ViTs, and convolutional transformers promises to extend the capabilities of vision models further.
As datasets grow and computing resources improve, the distinction between CNNs and ViTs is expected to blur. Ultimately, the choice of model will be determined by the requirements of the task at hand, with both CNNs and ViTs playing critical roles in pushing the limits of computer vision.