Transformer models consistently obtain state-of-the-art results in computer vision tasks, including object detection and video classification. In contrast to standard convolutional approaches that process images pixel by pixel, Vision Transformers (ViT) treat an image as a sequence of patch tokens, where each token corresponds to a small “patch” of the image made up of multiple pixels. At every layer, a ViT model recombines and processes the patch tokens based on relations between each pair of tokens, using multi-head self-attention. In doing so, ViT models can construct a global representation of the entire image.
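To make this token-mixing step concrete, below is a minimal NumPy sketch of one multi-head self-attention layer applied to a sequence of patch tokens. It is illustrative only: the randomly initialized weight matrices stand in for learned parameters, and the function name and dimensions are assumptions rather than the actual ViT implementation.

```python
import numpy as np

def multi_head_self_attention(tokens, num_heads, rng):
    """One multi-head self-attention layer over patch tokens of shape (num_tokens, dim).

    The random projections below stand in for learned weights; this is a sketch,
    not a real ViT layer.
    """
    num_tokens, dim = tokens.shape
    head_dim = dim // num_heads

    # Query, key, value, and output projections (randomly initialized stand-ins).
    w_q, w_k, w_v, w_o = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(4))

    q = (tokens @ w_q).reshape(num_tokens, num_heads, head_dim)
    k = (tokens @ w_k).reshape(num_tokens, num_heads, head_dim)
    v = (tokens @ w_v).reshape(num_tokens, num_heads, head_dim)

    outputs = []
    for h in range(num_heads):
        # Every token attends to every other token: an O(N^2) score matrix.
        scores = q[:, h] @ k[:, h].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
        outputs.append(weights @ v[:, h])                # recombine token values

    return np.concatenate(outputs, axis=-1) @ w_o        # (num_tokens, dim)

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((1024, 768))  # e.g., a 512x512 image as 16x16 patches
out = multi_head_self_attention(patch_tokens, num_heads=12, rng=rng)
print(out.shape)  # (1024, 768)
```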
At the input level, the tokens are formed by uniformly splitting the image into multiple segments, e.g., splitting a 512×512-pixel image into 16×16-pixel patches. At the intermediate levels, the outputs from the previous layer become the tokens for the next layer. In the case of videos, video ‘tubelets’ such as 16×16×2 segments (16×16 image patches spanning 2 frames) become tokens. The quality and quantity of the visual tokens determine the overall quality of the Vision Transformer.
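As a rough illustration of this tokenization step, the sketch below uniformly splits an image into 16×16 patch tokens and a video into 16×16×2 tubelet tokens using NumPy reshapes. The helper names and the zero-valued inputs are assumptions made for the example, not the library's actual API.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an image (H, W, C) into non-overlapping, flattened patch tokens."""
    h, w, c = image.shape
    blocks = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    patches = blocks.transpose(0, 2, 1, 3, 4)                # (H/p, W/p, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * c)  # (num_patches, p*p*C)

def video_to_tubelet_tokens(video, patch_size=16, frames_per_tubelet=2):
    """Split a video (T, H, W, C) into 16x16x2 'tubelet' tokens spanning 2 frames."""
    t, h, w, c = video.shape
    blocks = video.reshape(t // frames_per_tubelet, frames_per_tubelet,
                           h // patch_size, patch_size,
                           w // patch_size, patch_size, c)
    tubelets = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return tubelets.reshape(-1, frames_per_tubelet * patch_size * patch_size * c)

image = np.zeros((512, 512, 3))
print(image_to_patch_tokens(image).shape)    # (1024, 768): 32x32 patches of 16x16x3

video = np.zeros((32, 512, 512, 3))          # a hypothetical 32-frame clip
print(video_to_tubelet_tokens(video).shape)  # (16384, 1536): 16x32x32 tubelets
```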
The main challenge in many Vision Transformer architectures is that they often require too many tokens to obtain reasonable results. Even with 16×16 patch tokenization, for instance, a single 512×512 image corresponds to 1024 tokens. For videos with multiple frames, this results in tens of thousands of tokens that need to be processed at every layer. Considering that Transformer computation increases quadratically with the number of tokens, this can often make Transformers intractable for larger images and longer videos. This leads to the question: is it really necessary to process that many tokens at every layer?
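A quick back-of-the-envelope calculation makes the scaling problem concrete; the 32-frame video length below is an assumption chosen for illustration.

```python
# Token counts and the quadratic number of attention pairs (illustrative only).
def num_image_tokens(height, width, patch=16):
    return (height // patch) * (width // patch)

def num_video_tokens(frames, height, width, patch=16, frames_per_tubelet=2):
    return (frames // frames_per_tubelet) * num_image_tokens(height, width, patch)

image_tokens = num_image_tokens(512, 512)       # 32 * 32 = 1024 tokens
video_tokens = num_video_tokens(32, 512, 512)   # 16 * 1024 = 16384 tokens

# Self-attention compares every pair of tokens, so cost grows quadratically.
print(image_tokens, image_tokens ** 2)  # 1024  -> ~1.0 million pairs per layer
print(video_tokens, video_tokens ** 2)  # 16384 -> ~268 million pairs per layer
```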