Machine Learning Times
Improving Vision Transformer Efficiency and Accuracy by Learning to Tokenize

 
Originally published in Google AI Blog, Dec 7, 2021.

Transformer models consistently obtain state-of-the-art results in computer vision tasks, including object detection and video classification. In contrast to standard convolutional approaches that process images pixel-by-pixel, Vision Transformers (ViT) treat an image as a sequence of patch tokens (i.e., a smaller part, or “patch”, of an image made up of multiple pixels). At every layer, a ViT model recombines and processes the patch tokens based on relations between each pair of tokens, using multi-head self-attention. In doing so, ViT models can construct a global representation of the entire image.
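
As a rough illustration of that pipeline, the sketch below (plain NumPy, not the actual ViT implementation) splits an image into non-overlapping patch tokens and runs a single self-attention step over them; the 16×16 patch size, 64-dimensional projection, single head, and random weight matrices are illustrative stand-ins for ViT’s learned embedding and multi-head attention layers.

```python
import numpy as np

def image_to_patch_tokens(image, patch=16):
    """Split an H x W x C image into non-overlapping patch tokens.

    Each token is the flattened pixel content of one patch; a real ViT
    then projects each token to an embedding with a learned linear layer.
    """
    H, W, C = image.shape
    return (image
            .reshape(H // patch, patch, W // patch, patch, C)
            .transpose(0, 2, 1, 3, 4)            # group pixels by patch position
            .reshape(-1, patch * patch * C))     # one row per patch token

def self_attention(tokens, dim=64, seed=0):
    """One single-head self-attention step: every token attends to every other token."""
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(size=(tokens.shape[1], dim)) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(dim)              # N x N pairwise token relations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # tokens recombined by attention

image = np.random.rand(512, 512, 3)              # toy 512x512 RGB image
tokens = image_to_patch_tokens(image)            # shape (1024, 768)
out = self_attention(tokens)                     # shape (1024, 64)
print(tokens.shape, out.shape)
```

Each output row recombines all 1024 patch tokens weighted by pairwise similarity, which is what gives a ViT its global view of the image.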

At the input level, the tokens are formed by uniformly splitting the image into multiple segments, e.g., splitting an image that is 512 by 512 pixels into patches that are 16 by 16 pixels. At the intermediate levels, the outputs from the previous layer become the tokens for the next layer. In the case of videos, video ‘tubelets’ such as 16×16×2 video segments (16×16 image patches over 2 frames) become tokens. The quantity and quality of the visual tokens largely determine the overall quality of the Vision Transformer.
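
The token counts involved can be computed directly. The helpers below are a minimal sketch of that arithmetic (the 64-frame clip length is an assumption chosen only for illustration):

```python
def image_token_count(height, width, patch=16):
    """Number of non-overlapping patch tokens for a height x width image."""
    return (height // patch) * (width // patch)

def video_token_count(height, width, frames, patch=16, tubelet_frames=2):
    """Number of tubelet tokens for a video: spatial patches x temporal chunks."""
    return image_token_count(height, width, patch) * (frames // tubelet_frames)

print(image_token_count(512, 512))      # 1024 tokens for a single 512x512 image
print(video_token_count(512, 512, 64))  # 32768 tokens for an illustrative 64-frame clip
```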

The main challenge in many Vision Transformer architectures is that they often require too many tokens to obtain reasonable results. Even with 16×16 patch tokenization, for instance, a single 512×512 image corresponds to 1024 tokens. For videos with multiple frames, tens of thousands of tokens need to be processed at every layer. Because Transformer computation increases quadratically with the number of tokens, this can make Transformers intractable for larger images and longer videos. This leads to the question: is it really necessary to process that many tokens at every layer?
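
A back-of-the-envelope sketch of that quadratic cost (the video token count assumes the illustrative 64-frame clip from above):

```python
def pairwise_interactions(num_tokens):
    """Self-attention compares every token with every other: ~N^2 interactions per layer."""
    return num_tokens ** 2

image_tokens = 1024    # 512x512 image, 16x16 patches
video_tokens = 32768   # 64-frame clip, 16x16x2 tubelets (illustrative)
print(pairwise_interactions(image_tokens))  # ~1.05e6 pairwise interactions per layer
print(pairwise_interactions(video_tokens))  # ~1.07e9 -- 1024x the cost for 32x the tokens
```

A 32× increase in tokens thus becomes roughly a 1000× increase in attention cost, which is why reducing the number of tokens, as in the learned tokenization this article describes, pays off so quickly.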

To continue reading this article, click here.
