Video understanding is a challenging problem that requires reasoning about both spatial information (e.g., for objects in a scene, including their locations and relationships) and temporal information about the actions or events shown in a video. There are many video understanding applications and tasks, such as understanding the semantic content of web videos and robot perception. However, current works, such as ViViT and TimeSformer, densely process the video and require significant compute, especially as model size, video length, and resolution increase.
In “Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning”, to be presented at CVPR 2023, we present a simple technique that turns a Vision Transformer (ViT) model image encoder into an efficient video backbone using sparse video tubes (learnable visual representations of samples from the video) to reduce the model’s compute needs. This approach can seamlessly process both images and videos, which allows it to leverage both image and video data sources during training. This training further enables our sparse tube ViT model to coalesce image and video backbones, so that a single model serves as either an image or video backbone (or both), depending on the input. We demonstrate that this model is scalable, can be adapted to large pre-trained ViTs without requiring full fine-tuning, and achieves state-of-the-art results across many video classification benchmarks.
The use of sparse video tubes to sample the video, combined with a standard ViT encoder, yields an efficient visual representation that can be seamlessly shared with image inputs.
Creating a joint image-video backbone
Our sparse tube ViT uses a standard ViT backbone, consisting of transformer layers, that processes the video information. Previous methods, such as ViViT, densely tokenize the video and then apply factorized attention, i.e., the attention weights for each token are computed separately for the temporal and the spatial dimensions. In the standard ViT architecture, self-attention is computed over the whole token sequence. When using videos as input, the token sequence becomes quite long, which can make this computation slow. Instead, in our proposed method, the video is sparsely sampled using video tubes, which are 3D learnable visual representations of various shapes and sizes (described in more detail below). These tubes are used to sparsely sample the video with a large temporal stride, i.e., when a tube kernel is applied, it covers only a few locations in the video, rather than every pixel.
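As a minimal sketch (not the released implementation), a single video tube can be thought of as a 3D convolution whose stride is much larger than its kernel; the embedding dimension, kernel, and stride values below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TubeTokenizer(nn.Module):
    """Turn a video clip into a sparse set of tube tokens.

    A tube is a 3D kernel (time x height x width) applied with a large
    stride, so it only touches a few locations in the video instead of
    densely covering every pixel.
    """

    def __init__(self, embed_dim=768, kernel=(8, 8, 8), stride=(16, 32, 32)):
        super().__init__()
        # A single learnable 3D projection; the stride being larger than
        # the kernel is what makes the sampling sparse.
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=kernel, stride=stride)

    def forward(self, video):
        # video: (batch, 3, frames, height, width)
        feats = self.proj(video)                 # (B, D, t', h', w')
        return feats.flatten(2).transpose(1, 2)  # (B, t'*h'*w', D)


# A 32-frame, 224x224 clip yields only 2*7*7 = 98 tokens with these strides.
clip = torch.randn(1, 3, 32, 224, 224)
print(TubeTokenizer()(clip).shape)  # torch.Size([1, 98, 768])
```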
With sparse video tube sampling, we can use the same global self-attention module, rather than factorized attention like ViViT. We experimentally show that adding factorized attention layers can harm performance due to the uninitialized weights. This single stack of transformer layers in the ViT backbone also enables better weight sharing and improves performance. Sparse video tube sampling is done by using a large spatial and temporal stride that selects tokens on a fixed grid. The large stride reduces the number of tokens in the full network, while still capturing both spatial and temporal information and enabling the efficient processing of all tokens.
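As a rough back-of-the-envelope illustration of how the large stride shrinks the token count that self-attention must process, compare dense tubelet patching with sparse tube sampling on a 64-frame, 224×224 clip; the kernel and stride values are assumptions, not the exact settings from the paper.

```python
def num_tokens(frames, size, kernel, stride):
    """Count the output locations of a 3D kernel swept over a clip (no padding)."""
    t = (frames - kernel[0]) // stride[0] + 1
    h = (size - kernel[1]) // stride[1] + 1
    w = (size - kernel[2]) // stride[2] + 1
    return t * h * w


# Dense tubelet patching: kernel == stride, so every pixel is covered once.
dense = num_tokens(64, 224, kernel=(2, 16, 16), stride=(2, 16, 16))
# Sparse tube sampling: same kernel, but a much larger stride.
sparse = num_tokens(64, 224, kernel=(2, 16, 16), stride=(16, 32, 32))
print(dense, sparse)  # 6272 vs. 196
```

Since the cost of self-attention grows quadratically with sequence length, a reduction of this magnitude makes the encoder dramatically cheaper to run.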
Sparse video tubes
The video tubes are 3D grid-based cuboids that can have different shapes or categories and capture different information, with strides and starting locations that can overlap. In the model, we use three distinct tube shapes that capture: (1) only spatial information (resulting in a set of 2D image patches), (2) long temporal information (over a small spatial area), and (3) both spatial and temporal information equally. Tubes that capture only spatial information can be applied to both image and video inputs. Tubes that capture long temporal information, or both temporal and spatial information equally, are only applied to video inputs. Depending on the input video size, the three tube shapes are applied multiple times to the model to generate tokens.
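The three tube categories could be written down as simple configurations like the sketch below; the kernel and stride values are placeholders rather than the paper's exact settings.

```python
# Illustrative tube configurations (kernel and stride are time x height x width).
# The numbers are placeholders, not the paper's exact settings.
TUBE_CONFIGS = [
    # (1) Spatial only: one frame's worth of 2D patches; usable for images too.
    {"name": "spatial_2d", "kernel": (1, 16, 16), "stride": (32, 16, 16), "video_only": False},
    # (2) Long temporal extent over a small spatial area: video only.
    {"name": "temporal", "kernel": (16, 4, 4), "stride": (16, 32, 32), "video_only": True},
    # (3) Equal spatial and temporal extent: video only.
    {"name": "spatiotemporal", "kernel": (8, 8, 8), "stride": (16, 32, 32), "video_only": True},
]


def tubes_for(input_is_video: bool):
    """Select which tubes to apply: all of them for video, 2D-only for images."""
    return [c for c in TUBE_CONFIGS if input_is_video or not c["video_only"]]
```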
A fixed position embedding, which captures the global location of each tube (including any strides, offsets, etc.) relative to all the other tubes, is applied to the video tubes. Different from previous learned position embeddings, this fixed one better enables sparse, overlapping sampling. Capturing the global location of each tube helps the model know where each one came from, which is especially helpful when tubes overlap or are sampled from distant video locations. The tube features are then concatenated together to form a set of N tokens. These tokens are processed by a standard ViT encoder. Finally, we apply attention pooling to compress all the tokens into a single representation, which is input to a fully connected (FC) layer to give the classification output (e.g., playing soccer, swimming, etc.).
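One plausible way to realize a fixed (non-learned) position embedding is to evaluate sinusoidal functions at each tube's global (t, y, x) center coordinates, so that overlapping or widely separated tubes still receive consistent positions. The sketch below assumes this sinusoidal form and normalized coordinates; it is not the exact implementation.

```python
import torch


def fixed_position_embedding(centers, dim):
    """Sinusoidal embedding of each tube's global (t, y, x) center.

    centers: (num_tokens, 3) tensor of tube centers, normalized to [0, 1].
    dim: embedding dimension, assumed divisible by 6 (sin/cos per axis).
    """
    d = dim // 6  # number of frequencies per (axis, sin/cos) pair
    freqs = torch.pow(10000.0, -torch.arange(d, dtype=torch.float32) / d)  # (d,)
    angles = centers.unsqueeze(-1) * freqs                  # (N, 3, d)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (N, 3, 2d)
    return emb.flatten(1)                                   # (N, dim)


# The embedding is added to the tube features before the ViT encoder, so each
# token carries where (in time and space) it was sampled, regardless of the
# stride used or any overlap between tubes.
```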
Our video ViT model built by applying sparse video tubes enables seamless processing of either image or video inputs. These tubes have different shapes and capture different video features. Tube 1 (yellow) only captures spatial information, resulting in a set of 2D patches that can be applied to image inputs. Tube 2 (red) captures temporal information and some spatial information, and tube 3 (green) equally captures both temporal and spatial information (i.e., the spatial size of the tube in x and y is the same as the number of frames t). Tubes 2 and 3 can only be applied to video inputs. The position embedding is added to all the tube features.
Scaling video ViTs
The process of building video backbones is computationally intensive, but our sparse tube ViT model enables computationally efficient scaling of video models by leveraging previously trained image backbones. Since image backbones can be adapted to a video backbone, large image backbones can be converted into large video backbones. More specifically, one can transfer the learned video feature representations from a small tube ViT to a large pre-trained image ViT and train the resulting model with video data in only a few steps, as opposed to a full training from scratch.
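A hedged sketch of this scaling recipe: copy the tube projections learned by a small tube ViT into a large pre-trained image ViT, keep the large transformer stack, and fine-tune the result briefly on video. The attribute and argument names (`tube_tokenizer`, `video_loader`) are hypothetical.

```python
import copy

import torch


def bootstrap_large_video_vit(small_tube_vit, large_image_vit):
    """Build a large video ViT from a small tube ViT and a pre-trained image ViT.

    `tube_tokenizer` (the learned tube projections) is a hypothetical attribute
    name; the transformer stack of the large image ViT is reused unchanged.
    """
    large_video_vit = copy.deepcopy(large_image_vit)
    # Reuse the tube kernels learned by the small model as the input stage.
    large_video_vit.tube_tokenizer = copy.deepcopy(small_tube_vit.tube_tokenizer)
    return large_video_vit


def finetune_few_steps(model, video_loader, steps=1000, lr=1e-4):
    """Briefly fine-tune on video -- far cheaper than training from scratch."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for step, (clips, labels) in enumerate(video_loader):
        if step >= steps:
            break
        loss = loss_fn(model(clips), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```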
Our approach enables scaling the sparse tube ViT more efficiently. Specifically, the video features from a small video ViT (top network) can be transferred to a large, pre-trained image ViT (bottom network) and further fine-tuned. This requires fewer training steps to achieve strong performance with the large model, which is beneficial as large video models can be prohibitively expensive to train from scratch.
Results
We evaluate our sparse tube ViT approach using the Kinetics-400 (shown below), Kinetics-600, and Kinetics-700 datasets and compare its performance to a long list of prior methods. We find that our approach outperforms all prior methods. Importantly, it outperforms all state-of-the-art methods trained jointly on image+video datasets.
Performance compared to several prior works on the popular Kinetics-400 video dataset. Our sparse tube ViT outperforms state-of-the-art methods.
Furthermore, we test our sparse tube ViT model on the Something-Something V2 dataset, which is commonly used to evaluate more dynamic activities, and report that it outperforms all prior state-of-the-art approaches.
Performance on the Something-Something V2 video dataset.
Visualization of some learned kernels
It is interesting to understand what kinds of rudimentary features the proposed model is learning. We visualize them below, showing both the 2D patches, which are shared for both images and videos, and the video tubes. These visualizations show the 2D or 3D information captured by the projection layer. For example, the 2D patches detect various common features, like edges and colors, while the 3D tubes capture basic shapes and how they may change over time.
Visualizations of patches and tubes learned by the sparse tube ViT model. The top row shows the 2D patches and the remaining two rows show snapshots from the learned video tubes. The tubes show each patch for the 8 or 4 frames to which they are applied.
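One simple way to produce visualizations like these, assuming the tube projection is a strided `Conv3d` as in the earlier sketch, is to normalize the learned kernel weights and plot each temporal slice as a small RGB image:

```python
import matplotlib.pyplot as plt


def show_tube_kernels(conv3d, num_kernels=8):
    """Plot each temporal slice of a few learned 3D tube kernels as RGB images."""
    w = conv3d.weight.detach()                      # (out_channels, 3, t, h, w)
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)  # normalize to [0, 1] for display
    t = w.shape[2]
    fig, axes = plt.subplots(num_kernels, t, figsize=(t, num_kernels), squeeze=False)
    for i in range(num_kernels):
        for j in range(t):
            axes[i][j].imshow(w[i, :, j].permute(1, 2, 0).numpy())  # (h, w, 3)
            axes[i][j].axis("off")
    plt.show()
```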
Conclusions
We have introduced the new sparse tube ViT, which can turn a ViT encoder into an efficient video model and can seamlessly work with both image and video inputs. We also showed that large video encoders can be bootstrapped from small video encoders and image-only ViTs. Our approach outperforms prior methods across several popular video understanding benchmarks. We believe that this simple representation can facilitate much more efficient learning with video inputs, seamlessly incorporate either image or video inputs, and effectively eliminate the bifurcation of image and video models for future multimodal understanding.
Acknowledgements
This work is conducted by AJ Piergiovanni, Weicheng Kuo, and Anelia Angelova, who are now at Google DeepMind. We thank Abhijit Ogale, Luowei Zhou, Claire Cui, and our colleagues in Google Research for their helpful discussions, comments, and support.