
Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Shreyank N Gowda∗1, Anurag Arnab2, and Jonathan Huang2


1 University of Edinburgh
2 Google Research

arXiv:2306.04822v1 [cs.CV] 7 Jun 2023

Abstract
In this paper, we address the challenges posed by the substantial training time
and memory consumption associated with video transformers, focusing on the
ViViT (Video Vision Transformer) model, in particular the Factorised Encoder
version, as our baseline for action recognition tasks. The factorised encoder variant
follows the late-fusion approach adopted by many state-of-the-art methods.
Despite standing out for its favorable speed/accuracy tradeoffs among the different
variants of ViViT, its considerable training time and memory requirements still
pose a significant barrier to entry. Our method is designed to lower this barrier and
is based on the idea of freezing the spatial transformer during training. This leads
to a low-accuracy model if done naively. But we show that by (1) appropriately
initializing the temporal transformer (a module responsible for processing temporal
information) and (2) introducing a compact adapter model connecting frozen spatial
representations (a module that selectively focuses on regions of the input image)
to the temporal transformer, we can enjoy the benefits of freezing the spatial
transformer without sacrificing accuracy. Through extensive experimentation over
6 benchmarks, we demonstrate that our proposed training strategy significantly
reduces training costs (by ∼ 50%) and memory consumption while maintaining or
slightly improving performance by up to 1.79% compared to the baseline model.
Our approach additionally unlocks the capability to utilize larger image transformer
models as our spatial transformer and access more frames with the same memory
consumption. The advancements made in this work have the potential to propel
research in the video understanding domain and provide valuable insights for
researchers and practitioners with limited resources, paving the way for more
efficient and scalable alternatives in the action recognition field.

1 Introduction
Action recognition focuses on understanding and identifying actions in video sequences with applica-
tions in surveillance, human-computer interaction, and video content analysis. The field has advanced
significantly due to large-scale annotated datasets [6] and a shift from hand-crafted features [23, 41]
to deep learning models like convolutional networks (CNNs) [34, 39, 6, 14]. Recently, transformers
have revolutionized computer vision by offering an alternative to traditional CNNs, leading to the
development of many new state-of-the-art architectures [12, 4, 26]. Moreover, the flexibility of
transformers has inspired researchers to adapt these models to more complex problems, including
video understanding and action recognition [1, 27].
Transformers, however, are notoriously expensive, and Video Transformer-based architectures [3, 1,
27], which integrate information across space and time, are even more so. And memory consumption

∗Work done as a Student Researcher at Google.

Preprint. Under review.


[Figure 1: Accuracy vs. training time on Kinetics-400 for ViViT-B, ViViT-L, SFA-ViViT-B and SFA-ViViT-L.]

Figure 1: Comparison of our initialization method vs conventional training of ViViT. Training time
is scaled relative to setting ViViT-B training time to ‘1 unit’ (7.93 hours). We see clear time savings
using our initialization scheme, and for larger models the training time saved is much larger.

and training times become even more significant when working with large-scale video datasets with
long sequences [6]. These high computational costs present a particular challenge for researchers
with limited resources, especially those from universities and smaller companies. The goal of our
work is therefore to cut the cost of training: we want to train transformer-based video models with
fewer resources or use larger model variants and handle more frames with the same resources.
We have chosen the ViViT model [1] as our baseline upon which to improve. Specifically, we focus
on the “factorized encoder” variant of ViViT which has separate spatial and temporal transformer
stages, where the spatial transformer is responsible for extracting features from individual frames,
while the temporal transformer processes the temporal dynamics across frames. We choose this
factorized encoder design because it is more efficient compared to, e.g., the variant of ViViT using
all-to-all spatiotemporal attention, while still achieving high accuracies and has thus been adopted as
the building block for recent state-of-the-art architectures on various tasks [46, 7, 43, 45, 47, 18].
To address the challenge of reducing training time and memory usage without compromising the
sophistication and accuracy of the original model, our approach is based on the simple idea of freezing
the spatial backbone. Freezing the spatial backbone has many advantages: by not backpropagating
through this transformer, training is faster and requires less memory (allowing for the model to
handle more frames). We also inherit the benefits of pretraining the spatial transformer on a large
dataset (such as JFT [19]). We show, however, that this approach, naively implemented, falls far
short in accuracy. Instead, with a few simple (but important) tweaks to the above idea, we propose a
method that has the same advantages of freezing the spatial transformer, but does not compromise on
accuracy.
Our method proceeds in two stages. In the first stage we pretrain a cheap version of the model
using fewer frames, e.g., 8 frames as opposed to, e.g., 32 frames. In the second stage we fine-tune
this model with more frames, which is more expensive, but in this stage we freeze the spatial
encoder and introduce a compact “adapter” model connecting frozen spatial representations to the
temporal transformer, negating the need for end-to-end training of the spatial transformer. Crucially,
this includes pre-training the temporal transformer (by initializing it from Stage 1), a step that is often
overlooked in current video models, which typically initialize this component from scratch. Our
experiments show that this step is essential if we wish to avoid sacrificing performance.
Drawing parallels with curriculum learning [2], our methodology can also be viewed as progressively
training on tasks of increasing complexity, beginning with a ViViT model pre-trained on 8 frames
(our “easy examples”). As we progress, the model effectively handles larger frame counts up to 128
frames (our “difficult examples”). This approach not only sustains the intricacy of the original model

but also significantly reduces resource demands. Thus our approach enables entities with limited
resources to emulate high-performance models using affordable GPUs.
With our training recipe, we match or slightly outperform conventional training of ViViT at roughly
half the cost, as seen in Figure 1. A notable benefit of our training recipe is its ability to process up
to 80 frames on typical university-grade GPUs, a significant leap from the previous capacity of 16
frames. This expansion in processing power broadens the range of video data manageable under
resource-constrained settings. As we elaborate in Section 4.8, our research underscores the potential
to democratize access to advanced video transformer models. Another notable benefit is the ability
to use even larger models as the spatial transformer; we introduce ViViT-g in Section 4.7.
This accessibility paves the way for future video action recognition research, irrespective
of resource constraints. Hereafter, we refer to our version of ViViT as SFA-ViViT, where SFA denotes
’Spatial Frozen and Adapter Initialized’.

2 Related Work
Transformers for Videos Action recognition is a key research area in computer vision, addressed
by many traditional [23, 41] and CNN based approaches [6, 16, 21, 25, 34, 39, 44, 15] aided by
the release of large-scale datasets [6, 22, 35]. Since we focus on transformer-based architectures, a
thorough review of earlier methods is beyond the scope of this paper. More recently, the transformer architecture,
initially developed for NLP tasks [40], has been adapted for video understanding and action recogni-
tion tasks, leading to state-of-the-art models such as TimeSformer [3], ViViT [1], VideoSwin [27],
and Uniformer [24]. These transformer-based models leverage self-attention mechanisms to capture
complex spatiotemporal patterns in action recognition tasks. TimeSformer [3] is one of the first
transformer-based models for video understanding, adapting the transformer architecture to video
by treating it as a sequence of flattened image patches. ViViT [1] integrates spatial and temporal
transformers to efficiently capture spatiotemporal information in video sequences. VideoSwin [27] is
a hierarchical transformer that applies local windowing for efficiency, enabling the model to handle
longer video sequences. VideoBERT [36] is a transformer model that learns joint representations of
video and language in a self-supervised manner and can be fine-tuned for various video understanding
tasks, including action recognition. More recently, Uniformer [24] integrates 3D convolution
and spatiotemporal self-attention, while MTV [46] proposes a multi-view transformer model using dis-
tinct encoders for each video “view”, improving accuracy as the number of views increases. The
Multiscale Vision Transformers (MViT) [13] model streamlines computation and memory usage by
operating at different resolutions, focusing on high-level features at lower resolutions and low-level
details at higher ones, effectively leveraging both spatial and temporal information in visual tasks.
TubeViT [31] introduces a method of sparsely sampling different-sized 3D segments from videos,
facilitating efficient joint image and video learning, and allowing the adaptation of larger models
to videos with less computational resources. These models typically require compute on the order of
TFLOPs and training times of multiple days even on the largest GPUs/TPUs available, making
them infeasible to train or use in lower-resourced settings such as academia. It is critical that we
find a way to train these models with limited resources while maintaining their performance. To this
end, we focus on the factorised encoder version of ViViT, as its late-fusion approach is used as a
foundation for state-of-the-art approaches on various tasks [46, 7, 43, 45, 47, 18], and hence we
believe that the proposed initialization scheme can be used for future methods built on similar
architectures.

Efficient Transformers in Videos Efficiency is a nuanced topic [10], as there are multiple cost
indicators of efficiency (for example, GFLOPs, inference time, training time, memory usage), and
models which improve efficiency in one dimension are not necessarily better in other dimensions [10].
TokenLearner [33] proposes a method that adaptively learns tokens for efficient image and video
understanding tasks, enabling effective modeling of pairwise attention over longer temporal horizons
or spatial content. TokenLearner reduces the GFLOPs required by ViViT by about half, but does
not significantly change the training time or the inference time of ViViT. Spatial Temporal Token
Selection (STTS) [42] proposes a dynamic token selection framework for spatial and temporal
dimensions that ranks token importance using a lightweight scorer network, selecting top-scoring
tokens for downstream evaluation in an end-to-end training process. STTS again reduces the GFLOPs,
but the training time and inference time do not change significantly. TokShift [49] is a zero-parameter,
zero-FLOPs operator that models temporal relations in transformer encoders by temporally shifting
partial token features across adjacent frames, but it again requires the same training time as the original
model. By densely integrating TokShift into a plain 2D vision transformer, a computationally efficient,
convolution-free video transformer is created for video understanding. Most similar to our work is
the ST-Adapter [30], which utilizes built-in spatio-temporal reasoning in a compact design, allowing
pre-trained image models to reason about dynamic video content with a small per-task parameter
cost, surpassing existing methods in both parameter-efficiency and performance. However, it does
not change FLOPs or inference time at all. Unlike ST-Adapter, we use a spatial only adapter which
we show is enough to reproduce the performance of the baseline model at close to half the training
time. In particular, our proposed method improves the training time and training memory usage,
addressing the key problem of researchers and practitioners being able to train video models. It does
not, however, change the inference time compared to a standard ViViT model. We consider overall
train time for the same hyperparameters and use the same hardware for a direct comparison. We
consider efficiency in this paper as the time saved in the overall training of the model.

3 Methodology

3.1 Revisiting ViViT

The Video Vision Transformer (ViViT) extends the Vision Transformer architecture to handle video
data by incorporating spatio-temporal reasoning. The idea behind ViViT is to process video input
as a sequence of image patches, combining spatial and temporal information through a series of
transformer layers, which include multi-head self-attention, layer normalization, and feed-forward
networks. The output is used for video classification.
In the “vanilla” variant of ViViT, one extracts spatio-temporal tokens from a video and then forwards all
tokens through a transformer encoder which explicitly models all pairwise interactions between all
spatio-temporal tokens. We build off of the more efficient “Factorized Encoder” variant of ViViT
whose architecture consists of two separate transformer encoders, a spatial transformer modeling
interactions between tokens from the same temporal index and a temporal transformer modeling
interactions between tokens from different temporal indices. Despite having more parameters, it
requires fewer floating point operations (FLOPs) than vanilla ViViT. Because the Factorised Encoder
variant strikes a good balance between accuracy and processing speed, it has also been adopted
as the foundation for other architectures [46, 7, 43, 45, 47, 18], reinforcing its utility and robustness.

3.2 Our training strategy

We concentrate on the factorised encoder variant of ViViT as it is already the most efficient version
of the baseline. Henceforth, when we talk about ViViT we refer to this variant of ViViT. Consider the
ViViT model that contains a spatial transformer with parameters θspatial and a temporal transformer
with parameters θtemporal :

Xspatial = Tspatial(Xin; θspatial)                    (1)
Xout = Ttemporal(Xspatial; θtemporal).
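
To make the factorised structure concrete, the following is a minimal, self-contained sketch of a Factorised Encoder forward pass. It is an illustration under our own simplifications, not the authors' implementation (which builds on Scenic/JAX): the sizes are illustrative, mean pooling stands in for a CLS token, and positional embeddings are omitted. We use PyTorch purely for brevity.

```python
import torch
import torch.nn as nn


class FactorisedEncoder(nn.Module):
    """Minimal sketch of ViViT's Factorised Encoder (Eq. 1): a spatial
    transformer over patch tokens within each frame, followed by a temporal
    transformer over one pooled representation per frame."""

    def __init__(self, dim=768, heads=12, spatial_layers=12, temporal_layers=4,
                 num_classes=400):
        super().__init__()

        def block():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              dim_feedforward=4 * dim,
                                              batch_first=True)

        self.spatial = nn.TransformerEncoder(block(), num_layers=spatial_layers)
        self.temporal = nn.TransformerEncoder(block(), num_layers=temporal_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, frames, patches, dim) patch embeddings per frame.
        b, t, p, d = tokens.shape
        x = self.spatial(tokens.reshape(b * t, p, d))  # attention within each frame
        frame_repr = x.mean(dim=1).reshape(b, t, d)    # pool patches -> one token per frame
        x = self.temporal(frame_repr)                  # attention across frames
        return self.head(x.mean(dim=1))                # video-level logits


# Tiny example: 2 clips, 8 frames, 16 patch tokens of width 64 (toy sizes).
logits = FactorisedEncoder(dim=64, heads=4, spatial_layers=2,
                           temporal_layers=1, num_classes=10)(torch.randn(2, 8, 16, 64))
```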

In conventional ViViT training, θspatial is initialized from an image pre-trained checkpoint such as
ImageNet-21k [32] or JFT [19], while θtemporal is initialized from scratch. During backpropagation,
the gradient flows through the entire model. This entails training two sizable transformer models
end-to-end, which is a highly resource-intensive process, as the transformer architecture is inherently
computationally demanding, especially with more frames and larger ViViT variants (e.g., ViViT-H).
One approach to reducing training time is to freeze the parameters of the spatial transformer θspatial .
By not backpropagating through θspatial , gradient updates are faster and require less memory,
allowing us to access more frames without encountering out-of-memory issues. But as we show in
experiments, the resulting model with frozen θspatial is not competitive in accuracy
with the baseline training approach.
We present a two stage approach (see Fig. 2) to training ViViT models that inherits the same benefits
of freezing the spatial transformer, while not compromising on model quality.

Figure 2: STAGE 1: We first use the full ViViT-FE model on 8 frames by initializing the spatial
transformer from an image checkpoint and the temporal transformer from scratch. STAGE 2: We
then use this as our checkpoint to initialize the spatial and temporal transformer for models using
more frames (such as 32, 64 or 128). We then freeze the spatial transformer and add an adapter
model to finetune spatial transformer features. The temporal transformer is finetuned from the same
checkpoint.

Stage 1. In Stage 1, we pretrain our ViViT model on a reduced number of frames, initializing the
spatial transformer using a pre-trained image checkpoint. We do not freeze the spatial transformer
during this stage, but critically, Stage 1 serves to also initialize the temporal transformer.
To set the number of frames at this stage, we must balance the goal of efficiency (using fewer frames)
against our finding in experiments that pre-training on too few frames can lead to suboptimal results.
In our ablations, we identify a sweet spot at 8 frames.

Stage 2. In Stage 2, we fine-tune our ViViT model on the full frame count (e.g., 128 frames),
initializing both the spatial and temporal transformer parameters from those learned in Stage 1. Because this stage is
significantly more expensive, we freeze the spatial transformer parameters θspatial and
add a lightweight adapter module with parameters θadapter following the spatial transformer:
Xspatial = Tspatial(Xin; θspatial)
Xadapter = Aadapter(Xspatial; θadapter)               (2)
Xout = Ttemporal(Xadapter; θtemporal)

In this setting, by backpropagating only through the temporal transformer and the lightweight adapter
module (in our experiments, a two-layer MLP), we effectively cut the total training time in half.
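
A sketch of this Stage 2 setup is given below, reusing the module names from the earlier FactorisedEncoder sketch. The names, pooling choices, adapter width and GELU activation are assumptions of ours (the paper only specifies a two-layer fully connected adapter, and its code builds on Scenic/JAX); only the adapter, the temporal transformer and the classification head receive gradients.

```python
import torch
import torch.nn as nn


class Stage2Model(nn.Module):
    """Sketch of Stage 2 (Eq. 2): frozen spatial transformer, a two-layer MLP
    adapter on the per-frame representations, and a trainable temporal
    transformer initialised from the Stage 1 (8-frame) checkpoint."""

    def __init__(self, stage1):
        super().__init__()
        self.spatial = stage1.spatial        # initialised in Stage 1, then frozen
        self.temporal = stage1.temporal      # initialised in Stage 1, fine-tuned here
        self.head = stage1.head
        dim = self.head.in_features
        self.adapter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        for p in self.spatial.parameters():
            p.requires_grad = False          # no backprop through the spatial transformer

    def forward(self, tokens):               # tokens: (batch, frames, patches, dim)
        b, t, p, d = tokens.shape
        with torch.no_grad():                # spatial features are treated as fixed
            x = self.spatial(tokens.reshape(b * t, p, d))
        frame_repr = x.mean(dim=1).reshape(b, t, d)
        frame_repr = self.adapter(frame_repr)  # X_adapter = A_adapter(X_spatial; theta_adapter)
        out = self.temporal(frame_repr)
        return self.head(out.mean(dim=1))


def make_optimizer(model, lr=0.1, momentum=0.9):
    # Only parameters that still require gradients are handed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=momentum)
```

The SGD settings above mirror the optimiser listed in Appendix A; wrapping the frozen spatial pass in a no-gradient context is what saves activation memory in addition to skipping the backward pass.
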
The crucial finding here is that the spatial transformer requires only short-term context for initialization
(after which it remains frozen), whereas the temporal transformer necessitates long-term context
to achieve its optimal performance. Further details and empirical analysis can be found in the next
section.

4 Experimental Analysis
Through a series of comprehensive experiments which we now present, we investigate the significance
of the spatial transformer, examining the impact of pre-training datasets and how larger models
affect action recognition performance. We also explore the importance of initializing the temporal
transformer by employing various initialization schemes and datasets, assessing whether the number

Index  Spatial Frozen  Adapter  Temporal Frozen  Temporal Init  Top-1 Acc  Top-5 Acc  Train Time
-      ×               ×        ×                ×              64.45      87.48      14.17 h
I      ✓               ×        ×                ×              27.75      56.73      0.5x
II     ×               ×        ✓                ×              25.80      53.07      0.5x
III    ✓               ✓        ×                ×              38.77      68.93      0.53x
IV     ✓               ✓        ×                VideoMAE       58.54      85.83      2.51x
V      ✓               ✓        ×                ViViT-8f       63.85      87.62      0.62x
Table 1: Ablation study results illustrating the impact of various modifications to the ViViT-L model,
including spatial and temporal transformer freezing, adapter addition, and initialization methods, on
top-1 and top-5 accuracy. Dataset is Something-something v2.


4.1 Datasets

We evaluate on all the datasets considered in [1] (specifically, Kinetics-400 [6], Kinetics-600 [5],
EPIC-Kitchens [9], Something-something v2 [17] and Moments-in-time [29]) as well as the
Something-Else [28] dataset. As these datasets are common in the community, we include fur-
ther details in the supplementary.

4.2 Implementation Details

We use Scenic [11] for our implementation. Since we build on ViViT, we directly work on top of the
codebase and stick to the default hyperparameters used by ViViT. Full details
of these can be found in the supplementary.
Our adapter is a two-layer fully connected network that takes the output of the spatial transformer as
input; the adapter's output is then passed as input to the temporal transformer.
The hyper-parameters of the transformer models are set to the standard values: the numbers of heads are
12/16/16/16, the numbers of layers are 12/24/32/40, the hidden sizes are 768/1024/1280/1408 and the MLP
dimensions are 3072/4096/5120/6144 for the base/large/huge/giant versions, respectively. The 8-
frame ViViT model is trained for 30 epochs. We also experiment with initializing larger models with
an 8-frame model trained for 10 epochs. Details of this can be found in the supplementary.
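
For reference, the variant hyper-parameters listed above can be summarised as a small configuration mapping (a plain Python dictionary of our own; the values are exactly those stated above):

```python
# Transformer variants used in this paper (heads, layers, hidden size, MLP dimension).
VIVIT_VARIANTS = {
    "base":  dict(heads=12, layers=12, hidden=768,  mlp=3072),
    "large": dict(heads=16, layers=24, hidden=1024, mlp=4096),
    "huge":  dict(heads=16, layers=32, hidden=1280, mlp=5120),
    "giant": dict(heads=16, layers=40, hidden=1408, mlp=6144),
}
```
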
For our hardware, we use 64 v3 TPUs for all experiments. However, we also show results using 8
NVIDIA GeForce 2080 Ti GPUs (12 GB memory each). This is a typical setting in a small academic lab.

4.3 Ablation Study

We first address two critical aspects: the significance of fine-tuning the spatial transformer and the
importance of initializing the temporal transformer. To do so, we conduct a series of experiments
in various scenarios, which are detailed below. Our analysis focuses on the Something-something
dataset, utilizing the large version of the ViViT model, referred to as ViViT-L.
We examine the following modifications to the structure of conventional ViViT training, indexed in
Table 1: I. freezing the spatial transformer (θspatial is initialized and then frozen); II. freezing the
temporal transformer (θtemporal is frozen); III. adding an adapter (a lightweight module with
parameters θadapter); IV. initializing the temporal transformer using VideoMAE [38], while keeping
the spatial transformer frozen and the adapter incorporated; and V. initializing the temporal
transformer from the 8-frame version of the baseline (θspatial and θtemporal are initialized using the
8-frame checkpoint).
It is important to note that the VideoMAE training is an extremely expensive process as can be seen
in the table. But combined with the line below it, these two models, which significantly outperform
lines I, II and III, show that properly initializing the temporal transformer is the critical issue at hand.
Additionally, initializing the spatial transformer yields further improvement. The adapter plays a vital
role in augmenting performance when the spatial transformer is frozen, and due to its lightweight
nature, it will be an essential component of our training methodology moving forward.

Figure 3: The effect of initializing with different numbers of frames (JFT, 2, 4, 8, 16, 32, and 48),
freezing the spatial transformer, adding an adapter model, and fine-tuning using 64, 96, and 128
frames. Results on the Kinetics-400 dataset; ‘f’ refers to frames.

Figure 4: Comparison of our initialization method vs conventional training of ViViT on Top-1
accuracy and loss on the Kinetics-400 dataset using 64 and 128 frames. We see that our initialization
gives a significant head start to the models.

4.4 How many frames should we use for Stage 1?

Next, we experiment with various frame counts for Stage 1 training. We test seven variants: a JFT [19]
checkpoint (image-based) and 2, 4, 8, 16, 32, and 48-frame ViViT checkpoints. We then fine-tune
these with a frozen spatial transformer and an added adapter model using 64, 96, and 128 frames
(see Figure 3). Results show that using too few frames for Stage 1 training can underperform (with
image-only initialization from a JFT [19] checkpoint performing the worst). Thus we deduce that
short-term temporal context is essential for initializing the spatial transformer. Performance also
plateaus after 8 frames, and given that using more frames increases training time, we settle on using 8
frames as our “sweet spot” for Stage 1 training.

Checkpoint  SSv2         K400
K400-init   44.71/74.53  82.81/93.98
SSv2-init   63.85/87.62  76.79/92.35
Table 2: A summary of cross-dataset initialization of the proposed model and performance
comparison. We use Kinetics-400 and Something-something v2 as our datasets.

Backbone  Top-1  Top-5  Steps
ViViT-L   79.64  91.73  48k steps
ViViT-H   81.02  93.09  39k steps
ViViT-g   81.81  94.55  29k steps
Table 3: A comparison of top-1 and top-5 accuracies for the ViViT-g model with the proposed
training strategy, which incorporates a larger spatial transformer backbone. All models use 48 frames
for fair comparison. Results are on the Kinetics-400 dataset.

4.5 Does the proposed training increase convergence speed?

A natural question concerns the impact, if any, of this initialization method on convergence speed.
This matters because faster convergence can drastically reduce the training time and the number of
epochs needed to train the model effectively. Another important factor is the effect of freezing the
spatial transformer: it decreases the memory needed to store the model and also considerably
increases training speed. To provide a clearer picture, we plot the validation curves with and without
initialization in Figure 4. Note that with the proposed initialization, we get a significant head start in
overall accuracy.

Model 4 8 16 32 48 64 96 128
ViViT-H ✓ ✓ ✓ ✓ ✓ ✓ ✓ ×
SFA-ViViT-H ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
ViViT-g ✓ ✓ × × × × × ×
SFA-ViViT-g ✓ ✓ ✓ ✓ ✓ × × ×
Table 4: Memory usage in different ViViT training schemes is compared using the Kinetics400
dataset on 64 TPUs v3 with 16GB memory each. A ✓ indicates accessible frames given hardware
constraints, while a × signals an out-of-memory (OOM) error.

4.6 What about initializing on one dataset and finetuning on other?

In our study, we find training an 8-frame version of the standard ViViT model affordable. We
consider training this on a temporally dependent dataset like Something-something, then fine-tuning
on other datasets like Kinetics-400. We examine two scenarios: first, the standard ViViT model
trained on Kinetics-400 using 8-frames, and second, the same model trained on Something-something.
Post-training, we freeze the spatial transformer, add an adapter, and fine-tune the models on the
alternate dataset with more frames. We contrast this with models fine-tuned on their original datasets
(see Table 2). Results favor initializing larger-frame models from the 8-frame version trained on the same
dataset. Thus, for the final comparison in Sec. 4.9, we initialize models with 8-frame versions of the
baseline.

4.7 Extending the image backbone to 1.5B parameters

An intriguing consequence of our approach is the ability to incorporate larger backbones into the
spatial transformer, made possible by the additional memory available to us as a result of freezing
the spatial transformer during training. Consequently, we introduce ViViT-g, which integrates the
ViT-g model (with 1.5B parameters) as its backbone. To ensure a fair comparison, we focus solely on
training and inference using 48 frames, and abstain from employing multiview or multicrop testing.
Our objective is to investigate the potential impact of a more substantial spatial transformer backbone
on the overall performance and show the potential of larger spatial backbones that are possible due to
our training process.
It is essential to note that the full ViViT-g model could not process more than 8 frames due to memory
limitations. However, our proposed strategy allows processing up to 48 frames. A comparison of
top-1 and top-5 accuracies is presented in Table 3, along with the number of steps needed to reach the
best performance. The dataset used is Kinetics-400, and all ViT checkpoints are JFT-pretrained [19].

4.8 Comparison of Memory Usage with Standard ViViT Training and Proposed Method

In this study, we compare the number of frames that can be accessed using the standard ViViT
training scheme against our proposed scheme, employing a set of 64 v3 TPUs that have 16 GB each.
We further evaluate the performance of the ViViT variants H and g in comparison with SFA-ViViT
using the same variant configurations. Maintaining identical hyperparameters, we ensure
a local batch size of 1.
Our findings indicate that the conventional ViViT training approach restricts frame accessibility to 96
frames for the ViViT-H model, and a mere 8 frames for the ViViT-g model, before reaching memory
limitations.
Conversely, our proposed method enables access to 128 frames for ViViT-H, and up to 48 frames
when utilizing ViViT-g with the same hardware. Furthermore, we investigate the impact of utilizing
university-grade GPUs by conducting ViViT experiments on an NVIDIA GeForce 2080 Ti GPU farm
equipped with 8 GPUs having 12 GB each. Under these circumstances, ViViT can only process
16 frames using a local batch size of 1. However, our proposed training strategy enables a notable
improvement, expanding the frame capacity to 80 frames and helping us reproduce ViViT results on lower
end GPUs. This enhancement provides a valuable opportunity for researchers with limited resources
to attain performance levels comparable to those with extensive resources. We show a comparison of
number of frames accessible with and without our training recipe in Table 4.

Model          Kinetics-400                Kinetics-600                Moments in Time
               Accuracy      Train Time    Accuracy      Train Time    Accuracy      Train Time
ViViT-L        82.59/93.09   1x (21.57 h)  83.29/95.82   1x (26.14 h)  -             -
ViViT-L + SFA  82.78/94.03   0.56x         83.47/95.29   0.56x         -             -
ViViT-H        84.21/94.66   1x (56.71 h)  84.18/95.68   1x (60.45 h)  38.17/62.84   1x (110.79 h)
ViViT-H + SFA  84.42/94.72   0.57x         84.39/96.20   0.57x         39.96/64.39   0.59x
Table 5: Performance comparison of various versions of ViViT with the proposed training strategy
for Kinetics-400, Kinetics-600 and Moments in Time. Accuracies listed as Top-1/Top-5.

Model          Something-something         Something-Else              Epic-Kitchens
               Accuracy      Train Time    Accuracy     Train Time     Accuracy            Train Time
ViViT-L        64.45/87.48   1x (14.17 h)  53.14/73.98  1x (3.84 h)    43.53/56.55/65.40   1x (5.61 h)
ViViT-L + SFA  63.85/87.62   0.62x         53.60/74.47  0.62x          43.54/56.78/65.16   0.63x
Table 6: Performance comparison of ViViT-L with the proposed training strategy for Something-
something v2, Something-Else and Epic-Kitchens. Accuracies listed as Top-1/Top-5; for Epic-
Kitchens, Top-1 noun-verb / Top-1 noun / Top-1 verb.

4.9 Comparison on all benchmarks to the baseline model

In this section, we present a comprehensive comparative analysis, focusing on the proposed approach
and the baseline model. We report the Top-1 accuracy, Top-5 accuracy and the overall training time.
The evaluation is conducted on the large and huge variants of ViViT across three datasets, namely
Kinetics400, Kinetics600, and Moments in Time (MiT), with the summarized results tabulated in
Table 5. The findings indicate a slight enhancement in accuracy for both Kinetics400 and Kinetics600
datasets, whereas a notable 1.79% increase in top-1 accuracy is observed for the MiT dataset using
the proposed method.
Furthermore, the proposed approach showcases a significant reduction in training time, accounting
for approximately 56% of the original duration. This reduction emphasizes the advantageous nature
of the proposed approach. To calculate the total training time for the SFA version, the train time of
the 8 frame (Stage 1) ViViT model is combined with the train time of the (Stage 2) SFA-ViViT model.
For the standard ViViT, the total training time is the end-to-end training time at the same frame count
that SFA-ViViT is trained on, ensuring a fair comparison.
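For example, for ViViT-L on Something-something v2 the baseline run takes 14.17 h (Table 6), so the reported 0.62x for SFA-ViViT-L corresponds to roughly 8.8 h in total, with the Stage 1 (8-frame) run already included in that figure.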
We further examine the performance of ViViT-L incorporating our proposed training strategy in
comparison to the original version on three additional datasets: Something-something, Something-
Else, and Epic-Kitchens. A consistent trend is observed, with the modified approach outperforming
the baseline model at only 62% of the baseline training time. In summary, our proposed
training strategy demonstrates promising potential by yielding comparable or slightly improved
performance across all datasets. This is obtained while maintaining a training cost ranging from 56%
to 62% of the original model, thus highlighting its effectiveness. Results can be seen in Table 6.

5 Limitations

Our research makes considerable progress in reducing training time and memory use for video
transformers, but it raises certain issues. First, training smaller versions of our model on different
datasets is required, adding an initial step. Ideally, a universal model applicable across datasets would
improve efficiency. Our method depends on separate space and time encoders, a feature of the ViViT
model, which might limit its use with integrated space-time models. We base our work on the ViViT
model used in influential models like MTV, highlighting its importance. While we didn’t test our
methods on models like MTV, focusing on ViViT provides beneficial implications for other models.
We hope this inspires future research and encourages further exploration in efficient training of video
transformers.

6 Conclusion

We have investigated the challenges posed by the substantial training time and memory consumption
of video transformers, particularly focusing on the factorised encoder variant of the ViViT model as
our baseline. To address these challenges, we proposed two effective strategies: utilizing a compact
adapter model for fine-tuning image representations instead of end-to-end training of the spatial
transformer, and initializing the temporal transformer using the baseline model trained with 8 frames.
Our proposed training strategy has demonstrated the potential to significantly reduce training costs
and memory consumption while maintaining, or even slightly improving, performance compared
to the baseline model. Furthermore, we observed that with proper initialization, our baseline model
can achieve near-peak performance within the first 10% of training epochs. The advancements made
in this work have the potential to propel research in the video understanding domain by enabling
access to more frames and the utilization of larger image models as the spatial transformer, all while
maintaining the same memory consumption. Our findings provide valuable insights for researchers
and practitioners with limited resources, paving the way for more efficient and scalable alternatives in
the action recognition field. Future work may focus on further optimizing and refining these strategies,
and exploring their application to other video transformer architectures and tasks in the computer
vision domain.

A ViViT hyperparameters

K400 K600 MIT Epic-Kitchens SSv2 Selse


Optimisation
Optimiser Synchronous SGD
Momentum 0.9
Batch size 128
Learning rate schedule cosine with linear warmup
Linear warmup epochs 2.5
Base learning rate 0.1 0.1 0.25 0.5 0.5 0.5
Epochs 30 30 10 50 35 35
Data augmentation
Random crop probability 1.0
Random flip probability 0.5
Scale jitter probability 1.0
Maximum scale 1.33
Minimum scale 0.9
Colour jitter probability 0.8 0.8 0.8 - - -
Rand augment number of layers [8] - - - 2 2 -
Rand augment magnitude [8] - - - 15 20 -
Other regularisation
Stochastic droplayer rate, pdrop [20] - - - 0.2 0.3 -
Label smoothing [37] - - - 0.2 0.3 -
Mixup [48] - - - 0.1 0.3 -
Table 7: The hyperparameters utilized in the experiments conducted for the primary research paper
are detailed here. If a regularisation method is not employed, it is represented by a "–". Constant
values that are present across all columns are mentioned just once. For simplicity, abbreviations
have been used to denote different datasets: Kinetics 400 is represented as K400, Kinetics 600
as K600, Moments in Time as MiT, Epic Kitchens as EK, Something-Something v2 as SSv2 and
Something-Else as Selse.

We have already mentioned the hyperparameters for the various transformer sizes used. In Table 7
we list the hyperparameters used for each dataset. For fair comparison we re-run SFA-ViViT using
the same hyperparameters as ViViT.

B Datasets

As Kinetics consists of YouTube videos which may be removed by their original creators, we note
the exact sizes of our datasets.
Kinetics-400 [6]: Kinetics-400 is a large-scale video dataset with 400 classes introduced by Google’s
DeepMind. It has 235693 training samples and 53744 validation and test samples. The dataset
encompasses various categories, such as object manipulation, human-object interaction, and body
movements. Each class contains approximately 400 video samples, with each video lasting around
10 seconds.
Kinetics-600 [5]: Kinetics-600 is an extension of the Kinetics-400 dataset, with an increased number
of classes, totaling 600 human action classes. This dataset contains approximately 380735 training
samples and 56192 validation and test samples. The additional classes broaden the scope of the
dataset, thereby providing more diverse training data for video recognition tasks.
EPIC Kitchens [9]: EPIC Kitchens is a large-scale dataset focusing on egocentric (first-person) videos
of daily kitchen activities. It consists of 55 hours of video captured by 32 different participants in their
own kitchens, with 67217 training samples and 22758 samples for validation and testing. The dataset
includes 97 verb classes and 300 noun classes. Epic Kitchens is particularly useful for understanding
human-object interactions and fine-grained actions in everyday settings.
Something-something v2 [17]: The Something-something v2 dataset is a collection of short video
clips focused on common objects and human actions. It contains around 168913 training clips and
24777 test clips distributed across 174 action classes. This dataset aims to capture more abstract and
high-level understanding of actions, as well as temporal relationships among objects.
Moments in Time [29]: The Moments in Time dataset is a large-scale video dataset containing one
million short video clips, each lasting three seconds. It covers 339 classes of dynamic events and aims
to provide a diverse set of visual and auditory representations of these events with 791297 training
samples and 33900 test samples. This dataset is particularly useful for understanding the temporal
aspects of various activities and events, as well as their associated contexts.
Something-else [28]: Something-Else utilizes the videos from SomethingSomething-V2 as its founda-
tion, and introduces novel training and testing partitions for two new tasks that examine the ability to
generalize: compositional action recognition and few-shot action recognition. Our attention is solely
on the compositional action recognition task, which aims to prevent any object category overlap
between the 54919 training videos and the 57876 validation videos.

C How important is the pre-training image dataset for action recognition performance?

While we know from the original ViViT paper [1] that using larger ViT [12] backbones results in
better performance, we do a more thorough ablation here by considering variations of the ViT
model such as the hybrid ViT (ResNet-ViT-L pre-trained on ImageNet21k [32]), ViT-L pre-trained
on ImageNet21k, ViT-L pre-trained on JFT and ViT-H pre-trained on JFT. We report these results
in Table 8, with the conclusion that larger backbones pre-trained on larger datasets yield the highest
accuracies. We report top-1 and top-5 accuracies on the Kinetics-400 dataset, and we freeze the spatial
transformer here without any fine-tuning or adapter. We also keep the temporal transformer fixed in
size here for fair comparison. Essentially, the performance difference is purely from the output of the
spatial transformer changing due to different backbones.

D Curriculum Training

We consider variants of the “curriculum” training discussed in the paper. There are various forms
that we can consider. For instance, we can train the standard ViViT 8-frame model for just 10 epochs
and use that to initialize our model. In the paper, all initializations are done using the 8-frame model
trained for 30 epochs. Further, we could initialize smaller versions of SFA-ViViT, such as a 32-frame
version, for 10 epochs and then initialize SFA-ViViT with 128 frames using this 32-frame version. We
plot these variants in Figure 5 and conclude that the best speed-accuracy trade-off is obtained when
the standard ViViT 8-frame model is trained for 30 epochs and the 128-frame model is then
initialized from it.

Backbone 16-frames 32-frames 48-frames
ResNet-ViT-L (ImageNet21k) 66.09/88.30 66.63/88.50 66.88/88.65
ViT-L (ImageNet21k) 65.59/85.86 68.34/87.80 70.09/88.91
ViT-L (JFT) 69.76/88.41 73.98/90.88 75.08/91.69
ViT-H (JFT) 73.68/90.23 75.85/91.53 77.90/92.72
Table 8: Comparison of impact of different backbones for the spatial transformer. We use ResNet-
ViT-L pre-trained on ImageNet21k, ViT-L pre-trained on ImageNet21k, ViT-L pre-trained on JFT
and ViT-H pre-trained on JFT. Listed as (Top-1 accuracy/ Top-5 accuracy).

Figure 5: Stacked bar chart representing the cumulative processing times of Models A-E. Each color
within a bar corresponds to a specific sub-model (’a’ in yellow, ’b’ in blue, ’c’ in green, ’d’ in gray)
contributing to the total computation time of each model. Model accuracies are indicated at the top
of each respective bar. ’a’ = ViViT-L-8f, ’b’ = SFA-ViViT-L-32f, ’c’ = SFA-ViViT-L-128f, ’d’ =
ViViT-L-128f. All results are using the Kinetics-400 dataset and ViViT-L variants.

We define the models in the figure as follows:

• Model A: ViViT-L-8f for 10 epochs + SFA-ViViT-L-32f for 10 epochs + SFA-ViViT-L-128f for 10 epochs
• Model B: ViViT-L-8f for 10 epochs + SFA-ViViT-L-128f for 20 epochs
• Model C: ViViT-L-8f for 10 epochs + SFA-ViViT-L-128f for 30 epochs
• Model D: ViViT-L-8f for 30 epochs + SFA-ViViT-L-128f for 30 epochs
• Model E: ViViT-L-128f for 30 epochs

We see that training the ViViT-L-8f model for the full 30 epochs and then using that to initialize
the SFA-ViViT-L-128f model gave us the best results. But we could potentially reduce the cost of
training to 0.25x if we sacrifice 2% accuracy. All results are on the Kinetics400 dataset.

Model Dataset NPP Epoch Best Epoch
ViViT-L K400 20 29
SFA-ViViT-L K400 5 28
ViViT-L K600 21 28
SFA-ViViT-L K600 5 23
ViViT-L SSv2 29 35
SFA-ViViT-L SSv2 4 24
Table 9: Comparison of near peak performance (NPP) epoch and best performance epoch for ViViT
and SFA-ViViT for different datasets and models.

Model Dataset NPP Epoch Best Epoch


ViViT-L K400 20 29
ViViT-L init with 8f ViViT-L K400 4 25
ViViT-H K400 22 27
ViViT-H init with 8f ViViT-H K400 5 22
Table 10: Comparison of near peak performance (NPP) epoch and best performance epoch for
initializing the full ViViT model with and without the 8f variant. We see the benefit of initialization
as the “near-peak” performance is reached at a much earlier stage when initialized with the 8f variant.
All results are on Kinetics400 dataset.

E How long do we need to train the model?


We showed in the paper that using SFA-based initialization helps us reach “near-peak” performance
very quickly. We define this near-peak performance as 1% less than the eventual best performance
of the model. Thus another natural question is: in order to save time, why not stop training the SFA
version earlier? We note that although the standard ViViT model trains for ‘x’ epochs (see Table 7 for
the exact number), it often reaches this “peak” performance much earlier; hence, for a fair comparison
with the standard ViViT model, in the paper we run for the same number of epochs. These results
can be seen in Table 9.
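
As a concrete illustration, the NPP epoch can be read off a validation curve with a small helper like the following (a sketch of our own; the function name and 1-indexed epochs are conventions introduced for this illustration, not taken from the paper):

```python
def near_peak_epoch(val_acc, margin=1.0):
    """Return the first (1-indexed) epoch whose validation accuracy is within
    `margin` percentage points of the eventual best accuracy: the paper's
    definition of near-peak performance (NPP)."""
    best = max(val_acc)
    for epoch, acc in enumerate(val_acc, start=1):
        if acc >= best - margin:
            return epoch
    return len(val_acc)  # unreachable for non-empty input, kept for clarity
```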

F What about initializing standard ViViT models?


Since our method proposes an initialization scheme, we also test it on the standard ViViT models
that do not have their spatial transformer frozen. In this particular scenario, we only want to check if
the peak performance can be reached faster. However, it is important to note that with our proposed
training scheme we also reduce the overall training time by close to half. This can be seen in Table 10.

References
[1] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid. Vivit: A video vision transformer.
In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th
annual international conference on machine learning, pages 41–48, 2009.
[3] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In
ICML, volume 2, page 4, 2021.
[4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection
with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer, 2020.
[5] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600.
arXiv preprint arXiv:1808.01340, 2018.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In
proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308,
2017.

[7] J. Chen and C. M. Ho. Mm-vit: Multi-modal video transformer for compressed video action recognition.
In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1910–1921,
2022.
[8] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation
with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition workshops, pages 702–703, 2020.
[9] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro,
T. Perrett, W. Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 720–736, 2018.
[10] M. Dehghani, A. Arnab, L. Beyer, A. Vaswani, and Y. Tay. The efficiency misnomer. arXiv preprint
arXiv:2110.12894, 2021.
[11] M. Dehghani, A. Gritsenko, A. Arnab, M. Minderer, and Y. Tay. Scenic: A jax library for computer
vision research and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 21393–21398, 2022.
[12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Min-
derer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at
scale. arXiv preprint arXiv:2010.11929, 2020.
[13] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision
transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages
6824–6835, 2021.
[14] S. N. Gowda. Human activity recognition using combinatorial deep belief networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition workshops, pages 1–6, 2017.
[15] S. N. Gowda, M. Rohrbach, F. Keller, and L. Sevilla-Lara. Learn2augment: Learning to composite videos
for data augmentation in action recognition. In Computer Vision–ECCV 2022: 17th European Conference,
Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXI, pages 242–259. Springer, 2022.
[16] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara. Smart frame selection for action recognition. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1451–1459, 2021.
[17] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend,
P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and
evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision,
pages 5842–5850, 2017.
[18] A. Gritsenko, X. Xiong, J. Djolonga, M. Dehghani, C. Sun, M. Lučić, C. Schmid, and A. Arnab. End-to-end
spatio-temporal action localisation with video transformers. arXiv preprint arXiv:2304.12160, 2023.
[19] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531, 2015.
[20] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14,
2016, Proceedings, Part IV 14, pages 646–661. Springer, 2016.
[21] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE
transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012.
[22] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human
motion recognition. In 2011 International conference on computer vision, pages 2556–2563. IEEE, 2011.
[23] I. Laptev. On space-time interest points. International journal of computer vision, 64:107–123, 2005.
[24] K. Li, Y. Wang, P. Gao, G. Song, Y. Liu, H. Li, and Y. Qiao. Uniformer: Unified transformer for efficient
spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
[25] J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings
of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019.
[26] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision
transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 10012–10022, 2021.

[27] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu. Video swin transformer. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, pages 3202–3211, 2022.

[28] J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell. Something-else: Compositional action
recognition with spatial-temporal interaction networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 1049–1059, 2020.

[29] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfreund,
C. Vondrick, et al. Moments in time dataset: one million videos for event understanding. IEEE transactions
on pattern analysis and machine intelligence, 42(2):502–508, 2019.

[30] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li. St-adapter: Parameter-efficient image-to-video transfer learning.
Advances in Neural Information Processing Systems, 35:26462–26477, 2022.

[31] A. Piergiovanni, W. Kuo, and A. Angelova. Rethinking video vits: Sparse video tubes for joint image and
video learning. arXiv preprint arXiv:2212.03229, 2022.

[32] T. Ridnik, E. Ben-Baruch, A. Noy, and L. Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv
preprint arXiv:2104.10972, 2021.

[33] M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova. Tokenlearner: Adaptive space-time
tokenization for videos. Advances in Neural Information Processing Systems, 34:12786–12797, 2021.

[34] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos.
Advances in neural information processing systems, 27, 2014.

[35] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the
wild. arXiv preprint arXiv:1212.0402, 2012.

[36] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. Videobert: A joint model for video and
language representation learning. In Proceedings of the IEEE/CVF international conference on computer
vision, pages 7464–7473, 2019.

[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for
computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
2818–2826, 2016.

[38] Z. Tong, Y. Song, J. Wang, and L. Wang. Videomae: Masked autoencoders are data-efficient learners for
self-supervised video pre-training. arXiv preprint arXiv:2203.12602, 2022.

[39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d
convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages
4489–4497, 2015.

[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in neural information processing systems, 30, 2017.

[41] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for
action recognition. International journal of computer vision, 103:60–79, 2013.

[42] J. Wang, X. Yang, H. Li, L. Liu, Z. Wu, and Y.-G. Jiang. Efficient video transformers with spatial-temporal
token selection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October
23–27, 2022, Proceedings, Part XXXV, pages 69–86. Springer, 2022.

[43] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang. Git: A generative image-to-text
transformer for vision and language. Transactions of Machine Learning Research.

[44] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks:
Towards good practices for deep action recognition. In European conference on computer vision, pages
20–36. Springer, 2016.

[45] X. Xiong, A. Arnab, A. Nagrani, and C. Schmid. M&m mix: A multimodal multiview transformer
ensemble. arXiv preprint arXiv:2206.09852, 2022.

[46] S. Yan, X. Xiong, A. Arnab, Z. Lu, M. Zhang, C. Sun, and C. Schmid. Multiview transformers for video
recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 3333–3343, 2022.

[47] A. Yang, A. Nagrani, P. H. Seo, A. Miech, J. Pont-Tuset, I. Laptev, J. Sivic, and C. Schmid. Vid2seq:
Large-scale pretraining of a visual language model for dense video captioning. In CVPR 2023-IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2023.

[48] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv
preprint arXiv:1710.09412, 2017.

[49] H. Zhang, Y. Hao, and C.-W. Ngo. Token shift transformer for video classification. In Proceedings of the
29th ACM International Conference on Multimedia, pages 917–925, 2021.

