
A Transformer-based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos

Jannik Koch1 Stefan Wolf2,1 Jürgen Beyerer1,2,3


1 Fraunhofer IOSB, Karlsruhe, Germany
2 Vision and Future Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany
3 Fraunhofer Center for Machine Learning, Munich, Germany
firstname.lastname@iosb.fraunhofer.de

Abstract

Fine-grained image classification is limited by considering only a single view, while in many cases, like surveillance, a whole video exists which provides multiple perspectives. However, the potential of videos is mostly considered in the context of action recognition, while fine-grained object recognition is rarely considered as an application for video classification. This leads to recent video classification architectures being inappropriate for the task of fine-grained object recognition. We propose a novel, Transformer-based late-fusion mechanism for fine-grained video classification. Our approach achieves superior results to both early-fusion mechanisms, like the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. Additionally, we achieve improved efficiency, as our results show a high increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.

Figure 1: Example images from a video [12] which is part of the YouTube-Cars [35] dataset. Besides the advantage of multiple views, video classification enables the compensation of inappropriate images like the lower left one, which shows drastic motion blur.

1. Introduction

Fine-grained classification is an important task in the context of surveillance since the identification of vehicles by licence plate is limited due to criminals often using stolen licence plates. Thus, fine-grained vehicle classification can be applied in a security context to identify vehicles by their make and model when an identification by licence plate fails. In real-world surveillance scenarios, a single image limits fine-grained vehicle classification since motion blur can render important classification features unrecognisable. This can be compensated by using videos for classification, which are typically available anyway. Additionally, multiple views from different cameras can be exploited to increase the accuracy by using a video instead of a single image.

Video data has been successfully used as a source for various classification tasks. The most common example of this is the field of action recognition, where videos are advantageous because they provide important temporal information. However, the potential of using a frame sequence as opposed to a single image is not limited to the additional temporal component. As time progresses, a video usually provides different views of the objects in the scene, yielding additional features that can be exploited for classification tasks other than action recognition. Fine-grained object recognition is such a task that is likely to profit heavily from multiple views. But the availability of multiple frames is also helpful to compensate for images inappropriate for classification because of, e.g., blur. Nonetheless, only a few works consider video classification for typical fine-grained object recognition tasks like vehicle model classification or bird species classification [1, 10, 24, 35]. This leads to state-of-the-art video classification models being tailored towards action recognition and ignoring the unique challenges of fine-grained object recognition when processing videos.
Figure 2: Comparing the computational complexity of our TLF architecture to Swin Transformer [17] models of different size with simple average fusion and the state-of-the-art video classification models Video Swin Transformer [18] and TimeSformer [3]. It shows the high efficiency of our approach. (Plot: top-1 accuracy over GFLOPs for TLF (Ours), TimeSformer, Swin-Large, Swin-Base and VideoSwin-Base.)

The goal of fine-grained classification is to successfully differentiate a set of highly detailed classes. For example, in fine-grained vehicle classification, Audi A5 Coupe 2012 could be such a class. In contrast, in regular coarse-grained classification, cars usually share a single class and have to be distinguished from, e.g., humans. Due to this degree of class specificity in fine-grained classification, two classes might share the vast majority of features, leading to minor differences being the deciding factor.

Video data can increase the number of visible differences and thus increase the classification accuracy. A major part of using video data is the fusion mechanism used to combine the input frames. Most state-of-the-art video classification architectures like the Video Swin Transformer [18] use an early-fusion approach which interrelates frames as they pass through the backbone, as this proved to be advantageous for action recognition. Earlier approaches, like Temporal Segment Networks [31], use a simple late-fusion consensus mechanism. We pick the concept of late-fusion up again and combine it with a modern self-attention-based Transformer [30]. This results in our Transformer-based late-fusion mechanism (TLF).

In the following sections, we demonstrate superior results to both state-of-the-art early-fusion and strong baseline late-fusion models by applying our more sophisticated late-fusion approach. We achieve an improvement in accuracy without a significant increase in computational overhead, unlike the improvements resulting from a larger backbone network, as can be seen in Figure 2.

Our main contributions are:

• proposing a sophisticated Transformer-based late-fusion mechanism that efficiently aggregates the features of multiple input images of a video to enhance the exploitation of the different views, resulting in a significantly higher accuracy.

• showing the advantage of a sparse sampling strategy for fine-grained object recognition while this strategy is mostly considered outdated in video classification research.

• proving the effectiveness of video classification for fine-grained object recognition compared to single image classification.

2. Related work

In this section, we first discuss the existing literature in terms of video classification. Since most research in video classification is targeted towards action recognition, we summarize the literature focused on fine-grained video classification separately afterwards.

2.1. Video classification

Video classification is mostly researched in the context of action recognition, which would be heavily limited by using single images. Thus, multiple datasets have been published for video-based action recognition [4, 9, 11, 14, 25, 27] and a large number of approaches have been proposed to optimize the accuracy on these datasets.

Two-stream architectures. As the temporal data can be conceptualized as another separate stream of information, two-stream architectures have been introduced as a viable option. Two-stream ConvNets [26] use both the RGB and optical flow of the input frames for separate classification tasks, the results of which are merged by a simple consensus mechanism. This late-fusion allows for a clean separation of two different backbone networks to extract the relevant features, but does not interrelate spatial and temporal information for the most part. If both domains are to be taken into account simultaneously, the backbone needs to directly work on a three-dimensional input.

Temporal Segment Networks [31] extend the two-stream ConvNets by employing a sparse sampling strategy that extracts multiple snippets of a video to acquire more information, with each snippet containing a single RGB frame and a stack of optical flows.

3D convolutions. Since 2D convolutions are a common building block in classification architectures, convolutional architectures extending this concept along the temporal axis, like C3D [29], have followed accordingly. 3D convolutions and two-stream architectures are not mutually exclusive, leading to two-stream convolutional approaches like I3D [4].
Figure 3: Schematic illustration of our Transformer-based late-fusion approach for fine-grained video classification. (Pipeline: sparse random sampling selects one frame per segment; each frame is passed through a Swin-Base backbone and pooled into a feature vector; the feature vectors are fused by a Transformer encoder consisting of multi-head attention, add & norm, feed-forward and add & norm blocks; a feature average consensus followed by the classification layer produces the video-level prediction.)

Temporal Transformer. The recent trend of using self-attention for image recognition has also been picked up for video classification. These architectures use the Transformer mechanism to process sequences of images. ViViT [2] extends the ViT [8] architecture to enable the processing of image sequences. The authors propose multiple model variants including early-fusion and late-fusion methods. Neimark et al. [21] also propose a late-fusion Transformer architecture. However, early-fusion approaches have prevailed due to being advantageous for action recognition, as shown by Arnab et al. [2]. While these late-fusion architectures are the most similar ones compared to our Transformer-based late-fusion mechanism, we show that fine-grained video classification has drastically different requirements leading to different design decisions. Video Swin Transformer [18] continues the trend of early-fusion architectures with the extension of the shifted windows mechanism of Swin Transformer [17] to the temporal dimension. The shifted windows reduce the computational complexity while ensuring inter-token information sharing. Bertasius et al. [3] propose TimeSformer as another Transformer-based video classification architecture that divides the spatial and temporal attention to reduce complexity and increase accuracy, while using an early-fusion approach due to being tailored towards action recognition.

2.2. Fine-grained classification

This section is divided into fine-grained classification based on images and videos. For the first, a large set of datasets [13, 28, 33] has been published, motivating a variety of algorithmic approaches. In contrast, the field of fine-grained video classification is rather small. Most of the fine-grained video classification datasets and research works are about fine-grained action recognition [7, 14, 16, 23, 25], while only few datasets [1, 10, 24, 35] exist for our application of fine-grained object recognition in videos, limiting the research progress.

Fine-grained image classification. While fine-grained classification on single images can technically be realized using conventional image classification methods, specialized architectures have emerged to improve their results [15, 22, 34]. Initial models focused on the explicit identification of parts to distinguish the various classes, but with the advent of deep learning approaches, identifying relevant features became both implicit and learnable.

Fine-grained video classification. In the context of fine-grained video classification, the research is significantly more narrow. Alsahafi et al. [1] use object detection to localize vehicles in videos and extract the relevant parts of the images. Afterwards, an imagewise CNN and a simple fusion mechanism are used for classification. Redundancy Reduction Attention [35] uses spatial and temporal attention to suppress redundant information in a video in an iterative manner.

3. Method

Our method is based on three parts, which are illustrated in Figure 3. First, we extract frames from the input video by a sparse sampling strategy to cover the full range of the video. Afterwards, we extract features of the frames with a modern Swin Transformer [17] backbone. As the last step, we apply a sophisticated Transformer-based late-fusion mechanism to derive a fine-grained classification score for each input video. Since fusion mechanisms implemented in the backbone might fail to find meaningful feature relationships early on, we postpone this operation to the classification head.
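To make the three stages concrete, the following Python sketch is our own illustration of the pipeline in Figure 3, not code from the released repository; the callables and their names are placeholders.

```python
from typing import Callable, Sequence
import torch

def classify_video(frames: Sequence[torch.Tensor],
                   sample: Callable[[int], Sequence[int]],
                   backbone: Callable[[torch.Tensor], torch.Tensor],
                   late_fusion_head: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    """Three stages: sparse sampling, per-frame feature extraction, Transformer-based late fusion."""
    indices = sample(len(frames))                                 # 1) sparse random sampling
    feats = torch.stack([backbone(frames[i]) for i in indices])   # 2) one feature vector per frame
    return late_fusion_head(feats.unsqueeze(0))                   # 3) late fusion + classification
```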

Late-fusion approaches like this usually rely on simple consensus mechanisms like averaging feature vectors, followed by a fully-connected, final classification layer. Temporal Segment Networks use this technique with great success [31], but have limited applicability in fine-grained classification due to their lack of attention across frame boundaries. Our approach prepends a Transformer encoder [30] to the consensus mechanism, which applies self-attention across all frames simultaneously, emphasizing important features. This self-attention provides an additional pathway to improving model accuracy in fine-grained classification, as correctly distinguishing closely-related classes might depend on very few features.

Sampling. In the first step of our approach, we apply a sparse sampling strategy that splits the video into a specific number of segments and selects a random frame from each segment. The set of frames is called N. Each resulting frame is augmented as described in Section 4. While the number of images is multiplied during inference due to augmentation, all preprocessed images are handled as a single set of input images S.

Sparse sampling is rarely used for video classification due to dense sampling being advantageous for action recognition, which requires the extraction of short-term context. While Wang et al. [31] propose the sparse sampling strategy for action recognition by sampling the RGB frames sparsely across the video, they still employ some form of dense sampling by using stacks of short-term optical flows in a two-stream architecture. However, fine-grained object recognition profits from using videos in a different way and thus, the widely applied dense sampling is inferior to a sparse sampling strategy, as we show in our experiments.

Feature extraction. Each input image is fed separately into the Swin Transformer [17] backbone. The result is a feature map with the shape of 1024x7x7 for each augmented image. Afterwards, each feature map is reduced to a feature vector of shape 1024x1x1 via adaptive average pooling. The total set of feature vectors of all augmentations from all frames is called E.
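A minimal sketch of this pooling step, assuming the usual PyTorch tensor layout (batch, channels, height, width); the concrete tensor is only for illustration.

```python
import torch
from torch import nn

pool = nn.AdaptiveAvgPool2d(1)                  # reduces 1024x7x7 to 1024x1x1
feature_map = torch.randn(1, 1024, 7, 7)        # backbone output for one augmented frame
feature_vector = pool(feature_map).flatten(1)   # shape (1, 1024), one element of the set E
```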
Self-attention. The self-attention mechanism central to our late-fusion mechanism is part of the Transformer architecture [30]. Specifically, self-attention is achieved via the multi-head attention block. Usually, this block receives different query Q, key K and value V matrices. In the case of self-attention, all these matrices are identical, meaning they will equally be set to the input features.

Each head learns its own set of linear transformations that are applied to the input data, followed by a scaled dot-product attention mechanism. All head outputs are eventually concatenated and linearly transformed to the desired output dimension. The scaled dot-product attention is implemented by multiplying Q and K and scaling the result in relation to their dimensionality, followed by a softmax. Scaling is required to prevent the dot-product from growing too large in magnitude.

The resulting matrix scores the importance of each input element with a value between 0 and 1. This attention matrix is then applied to V via matrix multiplication. Overall, attention can be expressed concisely via Equation 1.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (1)
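As a sketch, Equation 1 translates directly into a few lines of PyTorch (our own illustration; inside the model, the multi-head attention block of the Transformer encoder performs this computation per head):

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Equation 1: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # scaled dot product
    weights = torch.softmax(scores, dim=-1)            # attention weights between 0 and 1
    return weights @ v                                  # weighted combination of the values
```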
Once the multi-head attention is defined, building a full Transformer encoder only requires two additional components. First, a residual connection has to be inserted, followed by an addition and normalization step that combines the residual data with the results. Second, a feed-forward network is appended to reduce the output dimension to the desired shape.

Transformer-based late-fusion. The feature vectors E are all passed through the Transformer encoder simultaneously. For our architecture, we use 8 heads within the multi-head attention model and 1024 for both the input features and the dimension of the feed-forward network. Hence, the resulting feature vectors still have the shape 1024x1x1. These feature vectors are averaged to provide a single 1024x1x1 feature vector to the classification stage.

Compared to a simple average fusion without the Transformer encoder, our Transformer-based late-fusion mechanism enables a more sophisticated aggregation. It can represent interdependencies between the views of images that cannot be represented by a simple linear aggregation of the features.

Classification. Once the consensus has been applied, the resulting feature vector is fed into a dropout and a final fully-connected layer to classify the video. Afterwards, a softmax is applied to normalize the output scores.
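The following PyTorch module is a minimal sketch of this late-fusion head under stated assumptions: the paper specifies 8 heads and a model/feed-forward width of 1024, but not the number of encoder layers or the dropout rate, so those values are placeholders, and no positional encoding is added (see Section 4.4).

```python
import torch
from torch import nn

class TransformerLateFusionHead(nn.Module):
    """Sketch of the TLF head: Transformer encoder over per-frame features (no positional
    encoding), average consensus, dropout and a final fully-connected classification layer."""

    def __init__(self, num_classes: int, dim: int = 1024, heads: int = 8,
                 num_layers: int = 1, dropout: float = 0.5):  # layer count and dropout rate are assumptions
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_frames, dim), the pooled backbone features of all sampled frames
        fused = self.encoder(features)            # self-attention across all frames of a video
        consensus = fused.mean(dim=1)             # simple average consensus over the fused features
        return self.fc(self.dropout(consensus))   # class scores; a softmax is applied afterwards
```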
4. Experiments

In the following paragraphs, the implementation, results and evaluation of our experiments will be discussed.

4.1. Settings

Optimization. The AdamW [20] optimizer with a Cosine Annealing [19] policy was consistently used during training. The learning rate was also kept consistent at a base value of 10^-4 when combined with a batch size of 8. Due to VRAM limitations, this batch size had to be halved to 4 in some experiments. In these instances, the learning rate was also halved accordingly to 5 · 10^-5. All experiments use a weight decay of 10^-2. These values are based on the defaults used in mmaction2 [6].
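A hedged sketch of this setup in plain PyTorch (the paper trains with mmaction2; the epoch count below is a placeholder, not a value reported in the paper):

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model: nn.Module, batch_size: int = 8, max_epochs: int = 30):
    """AdamW with cosine annealing as in Section 4.1; max_epochs is an assumed placeholder."""
    lr = 1e-4 if batch_size == 8 else 5e-5       # learning rate halved together with the batch size
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    scheduler = CosineAnnealingLR(optimizer, T_max=max_epochs)
    return optimizer, scheduler
```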

Video decoding. The Decord [5] video loader used by mmaction2 provides two different modes of operation: efficient and accurate. Choosing the efficient mode reduces the time it takes to extract random frame samples, as Decord then utilizes a fast, inexact random seek algorithm that only returns Intra-Frames (or I-Frames). The drawback is the possibility of receiving the same frame twice when two samples are sufficiently close to each other. We chose to employ the efficient mode only during training, if samples are drawn in a sparse fashion. For dense sampling and during testing, accurate sampling was used. This yielded a compromise of lowered training times, allowing possibly duplicated frames, while preventing frame duplications during evaluations.

Sampling. Processing all images in a video is inappropriate for classification due to limited compute resources and redundancy of consecutive frames. Thus, a sampling strategy has to be applied to select a number of images that can be realistically handled and that are most appropriate for an accurate classification. In video classification, sparse and dense sampling are the common sampling strategies. In sparse sampling, the video is divided into k parts and from each part, a single image is selected randomly. In dense sampling, the video is also divided into k parts. However, from each part, a contiguous sequence of images with length l and stride s is chosen randomly. For experiments of dense sampling, we report the parameters as l × s × k.
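The two strategies can be summarized in a short sketch (our own illustration with hypothetical function names, not code from the repository):

```python
import random

def sparse_sample(num_frames: int, k: int) -> list:
    """Sparse sampling: split the video into k parts and pick one random frame per part."""
    edges = [round(i * num_frames / k) for i in range(k + 1)]
    return [random.randrange(lo, max(lo + 1, hi)) for lo, hi in zip(edges[:-1], edges[1:])]

def dense_sample(num_frames: int, l: int, s: int, k: int) -> list:
    """Dense sampling (reported as l x s x k): from each of k parts, take a contiguous
    clip of l frames with stride s, starting at a random position."""
    span = (l - 1) * s + 1
    edges = [round(i * num_frames / k) for i in range(k + 1)]
    indices = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        start = random.randrange(lo, max(lo + 1, hi - span + 1))
        indices.extend(min(start + j * s, num_frames - 1) for j in range(l))
    return indices
```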
Augmentations. During training, a random horizontal flip is used per video and each sampled input image is augmented with a random crop. Afterwards, the frame is resized to 224x224 pixels. Each random crop has a random position in the frame, with all possible positions being equally probable. The dimension is calculated based on a given aspect ratio and the total area covered, both of which are chosen randomly within a given interval. In our case, the aspect ratio is chosen randomly in the interval [3/4, 4/3] and the area in [0.08, 1], with the area being interpreted as a percentage of the total frame size. For testing, each frame is cropped 5 times: once in each corner and once in the center of the frame. This time, all crops have a static width and height of 224 pixels. Additionally, each augmentation is duplicated and flipped along the vertical axis, yielding 10 augmentations in total. Once an input frame has been augmented and resized to the input dimension of 224x224 pixels, the results are passed into the backbone network.

For Video Swin Transformer [18] and TimeSformer [3], we use a three-crop strategy as intended by their authors. However, preliminary experiments have shown that the differences between three crop and ten crop are negligible.
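In torchvision terms, the training and testing augmentations roughly correspond to the following sketch (an approximation for illustration, not the exact mmaction2 pipeline configuration):

```python
from torchvision import transforms

# Training: a random crop with the stated aspect-ratio and area ranges, resized to 224x224.
# The random horizontal flip is decided once per video and applied to all of its frames.
train_crop = transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3))

# Testing: five 224x224 crops (four corners + center), each duplicated with a horizontal
# flip, yielding ten augmentations per frame.
test_crops = transforms.TenCrop(224)
```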
4.2. Dataset

The YouTube-Cars dataset [35] is used to evaluate our architecture, since it provides video data with fine-grained labels. Additionally, the YouTube-Birds dataset provided by the same authors is used for additional validation of the model's efficacy. Experiments are done on the YouTube-Cars dataset if not mentioned otherwise. YouTube-Cars provides video data for 196 classes and YouTube-Birds for 200 classes, with the class selection being identical to Stanford Cars [13] and CUB-200-2011 [32], respectively. The full car dataset contains 10,238 videos for training and 4,855 videos for testing purposes, while the bird dataset provides 12,666 training and 5,684 testing videos.

As YouTube is an inherently unreliable data source, video availability is never guaranteed. Thus, some of the videos of the datasets were not available anymore when we fetched the dataset. Hence, any comparison of our work to the results in the original paper is limited in its validity and should be considered tentative. However, both datasets were kept consistent during our experiments. While some videos were not available anymore, most of the data could still be fetched and no class had to be removed due to a lack of footage.

4.3. Comparison with state-of-the-art

To prove the effectiveness of TLF, we compare our approach against Swin Transformer [17] with a simple feature average consensus as a strong baseline model and the state-of-the-art video classification models Video Swin Transformer [18] and TimeSformer [3]. While we also include published results on the YouTube-Cars dataset [35], the results are not directly comparable due to some videos not being available anymore, as described in Section 4.2. We used the best of all evaluated sampling strategies and numbers of samples for each model. The impact of the sampling strategy is described in Section 4.4. The results of the comparison with the state-of-the-art are shown in Table 1 and indicate an advantage of the baseline Swin Transformer model with a feature averaging fusion over the Video Swin Transformer, which is a state-of-the-art video classification model. The Video Swin Transformer only performs well for tasks requiring an analysis of short sequences, like action recognition, but not for tasks covering long-range correspondences in videos. This highlights the different requirements of action recognition, as the prime task for video classification research, and fine-grained object recognition, which has not received the same attention in research yet. The TimeSformer model can outperform our baseline slightly since it can make appropriate use of sparse sampling. However, due to the Transformer-based fusion being applied in an early-fusion manner, the architecture still cannot make use of the Transformer to its full advantage. In comparison, our simple TLF mechanism combined with a Swin-Base backbone shows a significant advantage over the baseline and TimeSformer and even more over the Video Swin Transformer. This is achieved by using a model design which is targeted towards video-based fine-grained object recognition. The advantage is particularly remarkable considering the small increase in computational complexity compared to the increase in accuracy, as shown in Figure 2. This also holds true when considering the number of parameters, which is shown in Figure 4. All FLOPs and numbers of parameters are reported for the case of 64 input samples, ignoring augmentations for a fair comparison.

Architecture | Top-1 | Top-5 | #parameters (10^6) | FLOPs (10^9)
Swin Transformer Base [17] | 76.1 | 93.7 | 86.9 | 969.0
Swin Transformer Large [17] | 76.9 | 94.5 | 195.3 | 2178.5
Video Swin Transformer Base [18] | 71.9 | 90.6 | 87.8 | 485.1
TimeSformer [3] | 77.9 | 94.6 | 121.5 | 807.2
Transformer-based Late-Fusion (Ours) | 80.6 | 96.0 | 93.3 | 969.1
Inflated 3D Convolutional Neural Network (I3D)* | 40.9
Batch-Normalized-Inception (BN-Inception)* | 62.0
Temporal Segment Network (TSN)* | 74.3
Redundancy Reduced Attention (RRA)* | 77.6
* Original results by the authors of the YouTube-Cars dataset [35]. Results are not directly comparable since not all videos are available anymore.

Table 1: Results of different classification architectures on the YouTube-Cars dataset [35]. The state-of-the-art video classification model Video Swin Transformer performs poorly for fine-grained vehicle recognition due to its early-fusion design optimized for action recognition. A modern single-image backbone with a simple late-fusion average consensus achieves better results. In comparison, our Transformer-based late-fusion mechanism shows a significantly higher accuracy.

Figure 4: A comparison of our TLF architecture to the Swin Transformer [17], the Video Swin Transformer [18] and the TimeSformer [3] models. Our model achieves a superior top-1 accuracy to the other models with only a slight increase in parameters. (Plot: top-1 accuracy over parameters (M) for TLF (Ours), TimeSformer, Swin-Large, Swin-Base and VideoSwin-Base.)

4.4. Ablation studies

In this section, the impact of the improvements and design choices is evaluated.

Architecture and sampling. We found the sampling of the video to be a deciding factor for the accuracy of the classification. In Table 2, sparse and dense sampling with different numbers of sampled images during training and testing are compared for different models. For the Swin-Base model without TLF and with sparse sampling, we see a significant increase in accuracy when the number of testing samples is increased from 8 to 32 and a slight increase when the number of training samples is increased from 8 to 16. Increasing the number of testing samples further to 64 does not lead to another performance improvement. This increase can be explained by the higher number of perspectives available with a higher frame number. Additionally, more samples enable the compensation of inappropriate samples with other samples.

Using dense sampling with the Swin-Base model leads to a drastic drop in accuracy due to the lack of variance of continuously sampled images. When increasing the number of sampled clips to four during testing, for a total number of 64 frames, the accuracy increases again but is still lower than using 8 frames with a sparse sampling strategy. In contrast to these results, VideoSwin-Base does not perform well with sparse sampling, showing a large drop compared to the baseline model. Dense sampling with an increased number of clips closes the gap, but the performance is still worse than the baseline. VideoSwin-Base performs an early fusion which requires a high similarity of the images. This is only sensible with a dense sampling strategy, explaining the lower accuracy with sparse sampling. However, for fine-grained object recognition, a dense sampling strategy is not appropriate due to a high variety of perspectives being the most important factor for efficiently exploiting videos.

TimeSformer shows a low accuracy with its default dense sampling strategy. However, it profits significantly from using a sparse sampling strategy, and with sparse sampling, it is slightly superior compared to our baseline.

Model Sampling Strategy Training Samples Testing Samples Top-1 Top-5
Swin-Base Sparse 8 8 73.2 92.1
Swin-Base Sparse 8 32 75.6 94.0
Swin-Base Sparse 16 32 76.1 93.9
Swin-Base Sparse 16 64 76.1 93.7
Swin-Base Dense 16×2×1 16×2×1 55.0 78.0
Swin-Base Dense 16×2×1 16×2×4 72.5 92.2
VideoSwin-Base Sparse 16 64 70.0 91.2
VideoSwin-Base Dense 16×2×1 16×2×1 46.5 71.6
VideoSwin-Base Dense 16×2×1 16×2×4 71.9 90.6
TimeSformer Sparse 16 32 76.7 94.0
TimeSformer Sparse 16 64 77.9 94.6
TimeSformer Dense 8×32×1 8×32×1 58.3 80.7
TimeSformer Dense 8×32×1 8×32×4 45.1 69.5
Swin-Base + TLF (Ours) Sparse 16 32 80.5 96.0
Swin-Base + TLF (Ours) Sparse 16 64 80.6 96.0

Table 2: Comparison of sample type and sample sizes during training and testing. Swin-Base with a simple average feature
fusion performs best with sparse sampling and a high number of samples during testing while the number of samples during
training has only a small impact. Dense sampling performs worse for Swin-Base while it is advantageous for VideoSwin-
Base which relies on a high similarity of images in a single clip due to its early-fusion approach. Our TLF performs best since it combines a sophisticated fusion mechanism with a late-fusion strategy.

Since our TLF approach performs a late-fusion, we apply sparse sampling for it. Even with only 32 images during testing, it outperforms all other evaluated models by a significant margin. Increasing the number of testing samples from 32 to 64 shows a slight increase in terms of accuracy.

Positional encoding. Transformer architectures usually apply a positional encoding to provide the information about the sequence of the inputs to the model. Since the original proposal of the Transformer architecture [30], it has been the default setting to use a positional encoding. Thus, we also evaluate the application of a positional encoding. For the experiment, we use the original fixed sine-based positional encoding as proposed by Vaswani et al. [30]. The results are shown in Table 3. The positional encoding is a significant disadvantage for this task, with the architecture without a positional encoding achieving a higher accuracy. Thus, we drop the positional encoding for our TLF approach.

Positional encoding | Top-1 | Top-5
Yes | 76.2 | 93.6
No | 80.5 | 96.0

Table 3: Evaluation of positional encoding. We applied a fixed sine-based positional encoding as it is common for Transformer architectures. However, the positional encoding has shown a significant drop in accuracy for our use case.

The reason for the negative impact of the positional encoding is likely an overfitting of the network when it is able to identify the ordering of the frames and thus expects the same order during inference. Moreover, for fine-grained object recognition, the ordering of the frames is rarely important since most frames are of different scenes anyway due to cuts occurring in the videos.

4.5. Effectiveness of video classification

In Table 4, we compare the use of videos to the use of single images for fine-grained classification of cars. For the comparison, we use a Swin-Base model with an average consensus fusion for the video classification and a plain Swin-Base model for the image classification. In both cases, sparse sampling with 8 frames in training and 32 frames in testing is used. For the per-frame evaluation, each frame is evaluated individually and the average accuracy over all frames is calculated. Since the number of sampled frames per video is constant, the results are comparable to the per-video evaluation results. As can be seen, the use of per-video evaluation provides a drastic increase in accuracy. This shows the advantage of video classification compared to single image classification for fine-grained object recognition and should motivate future research in this direction. The expressiveness of these results might be limited due to some frames showing irrelevant information and being useless for classification. However, for practical applications like surveillance, a single frame can also be suboptimal for object recognition due to, e.g., being blurred. In this case, video classification can compensate single blurred images by ignoring the frame and using information from other frames.

Evaluation | Top-1 | Top-5
Per-frame | 64.9 | 86.5
Per-video | 73.9 | 92.6

Table 4: Comparison of a per-frame evaluation and a per-video evaluation. In case of the per-frame evaluation, the sampling is unchanged but no average is calculated over the images. Instead, each image is evaluated separately. For the per-video evaluation, a simple feature average fusion is used. The comparison shows that using videos has a clear advantage over using only frames since the combination of multiple frames provides additional information useful for classification.
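The two evaluation modes can be sketched as follows (our own illustration of the procedure, not code from the repository):

```python
import torch
from torch import nn

def per_frame_accuracy(frame_logits: torch.Tensor, label: int) -> float:
    """Per-frame evaluation: classify every sampled frame on its own and average the accuracies."""
    predictions = frame_logits.argmax(dim=-1)             # (num_frames,)
    return (predictions == label).float().mean().item()

def per_video_prediction(frame_features: torch.Tensor, classifier: nn.Module) -> int:
    """Per-video evaluation: average the per-frame features first, then classify once per video."""
    consensus = frame_features.mean(dim=0, keepdim=True)  # simple feature average fusion, (1, dim)
    return classifier(consensus).argmax(dim=-1).item()
```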
4.6. Results on YouTube-Birds

To show the general effectiveness of our approach for fine-grained video classification, we compare our TLF to a strong baseline on YouTube-Birds [35]. The results are shown in Table 5. We sample 64 frames in total per video with the best sampling strategy chosen per model. Similar to the results on YouTube-Cars, our model shows a significant increase in terms of classification accuracy compared to the baseline and state-of-the-art models.

Architecture | Top-1 | Top-5
Swin Transformer Base [17] | 76.9 | 91.1
Video Swin Transformer [18] | 72.8 | 88.2
TimeSformer [3] | 77.6 | 90.9
TLF (Ours) | 78.9 | 91.3
I3D* | 40.7
BN-Inception* | 60.1
TSN* | 72.4
RRA* | 73.2
* Original results by the authors of the YouTube-Birds dataset [35]. Results are not directly comparable since not all videos are available anymore.

Table 5: Results of different classification architectures on the YouTube-Birds dataset [35]. Our Transformer-based late-fusion mechanism outperforms the baseline by a significant margin due to the more sophisticated fusion of frames.

4.7. Discussion of results

While our approach outperforms the competitive architectures presented, the influential effect of how the videos are sampled deserves to be emphasized again. Switching between dense and sparse sampling can significantly affect the final accuracy of the model, with no sampling strategy being blatantly superior. As an example, the Video Swin Transformer thrives on dense sampling, while the Swin Transformer performs better when sparsely sampling the video. This indicates that the model architecture might dictate the sampling strategy, with dense sampling being preferably used with early-fusion and sparse sampling with late-fusion approaches. Nonetheless, overall the results show that a sparse sampling strategy with a late-fusion approach is superior. Furthermore, models tend to benefit from higher sample counts, both during training and testing. A fair comparison therefore requires two models to receive an identical number of frames. However, while the frame count might be identical, the frames themselves might not be. This can cause issues due to the high variance in frame quality, which we define as the volume of useful information a frame provides. Additionally, frames might have a low information content regardless of the sampling mode. Exemplary causes of this in the YouTube-Cars dataset are transitions within the video, scenes showing the car interior or frames where the car is occluded by people or other cars. These issues are mitigated in some cases by sampling a sufficient number of frames, but due to the inherent randomness of the sampling and the varying information density of the video data, this cannot be guaranteed.

5. Conclusion

We propose a sophisticated late-fusion approach for fine-grained object recognition using video data. By adding self-attention through a Transformer encoder, a simple average consensus mechanism can be extended to achieve results superior to both a state-of-the-art video classification architecture and the basic consensus mechanism with a larger backbone network. The Transformer encoder applied in the late stages of the network enables the fusion of semantically high-level features and thus better exploits the multiple views offered by video data. Since the presented mechanism comes with low additional cost in terms of FLOPs and parameters, it is more applicable in a real-time setting than a larger model with comparable accuracy.

Finally, we show that proper sampling is a central factor in classification accuracy, both in terms of sample count and sample distribution across the input video. Making the sampling strategy more intelligent and focused on yielding frames containing useful information instead of relying on randomness is an important area of future work that could drastically improve results.

References

[1] Yousef Alsahafi, Daniel Lemmond, Jonathan Ventura, and Terrance Boult. Carvideos: A novel dataset for fine-grained car classification in videos. In 16th International Conference on Information Technology-New Generations (ITNG 2019), 2019.
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In 38th International Conference on Machine Learning, 2021.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Decord Contributors. Decord. https://github.com/dmlc/decord, 2022.
[6] MMAction2 Contributors. OpenMMLab's next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020.
[7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. In European Conference on Computer Vision (ECCV), 2018.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
[9] Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, and Dahua Lin. Omni-sourced webly-supervised learning for video recognition. In European Conference on Computer Vision (ECCV), 2020.
[10] ZongYuan Ge, Chris McCool, Conrad Sanderson, Peng Wang, Lingqiao Liu, Ian Reid, and Peter Corke. Exploiting temporal information for DCNN-based fine-grained object classification. In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016.
[11] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] Kline Innovation. Audi R8 GT, Valvetronic race exhaust by Kline Innovation. https://www.youtube.com/watch?v=VDDQhOaj1ss, 2016. Accessed: 2022-11-17. Licence: CC BY 3.0 https://creativecommons.org/licenses/by/3.0/.
[13] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
[14] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.
[15] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In IEEE International Conference on Computer Vision (ICCV), 2015.
[16] Shenglan Liu, Xiang Liu, Gao Huang, Hong Qiao, Lianyu Hu, Dong Jiang, Aibin Zhang, Yang Liu, and Ge Guo. FSD-10: A fine-grained classification dataset for figure skating. Neurocomputing, 413:360–367, 2020.
[17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[18] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
[20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
[21] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
[22] Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2017.
[23] AJ Piergiovanni and Michael S. Ryoo. Fine-grained activity recognition in baseball videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
[24] Tomoaki Saito, Asako Kanezaki, and Tatsuya Harada. IBC127: Video dataset for fine-grained bird classification. In 2016 IEEE International Conference on Multimedia and Expo (ICME), 2016.
[25] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[26] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
[27] K. Soomro, A. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. ArXiv, abs/1212.0402, 2012.
[28] Faezeh Tafazzoli, Hichem Frigui, and Keishin Nishiyama. A large and diverse dataset for improved vehicle make and model recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[29] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[31] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), 2016.
[32] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. California Institute of Technology, CNS-TR-2010-001, 2010.
[33] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[34] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian. Picking deep filter responses for fine-grained image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[35] Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. Fine-grained video categorization with redundancy reduction attention. In European Conference on Computer Vision (ECCV), 2018.

