Koch A Transformer-Based Late-Fusion Mechanism For Fine-Grained Object Recognition in Videos WACVW 2023 Paper
Abstract
fusion mechanism that efficiently aggregates the features of multiple input images of a video to enhance
Figure 3: Schematic illustration of our Transformer-based late-fusion approach for fine-grained video classification. (Figure labels: Segment 1, Segment 2, Segment 3; Multi-Head Attention; Classification.)
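The fusion step illustrated in Figure 3 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single attention head with Q = K = V set to the frame features, and it omits the learned linear projections, the multiple heads, the residual connection, the normalization and the feed-forward network of a full Transformer encoder. The function names and the 4-dimensional toy features are hypothetical (the paper uses 1024-dimensional features).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Single-head scaled dot-product self-attention with Q = K = V = features.

    Assumption: learned projections, multiple heads, residuals, normalization
    and the feed-forward network of a full encoder are left out on purpose.
    """
    d_k = len(features[0])
    scale = math.sqrt(d_k)
    attended = []
    for query in features:
        # Similarity of this frame to every frame, scaled by sqrt(d_k).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in features]
        weights = softmax(scores)  # one weight per frame, summing to 1
        attended.append([
            sum(w * value[j] for w, value in zip(weights, features))
            for j in range(d_k)
        ])
    return attended


def average_consensus(features):
    """Average the per-frame vectors into a single video-level vector."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def late_fusion(features):
    """Self-attention across all frames, followed by the average consensus."""
    return average_consensus(self_attention(features))


# Toy input: 3 frames with 4-dimensional features.
frames = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
video_vector = late_fusion(frames)
print(len(video_vector))
```

The resulting video-level vector would then be passed to dropout and a fully-connected classifier; a plain average consensus corresponds to calling `average_consensus(frames)` directly.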
Temporal Transformer. The recent trend of using self-attention for image recognition has also been picked up for video classification. These architectures use the Transformer mechanism to process sequences of images. ViViT [2] extends the ViT [8] architecture to enable the processing of image sequences. The authors propose multiple model variants including early-fusion and late-fusion methods. Neimark et al. [21] also propose a late-fusion Transformer architecture. However, early-fusion approaches have prevailed because they are advantageous for action recognition, as shown by Arnab et al. [2]. While these late-fusion architectures are the most similar to our Transformer-based late-fusion mechanism, we show that fine-grained video classification has drastically different requirements, leading to different design decisions. Video Swin Transformer [18] continues the trend of early-fusion architectures by extending the shifted-window mechanism of Swin Transformer [17] to the temporal dimension. The shifted windows reduce the computational complexity while ensuring inter-token information sharing. Bertasius et al. [3] propose TimeSformer as another Transformer-based video classification architecture that divides the spatial and temporal attention to reduce complexity and increase accuracy, while using an early-fusion approach because it is tailored towards action recognition.

2.2. Fine-grained classification

This section is divided into fine-grained classification based on images and on videos. For the former, a large set of datasets [13, 28, 33] has been published, motivating a variety of algorithmic approaches. In contrast, the field of fine-grained video classification is rather small. Most fine-grained video classification datasets and research works address fine-grained action recognition [7, 14, 16, 23, 25], while only a few datasets [1, 10, 24, 35] exist for our application of fine-grained object recognition in videos, which limits research progress.

Fine-grained image classification. While fine-grained classification on single images can technically be realized using conventional image classification methods, specialized architectures have emerged to improve their results [15, 22, 34]. Initial models focused on the explicit identification of parts to distinguish the various classes, but with the advent of these deep learning approaches, identifying relevant features became both implicit and learnable.

Fine-grained video classification. In the context of fine-grained video classification, the research is significantly narrower. Alsahafi et al. [1] use object detection to localize vehicles in videos and extract the relevant parts of the images. Afterwards, an image-wise CNN and a simple fusion mechanism are used for classification. Redundancy Reduction Attention [35] uses spatial and temporal attention to iteratively suppress redundant information in a video.

3. Method

Our method consists of three parts, which are illustrated in Figure 3. First, we extract frames from the input video with a sparse sampling strategy to cover the full range of the video. Afterwards, we extract features of the frames with a modern Swin Transformer [17] backbone. As the last step, we apply a sophisticated Transformer-based late-fusion mechanism to derive a fine-grained classification score for each input video. Since fusion mechanisms implemented in the backbone might fail to find meaningful feature relationships early on, we postpone this operation to the
classification head. Late-fusion approaches like this usually rely on simple consensus mechanisms such as averaging feature vectors, followed by a final fully-connected classification layer. Temporal Segment Networks use this technique with great success [31], but have limited applicability in fine-grained classification due to their lack of attention across frame boundaries. Our approach prepends a Transformer encoder [30] to the consensus mechanism, which applies self-attention across all frames simultaneously, emphasizing important features. This self-attention provides an additional pathway to improving model accuracy in fine-grained classification, as correctly distinguishing closely related classes might depend on very few features.

Sampling. In the first step of our approach, we apply a sparse sampling strategy that splits the video into a specific number of segments and selects a random frame from each segment. The set of frames is called N. Each resulting frame is augmented as described in Section 4. While the number of images is multiplied during inference due to augmentation, all preprocessed images are handled as a single set of input images S.

Sparse sampling is rarely used for video classification because dense sampling is advantageous for action recognition, which requires the extraction of short-term context. While Wang et al. [31] propose the sparse sampling strategy for action recognition by sampling the RGB frames sparsely across the video, they still employ some form of dense sampling by using stacks of short-term optical flows in a two-stream architecture. However, fine-grained object recognition profits from videos in a different way, and thus the widely applied dense sampling is inferior to a sparse sampling strategy, as we show in our experiments.

Feature extraction. Each input image is fed separately into the Swin Transformer [17] backbone. The result is a feature map with the shape of 1024x7x7 for each augmented image. Afterwards, each feature map is reduced to a feature vector of shape 1024x1x1 via adaptive average pooling. The total set of feature vectors of all augmentations from all frames is called E.

Self-attention. The self-attention mechanism central to our late-fusion mechanism is part of the Transformer architecture [30]. Specifically, self-attention is achieved via the multi-head attention block. Usually, this block receives different query Q, key K and value V matrices. In the case of self-attention, all these matrices are identical, meaning they are all set to the input features.

Each head learns its own set of linear transformations that are applied to the input data, followed by a scaled dot-product attention mechanism. All head outputs are eventually concatenated and linearly transformed to the desired output dimension. The scaled dot-product attention is implemented by multiplying Q and K and scaling the result in relation to their dimensionality, followed by a softmax. Scaling is required to prevent the dot-product from growing too large in magnitude.

The resulting matrix scores the importance of each input element with a value between 0 and 1. This attention matrix is then applied to V via matrix multiplication. Overall, attention can be expressed concisely via Equation 1.

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \tag{1} \]

Once the multi-head attention is defined, building a full Transformer encoder only requires two additional components. First, a residual connection has to be inserted, followed by an addition and normalization step that combines the residual data with the results. Second, a feed-forward network is appended to reduce the output dimension to the desired shape.

Transformer-based late-fusion. The feature vectors E are all passed through the Transformer encoder simultaneously. For our architecture, we use 8 heads within the multi-head attention model and a dimension of 1024 for both the input features and the feed-forward network. Hence, the resulting feature vectors still have the shape 1024x1x1. These feature vectors are averaged to provide a single 1024x1x1 feature vector to the classification stage.

Compared to a simple average fusion without the Transformer encoder, our Transformer-based late-fusion mechanism enables a more sophisticated aggregation. It can represent interdependencies between the views of images that cannot be represented by a simple linear aggregation of the features.

Classification. Once the consensus has been applied, the resulting feature vector is fed into a dropout layer and a final fully-connected layer to classify the video. Afterwards, a softmax is applied to normalize the output scores.

4. Experiments

In the following paragraphs, the implementation, results and evaluation of our experiments are discussed.

4.1. Settings

Optimization. The AdamW [20] optimizer with a cosine annealing [19] policy was used consistently during training. The learning rate was kept at a base value of 10^-4 when combined with a batch size of 8. Due to VRAM limitations, this batch size had to be halved to 4 in some experiments. In these instances, the learning rate was also halved accordingly to 5 · 10^-5. All experiments use a weight decay of 10^-2. These values are based on the defaults used in mmaction2 [6].
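The optimization settings above can be illustrated with a small sketch of the learning-rate arithmetic. Only the base learning rate, the batch sizes and the weight decay are taken from the text; the helper names, the total number of steps, the minimum learning rate of 0 and the absence of warm-up or restarts are assumptions made for illustration and do not reproduce the mmaction2 configuration.

```python
import math

BASE_LR = 1e-4         # base learning rate reported for a batch size of 8
BASE_BATCH_SIZE = 8
WEIGHT_DECAY = 1e-2    # reported weight decay, listed here for completeness


def scaled_lr(batch_size, base_lr=BASE_LR, base_batch_size=BASE_BATCH_SIZE):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size


def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing without restarts: lr_max at step 0, lr_min at the end.

    Assumption: lr_min and total_steps are illustrative, not from the paper.
    """
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


lr = scaled_lr(batch_size=4)  # halving the batch size halves the learning rate
schedule = [cosine_annealing(s, total_steps=10, lr_max=lr) for s in range(11)]
print(f"{lr:.1e}")
```

With a batch size of 4, `scaled_lr` yields the halved rate of 5 · 10^-5 mentioned in the text, and the schedule decays smoothly from that value towards the assumed minimum.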
Video decoding. The Decord [5] video loader used by mmaction2 provides two different modes of operation: efficient and accurate. Choosing the efficient mode reduces the time it takes to extract random frame samples, as Decord then utilizes a fast, inexact random seek algorithm that only returns Intra-Frames (or I-Frames). The drawback is the possibility of receiving the same frame twice when two samples are sufficiently close to each other. We chose to employ the efficient mode only during training, and only if samples are drawn in a sparse fashion. For dense sampling and during testing, accurate sampling was used. This yields a compromise: training is faster at the cost of possibly duplicated frames, while frame duplication is prevented during evaluation.

Sampling. Processing all images in a video is inappropriate for classification due to limited compute resources and the redundancy of consecutive frames. Thus, a sampling strategy has to be applied to select a number of images that can be realistically handled and that are most appropriate for an accurate classification. In video classification, sparse and dense sampling are the common sampling strategies. In sparse sampling, the video is divided into k parts, and from each part a single image is selected randomly. In dense sampling, the video is also divided into k parts. However, from each part, a contiguous sequence of images with length l and stride s is chosen randomly. For experiments with dense sampling, we report the parameters as l × s × k.

Augmentations. During training, a random horizontal flip is applied per video and each sampled input image is augmented with a random crop. Afterwards, the frame is resized to 224x224 pixels. Each random crop has a random position in the frame, with all possible positions being equally probable. The dimension is calculated based on a given aspect ratio and the total area covered, both of which are chosen randomly within a given interval. In our case, the aspect ratio is chosen randomly in the interval [3/4, 4/3] and the area in [0.08, 1], with the area being interpreted as a fraction of the total frame size. For testing, each frame is cropped 5 times: once in each corner and once in the center of the frame. This time, all crops have a static width and height of 224 pixels. Additionally, each augmentation is duplicated and flipped along the vertical axis, yielding 10 augmentations in total. Once an input frame has been augmented and resized to the input dimension of 224x224 pixels, the results are passed into the backbone network.

For Video Swin Transformer [18] and TimeSformer [3], we use a three-crop strategy as intended by their authors. However, preliminary experiments have shown that the differences between three-crop and ten-crop are negligible.

4.2. Dataset

The YouTube-Cars dataset [35] is used to evaluate our architecture, since it provides video data with fine-grained labels. Additionally, the YouTube-Birds dataset provided by the same authors is used for additional validation of the model’s efficacy. Experiments are done on the YouTube-Cars dataset if not mentioned otherwise. YouTube-Cars provides video data for 196 classes and YouTube-Birds for 200 classes, with the class selection being identical to Stanford Cars [13] and CUB-200-2011 [32], respectively. The full car dataset contains 10,238 videos for training and 4,855 videos for testing, while the bird dataset provides 12,666 training and 5,684 testing videos.

As YouTube is an inherently unreliable data source, video availability is never guaranteed. Thus, some of the videos of the datasets were no longer available when we fetched the dataset. Hence, any comparison of our work to the results in the original paper is limited in its validity and should be considered tentative. However, both datasets were kept consistent during our experiments. While some videos were no longer available, most of the data could still be fetched and no class had to be removed due to a lack of footage.

4.3. Comparison with state-of-the-art

To demonstrate the effectiveness of TLF, we compare our approach against Swin Transformer [17] with a simple feature average consensus as a strong baseline model and against the state-of-the-art video classification models Video Swin Transformer [18] and TimeSformer [3]. While we also include published results on the YouTube-Cars dataset [35], the results are not directly comparable because some videos are no longer available, as described in Section 4.2. We used the best of all evaluated sampling strategies and numbers of samples for each model. The impact of the sampling strategy is described in Section 4.4. The results of the comparison with the state-of-the-art are shown in Table 1 and indicate an advantage of the baseline Swin Transformer model with a feature averaging fusion over the Video Swin Transformer, which is a state-of-the-art video classification model. The Video Swin Transformer only performs well for tasks requiring an analysis of short sequences, like action recognition, but not for tasks covering long-range correspondences in videos. This highlights the different requirements of action recognition, the prime task of video classification research, and fine-grained object recognition, which has not yet received the same attention in research. The TimeSformer model can outperform our baseline slightly since it can make appropriate use of sparse sampling. However, because the Transformer-based fusion is applied in an early-fusion manner, the architecture still cannot make use of the Transformer to its full advantage. In comparison, our simple TLF mechanism combined with a Swin-Base backbone
Architecture                                      Top-1  Top-5  #Parameters (10^6)  FLOPs (10^9)
Swin Transformer Base [17]                        76.1   93.7   86.9                969.0
Swin Transformer Large [17]                       76.9   94.5   195.3               2178.5
Video Swin Transformer Base [18]                  71.9   90.6   87.8                485.1
TimeSformer [3]                                   77.9   94.6   121.5               807.2
Transformer-based Late-Fusion (Ours)              80.6   96.0   93.3                969.1
Inflated 3D Convolutional Neural Network (I3D)*   40.9   -      -                   -
Batch-Normalized-Inception (BN-Inception)*        62.0   -      -                   -
Temporal Segment Network (TSN)*                   74.3   -      -                   -
Redundancy Reduction Attention (RRA)*             77.6   -      -                   -
* Original results by the authors of the YouTube-Cars dataset [35]. Results are not directly comparable since not all videos are available anymore.

Table 1: Results of different classification architectures on the YouTube-Cars dataset [35]. The state-of-the-art video classification model Video Swin Transformer performs poorly for fine-grained vehicle recognition due to its early-fusion design optimized for action recognition. A modern single-image backbone with a simple late-fusion average consensus achieves better results. In comparison, our Transformer-based late-fusion mechanism shows a significantly higher accuracy.
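The cost of the late-fusion mechanism relative to its Swin-Base backbone can be read off Table 1 with a few lines of arithmetic. The sketch below only uses numbers copied from the table; the dictionary layout and the helper name are illustrative assumptions.

```python
# Values copied from Table 1 (parameters in 10^6, FLOPs in 10^9, Top-1 in %).
baseline = {"top1": 76.1, "params": 86.9, "flops": 969.0}  # Swin Transformer Base
tlf = {"top1": 80.6, "params": 93.3, "flops": 969.1}       # Swin-Base + TLF (Ours)


def relative_increase(new, old):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old


param_overhead = relative_increase(tlf["params"], baseline["params"])
flop_overhead = relative_increase(tlf["flops"], baseline["flops"])
top1_gain = tlf["top1"] - baseline["top1"]  # absolute gain in percentage points

print(f"parameters: +{param_overhead:.1f}%, FLOPs: +{flop_overhead:.2f}%")
print(f"Top-1 gain: +{top1_gain:.1f} points")
```

Under these numbers, the added encoder accounts for roughly 7% more parameters and a negligible change in FLOPs, while the Top-1 accuracy rises by 4.5 points.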
Model                    Sampling strategy  Training samples  Testing samples  Top-1  Top-5
Swin-Base                Sparse             8                 8                73.2   92.1
Swin-Base                Sparse             8                 32               75.6   94.0
Swin-Base                Sparse             16                32               76.1   93.9
Swin-Base                Sparse             16                64               76.1   93.7
Swin-Base                Dense              16×2×1            16×2×1           55.0   78.0
Swin-Base                Dense              16×2×1            16×2×4           72.5   92.2
VideoSwin-Base           Sparse             16                64               70.0   91.2
VideoSwin-Base           Dense              16×2×1            16×2×1           46.5   71.6
VideoSwin-Base           Dense              16×2×1            16×2×4           71.9   90.6
TimeSformer              Sparse             16                32               76.7   94.0
TimeSformer              Sparse             16                64               77.9   94.6
TimeSformer              Dense              8×32×1            8×32×1           58.3   80.7
TimeSformer              Dense              8×32×1            8×32×4           45.1   69.5
Swin-Base + TLF (Ours)   Sparse             16                32               80.5   96.0
Swin-Base + TLF (Ours)   Sparse             16                64               80.6   96.0

Table 2: Comparison of sample type and sample sizes during training and testing. Swin-Base with a simple average feature fusion performs best with sparse sampling and a high number of samples during testing, while the number of samples during training has only a small impact. Dense sampling performs worse for Swin-Base while it is advantageous for VideoSwin-Base, which relies on a high similarity of images in a single clip due to its early-fusion approach. Our TLF performs best since it combines the strategy of a sophisticated fusion and a late-fusion.
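The two sampling strategies compared in Table 2 can be sketched as index generators. The sketch below follows the description of sparse sampling (one random frame per segment) and dense sampling (one clip of length l and stride s per segment, reported as l × s × k). The function names, the segment boundary rounding and the handling of clips near the end of the video are assumptions made for illustration; they are not taken from the paper or from mmaction2.

```python
import random


def segment_bounds(num_frames, k):
    """Split the frame indices [0, num_frames) into k nearly equal segments."""
    return [(i * num_frames // k, (i + 1) * num_frames // k) for i in range(k)]


def sparse_sample(num_frames, k, rng):
    """Sparse sampling: one random frame index from each of the k segments."""
    if num_frames < k:
        raise ValueError("need at least one frame per segment")
    return [rng.randrange(start, end) for start, end in segment_bounds(num_frames, k)]


def dense_sample(num_frames, l, s, k, rng):
    """Dense sampling (l x s x k): per segment, one clip of l frames with stride s."""
    span = (l - 1) * s + 1  # frames covered by a single clip
    if num_frames < max(k, span):
        raise ValueError("video too short for the requested sampling")
    latest = num_frames - span  # last clip start that stays inside the video
    clips = []
    for start, end in segment_bounds(num_frames, k):
        # Assumption: the clip starts inside its segment but may run past the
        # segment end; starts are clamped so the clip never leaves the video.
        first = rng.randint(min(start, latest), min(end - 1, latest))
        clips.append([first + i * s for i in range(l)])
    return clips


rng = random.Random(0)  # fixed seed only to make the example repeatable
sparse_indices = sparse_sample(num_frames=300, k=16, rng=rng)
dense_clips = dense_sample(num_frames=300, l=16, s=2, k=4, rng=rng)
print(len(sparse_indices), len(dense_clips), len(dense_clips[0]))
```

For a hypothetical 300-frame video, the sparse call spreads 16 frames over the whole video, whereas the 16×2×4 dense call returns 4 clips that each cover only 31 consecutive frames.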
cantly from using a sparse sampling strategy, and when using sparse sampling it is slightly superior to our baseline.

Since our TLF approach performs a late fusion, we apply sparse sampling for it. Even with only 32 images during testing, it outperforms all other evaluated models by a significant margin. Increasing the number of testing samples from 32 to 64 yields a slight increase in accuracy.

Positional encoding. Transformer architectures usually apply a positional encoding to provide the model with information about the order of the inputs. Since the original proposal of the Transformer architecture [30], using a positional encoding has been the default setting. Thus, we also evaluate the application of a positional encoding. For the experiment, we use the original fixed sine-based positional encoding as proposed by Vaswani et al. [30]. The results are shown in Table 3. The positional encoding is a significant disadvantage for this task: the architecture without a positional encoding achieves a higher accuracy. Thus, we drop the positional encoding for our TLF approach.

The reason for the negative impact of the positional encoding is likely overfitting: the network is able to identify the ordering of the frames and thus expects the same order during inference. Moreover, for fine-grained object recognition the ordering of the frames is rarely important, since most frames are of different scenes anyway due to cuts occurring in the videos.

Positional encoding  Top-1  Top-5
Yes                  76.2   93.6
No                   80.5   96.0

Table 3: Evaluation of positional encoding. We applied a fixed sine-based positional encoding as is common for Transformer architectures. However, the positional encoding led to a significant drop in accuracy for our use case.

4.5. Effectiveness of video classification

In Table 4, we compare the use of videos to the use of single images for fine-grained classification of cars. For the comparison, we use a Swin-Base model with an average consensus fusion for the video classification and a plain Swin-Base model for the image classification. In both cases, sparse sampling with 8 frames in training and 32 frames in testing is used. For the per-frame evaluation, each frame is evaluated individually and the average accuracy over all frames is calculated. Since the number of sampled frames per video is constant, the results are comparable to the per-video evaluation results. As can be seen, the per-video evaluation provides a drastic increase in accuracy. This shows the advantage of video classification compared to single-image classification for fine-grained object recognition and should motivate future research in this direction. The expressiveness of these results might be limited due to
some frames showing irrelevant information and being useless for classification. However, for practical applications like surveillance, a single frame can also be suboptimal for object recognition, e.g. due to being blurred. In this case, video classification can compensate for single blurred images by ignoring the frame and using information from other frames.

Evaluation  Top-1  Top-5
Per-frame   64.9   86.5
Per-video   73.9   92.6

Table 4: Comparison of a per-frame evaluation and a per-video evaluation. In the per-frame evaluation, the sampling is unchanged but no average is calculated over the images. Instead, each image is evaluated separately. For the per-video evaluation, a simple feature average fusion is used. The comparison shows that using videos has a clear advantage over using only frames, since the combination of multiple frames provides additional information useful for classification.

4.6. Results on YouTube-Birds

To show the general effectiveness of our approach for fine-grained video classification, we compare our TLF to a strong baseline on YouTube-Birds [35]. The results are shown in Table 5. We sample 64 frames in total per video with the best sampling strategy chosen per model. Similar to the results on YouTube-Cars, our model shows a significant increase in classification accuracy compared to the baseline and state-of-the-art models.

Architecture                   Top-1  Top-5
Swin Transformer Base [17]     76.9   91.1
Video Swin Transformer [18]    72.8   88.2
TimeSformer [3]                77.6   90.9
TLF (Ours)                     78.9   91.3
I3D*                           40.7   -
BN-Inception*                  60.1   -
TSN*                           72.4   -
RRA*                           73.2   -
* Original results by the authors of the YouTube-Birds dataset [35]. Results are not directly comparable since not all videos are available anymore.

Table 5: Results of different classification architectures on the YouTube-Birds dataset [35]. Our Transformer-based late-fusion mechanism outperforms the baseline by a significant margin due to the more sophisticated fusion of frames.

4.7. Discussion of results

While our approach outperforms the competing architectures presented, the influence of how the videos are sampled deserves to be emphasized again. Switching between dense and sparse sampling can significantly affect the final accuracy of the model, with no sampling strategy being universally superior. As an example, the Video Swin Transformer thrives on dense sampling, while the Swin Transformer performs better when sparsely sampling the video. This indicates that the model architecture might dictate the sampling strategy, with dense sampling being preferable for early-fusion and sparse sampling for late-fusion approaches. Nonetheless, overall the results show that a sparse sampling strategy with a late-fusion approach is superior. Furthermore, models tend to benefit from higher sample counts, both during training and testing. A fair comparison therefore requires two models to receive an identical number of frames. However, while the frame count might be identical, the frames themselves might not be. This can cause issues due to the high variance in frame quality, which we define as the volume of useful information a frame provides. Additionally, frames might have a low information content regardless of the sampling mode. Example causes of this in the YouTube-Cars dataset are transitions within the video, scenes showing the car interior, or frames where the car is occluded by people or other cars. These issues are mitigated in some cases by sampling a sufficient number of frames, but due to the inherent randomness of the sampling and the varying information density of the video data, this cannot be guaranteed.

5. Conclusion

We propose a sophisticated late-fusion approach for fine-grained object recognition using video data. By adding self-attention through a Transformer encoder, a simple average consensus mechanism can be extended to achieve results superior to both a state-of-the-art video classification architecture and the basic consensus mechanism with a larger backbone network. The Transformer encoder applied in the late stages of the network enables the fusion of semantically high-level features and thus better exploits the multiple views offered by video data. Since the presented mechanism comes with low additional cost in terms of FLOPs and parameters, it is more applicable in a real-time setting than a larger model with comparable accuracy.

Finally, we show that proper sampling is a central factor in classification accuracy, both in terms of sample count and sample distribution across the input video. Making the sampling strategy more intelligent and focused on yielding frames containing useful information instead of relying on randomness is an important area of future work that could drastically improve results.
References

[1] Yousef Alsahafi, Daniel Lemmond, Jonathan Ventura, and Terrance Boult. Carvideos: A novel dataset for fine-grained car classification in videos. In 16th International Conference on Information Technology-New Generations (ITNG 2019), 2019.
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In 38th International Conference on Machine Learning, 2021.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] Decord Contributors. Decord. https://github.com/dmlc/decord, 2022.
[6] MMAction2 Contributors. Openmmlab’s next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2, 2020.
[7] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[9] Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, and Dahua Lin. Omni-sourced webly-supervised learning for video recognition. In European Conference on Computer Vision (ECCV), 2020.
[10] ZongYuan Ge, Chris McCool, Conrad Sanderson, Peng Wang, Lingqiao Liu, Ian Reid, and Peter Corke. Exploiting temporal information for dcnn-based fine-grained object classification. In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 2016.
[11] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] Kline Innovation. Audi r8 gt, valvetronic race exhaust by kline innovation. https://www.youtube.com/watch?v=VDDQhOaj1ss, 2016. Accessed: 2022-11-17. Licence: CC BY 3.0 https://creativecommons.org/licenses/by/3.0/.
[13] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), 2013.
[14] Yingwei Li, Yi Li, and Nuno Vasconcelos. Resound: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), 2018.
[15] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In IEEE International Conference on Computer Vision (ICCV), 2015.
[16] Shenglan Liu, Xiang Liu, Gao Huang, Hong Qiao, Lianyu Hu, Dong Jiang, Aibin Zhang, Yang Liu, and Ge Guo. Fsd-10: A fine-grained classification dataset for figure skating. Neurocomputing, 413:360–367, 2020.
[17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[18] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[19] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
[20] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
[21] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. In IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
[22] Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2017.
[23] AJ Piergiovanni and Michael S. Ryoo. Fine-grained activity recognition in baseball videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018.
[24] Tomoaki Saito, Asako Kanezaki, and Tatsuya Harada. Ibc127: Video dataset for fine-grained bird classification. In 2016 IEEE International Conference on Multimedia and Expo (ICME), 2016.
[25] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. Finegym: A hierarchical video dataset for fine-grained action understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[26] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
[27] K. Soomro, A. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. ArXiv, abs/1212.0402, 2012.
[28] Faezeh Tafazzoli, Hichem Frigui, and Keishin Nishiyama. A large and diverse dataset for improved vehicle make and model recognition. In IEEE Conference on Computer Vision and Pattern Recognition (2017) Workshops, 2017.
[29] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[31] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision (ECCV), 2016.
[32] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. California Institute of Technology. CNS-TR-2010-001, 2010.
[33] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[34] Xiaopeng Zhang, Hongkai Xiong, Wengang Zhou, Weiyao Lin, and Qi Tian. Picking deep filter responses for fine-grained image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[35] Chen Zhu, Xiao Tan, Feng Zhou, Xiao Liu, Kaiyu Yue, Errui Ding, and Yi Ma. Fine-Grained Video Categorization with Redundancy Reduction Attention. In European Conference on Computer Vision (ECCV), 2018.