
DOI: 10.1145/3474085.3475566

Viewing from Frequency Domain: A DCT-based Information Enhancement Network for Video Person Re-Identification

Published: 17 October 2021

Abstract

Video-based person re-identification (Re-ID) aims to match target pedestrians across a non-overlapping camera system using video tracklets. The key issue in video Re-ID is learning effective spatio-temporal features. Generally, the spatio-temporal information of a video sequence can be divided into two aspects: the discriminative information in each frame and the information shared across the whole sequence. To make full use of the rich information in video sequences, this paper proposes a Discrete Cosine Transform based Information Enhancement Network (DCT-IEN) to achieve a more comprehensive spatio-temporal representation from the frequency domain. Inspired by the principle that average pooling corresponds to one special frequency component of the DCT (the lowest frequency component), DCT-IEN first applies the discrete cosine transform to convert the extracted feature maps into the frequency domain, thereby retaining more of the information embedded in the different frequency components. With the help of the DCT frequency spectrum, two branches are adopted to learn the final video representation: a Frequency Selection Module (FSM) and a Lowest Frequency Enhancement Module (LFEM). FSM explores the most discriminative features in each frame by aggregating different frequency components with an attention mechanism. LFEM enhances the features shared across the whole video sequence through frame-feature regularization. By fusing these two kinds of features, DCT-IEN achieves a comprehensive video representation. We conduct extensive experiments on two widely used datasets. The experimental results verify our idea and demonstrate the effectiveness of DCT-IEN for video-based Re-ID.
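The principle the abstract leans on — that global average pooling is exactly the lowest-frequency (DC) component of the 2D DCT, up to a constant factor — can be checked directly. The sketch below is illustrative only (function names are not from the paper) and uses the standard unnormalized DCT-II basis:

```python
import numpy as np

def dct2_basis(H, W, h, w):
    # 2D DCT-II basis function for frequency indices (h, w):
    # cos(pi*h*(2i+1)/(2H)) * cos(pi*w*(2j+1)/(2W))
    i = np.arange(H).reshape(-1, 1)
    j = np.arange(W).reshape(1, -1)
    return (np.cos(np.pi * h * (2 * i + 1) / (2 * H))
            * np.cos(np.pi * w * (2 * j + 1) / (2 * W)))

def dct2_component(x, h, w):
    # Project an (H, W) feature map onto a single DCT frequency component.
    H, W = x.shape
    return float(np.sum(x * dct2_basis(H, W, h, w)))

x = np.random.rand(8, 8)           # stand-in for one channel of a feature map
dc = dct2_component(x, 0, 0)       # lowest-frequency (DC) component
gap = x.mean()                     # global average pooling
# For h = w = 0 the basis is all ones, so the DC coefficient is the plain
# sum of the map: GAP equals the DC component divided by H*W.
assert np.isclose(dc, gap * x.size)
```

This is why replacing average pooling with a full DCT spectrum strictly generalizes it: the DC term reproduces GAP, while the remaining frequency components carry spatial detail that pooling discards.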




      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. discrete cosine transform
      2. spatio-temporal feature learning
      3. video-based person re-identification

      Qualifiers

      • Research-article


      Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

      Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Cited By
• (2024) STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement. IEEE Transactions on Multimedia, 26, 7237-7249. DOI: 10.1109/TMM.2024.3362136
• (2024) Mini-3DCvT: A Lightweight Lip-Reading Method Based on 3D Convolution Visual Transformer. The Visual Computer. DOI: 10.1007/s00371-024-03515-y. Online publication date: 11 June 2024
• (2023) Context Sensing Attention Network for Video-based Person Re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(4), 1-20. DOI: 10.1145/3573203
• (2023) Inter-Intra Modal Representation Augmentation With DCT-Transformer Adversarial Network for Image-Text Matching. IEEE Transactions on Multimedia, 25, 8933-8945. DOI: 10.1109/TMM.2023.3243665
• (2022) Video Person Re-Identification Using Attribute-Enhanced Features. IEEE Transactions on Circuits and Systems for Video Technology, 32(11), 7951-7966. DOI: 10.1109/TCSVT.2022.3189027
• (2021) Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification. Entropy, 23(12), 1686. DOI: 10.3390/e23121686
