
DOI: 10.1145/3474085.3475566

Viewing from Frequency Domain: A DCT-based Information Enhancement Network for Video Person Re-Identification

Published: 17 October 2021

Abstract

Video-based person re-identification (Re-ID) aims to match target pedestrians across a non-overlapping camera system using video tracklets. The key issue in video Re-ID is learning effective spatio-temporal features. Generally, the spatio-temporal information of a video sequence can be divided into two aspects: the discriminative information in each frame and the information shared across the whole sequence. To make full use of the rich information in video sequences, this paper proposes a Discrete Cosine Transform based Information Enhancement Network (DCT-IEN) to achieve a more comprehensive spatio-temporal representation from the frequency domain. Inspired by the principle that average pooling corresponds to one special frequency component of the DCT (the lowest frequency component), DCT-IEN first applies the discrete cosine transform to convert the extracted feature maps into the frequency domain, thereby retaining more of the information embedded in the different frequency components. With the help of the DCT frequency spectrum, two branches are adopted to learn the final video representation: a Frequency Selection Module (FSM) and a Lowest Frequency Enhancement Module (LFEM). FSM explores the most discriminative features in each frame by aggregating different frequency components with an attention mechanism. LFEM enhances the features shared across the whole video sequence through frame-feature regularization. By fusing these two kinds of features, DCT-IEN achieves a comprehensive video representation. We conduct extensive experiments on two widely used datasets. The experimental results verify our idea and demonstrate the effectiveness of DCT-IEN for video-based Re-ID.
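The principle the abstract leans on — that global average pooling is exactly the lowest-frequency (DC) component of the 2D DCT, up to a constant factor — can be checked directly. The sketch below is illustrative only (function names are not from the paper) and uses the standard unnormalized DCT-II basis:

```python
import numpy as np

def dct2_basis(H, W, h, w):
    # 2D DCT-II basis function for frequency indices (h, w):
    # cos(pi*h*(2i+1)/(2H)) * cos(pi*w*(2j+1)/(2W))
    i = np.arange(H).reshape(-1, 1)
    j = np.arange(W).reshape(1, -1)
    return (np.cos(np.pi * h * (2 * i + 1) / (2 * H))
            * np.cos(np.pi * w * (2 * j + 1) / (2 * W)))

def dct2_component(x, h, w):
    # Project an (H, W) feature map onto a single DCT frequency component.
    H, W = x.shape
    return float(np.sum(x * dct2_basis(H, W, h, w)))

x = np.random.rand(8, 8)           # stand-in for one channel of a feature map
dc = dct2_component(x, 0, 0)       # lowest-frequency (DC) component
gap = x.mean()                     # global average pooling
# For h = w = 0 the basis is all ones, so the DC coefficient is the plain
# sum of the map: GAP equals the DC component divided by H*W.
assert np.isclose(dc, gap * x.size)
```

This is why replacing average pooling with a full DCT spectrum strictly generalizes it: the DC term reproduces GAP, while the remaining frequency components carry spatial detail that pooling discards.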




      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. discrete cosine transform
      2. spatio-temporal feature learning
      3. video-based person re-identification

      Qualifiers

      • Research-article


      Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

      Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Cited By
• (2024) STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement. IEEE Transactions on Multimedia, 26, 7237-7249. DOI: 10.1109/TMM.2024.3362136
• (2024) Mini-3DCvT: A Lightweight Lip-Reading Method Based on 3D Convolution Visual Transformer. The Visual Computer. DOI: 10.1007/s00371-024-03515-y. Online publication date: 11 June 2024
• (2023) Context Sensing Attention Network for Video-based Person Re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(4), 1-20. DOI: 10.1145/3573203
• (2023) Inter-Intra Modal Representation Augmentation With DCT-Transformer Adversarial Network for Image-Text Matching. IEEE Transactions on Multimedia, 25, 8933-8945. DOI: 10.1109/TMM.2023.3243665
• (2022) Video Person Re-Identification Using Attribute-Enhanced Features. IEEE Transactions on Circuits and Systems for Video Technology, 32(11), 7951-7966. DOI: 10.1109/TCSVT.2022.3189027
• (2021) Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification. Entropy, 23(12), 1686. DOI: 10.3390/e23121686
