
DOI: 10.1145/2964284.2964308 · ACM Conferences · MM '16 Conference Proceedings · Research article

Play and Rewind: Optimizing Binary Representations of Videos by Self-Supervised Temporal Hashing

Published: 01 October 2016

Abstract

We focus on hashing videos into short binary codes for efficient Content-Based Video Retrieval (CBVR), a fundamental technique that supports access to the ever-growing abundance of videos on the Web. Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization. These stages have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework called Self-Supervised Temporal Hashing (SSTH) that captures the temporal nature of videos in an end-to-end learning-to-hash fashion. Specifically, the hash function of SSTH is an encoder RNN equipped with the proposed Binary LSTM (BLSTM), which generates binary codes for videos. The hash function is learned in a self-supervised fashion, where a decoder RNN reconstructs the original video frames in both forward and reverse orders. For binary code optimization, we develop a backpropagation rule that tackles the non-differentiability of BLSTM and allows efficient deep network training without suffering from binarization loss. Through extensive CBVR experiments on two real-world consumer video datasets, YouTube and Flickr, we show that SSTH consistently outperforms state-of-the-art video hashing methods; e.g., in terms of mAP@20, SSTH using only 128 bits outperforms others using 256 bits by 9% to 15% on both datasets.
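The abstract's central technical claim is a backpropagation rule that copes with the non-differentiable binarization inside BLSTM. The paper's exact rule is not reproduced on this page; as a minimal sketch of the general idea, the straight-through-style estimator below binarizes activations on the forward pass and passes the (clipped) gradient through unchanged on the backward pass. The function names `binarize` and `binarize_grad` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binarize(h):
    """Forward pass: map real-valued activations to binary codes in {-1, +1}."""
    return np.where(h >= 0, 1.0, -1.0)

def binarize_grad(upstream, h):
    """Backward pass (straight-through style): treat the sign function as
    identity, passing the upstream gradient through where |h| <= 1 and
    zeroing it where the activation has saturated."""
    return upstream * (np.abs(h) <= 1.0).astype(float)

h = np.array([-0.7, 0.2, 1.5, -1.2])        # pre-binarization activations
codes = binarize(h)                          # -> [-1.,  1.,  1., -1.]
grads = binarize_grad(np.ones_like(h), h)    # -> [ 1.,  1.,  0.,  0.]
```

In a full pipeline, retrieval would then rank database videos by Hamming distance between such codes, which is why shorter codes (e.g., the 128 bits quoted in the abstract) translate directly into faster search.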



Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. binary lstm
  2. sequence learning
  3. temporal hashing
  4. video retrieval

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 46
  • Downloads (last 6 weeks): 9
Reflects downloads up to 22 Dec 2024


Cited By

  • (2024) Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech. Information 15(4): 184. DOI: 10.3390/info15040184. Online publication date: 28-Mar-2024
  • (2024) Chemical structure recognition method based on attention mechanism and encoder-decoder architecture. Journal of Image and Graphics 29(7): 1960-1969. DOI: 10.11834/jig.230367. Online publication date: 2024
  • (2024) Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia: 5883-5891. DOI: 10.1145/3664647.3681606. Online publication date: 28-Oct-2024
  • (2024) Self-Supervised Temporal Sensitive Hashing for Video Retrieval. IEEE Transactions on Multimedia 26: 9021-9035. DOI: 10.1109/TMM.2024.3385183. Online publication date: 2024
  • (2024) Efficient Unsupervised Video Hashing With Contextual Modeling and Structural Controlling. IEEE Transactions on Multimedia 26: 7438-7450. DOI: 10.1109/TMM.2024.3368924. Online publication date: 2024
  • (2024) Semantic-Preserving Surgical Video Retrieval With Phase and Behavior Coordinated Hashing. IEEE Transactions on Medical Imaging 43(2): 807-819. DOI: 10.1109/TMI.2023.3321382. Online publication date: Feb-2024
  • (2024) XDrain: Effective log parsing in log streams using fixed-depth forest. Information and Software Technology: 107546. DOI: 10.1016/j.infsof.2024.107546. Online publication date: Aug-2024
  • (2023) Forged Video Detection Using Deep Learning. Applied Computational Intelligence and Soft Computing 2023. DOI: 10.1155/2023/6661192. Online publication date: 1-Jan-2023
  • (2023) CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing. Proceedings of the 31st ACM International Conference on Multimedia: 1677-1688. DOI: 10.1145/3581783.3613440. Online publication date: 26-Oct-2023
  • (2023) Contrastive Transformer Hashing for Compact Video Representation. IEEE Transactions on Image Processing 32: 5992-6003. DOI: 10.1109/TIP.2023.3326994. Online publication date: 2023
