
DOI: 10.1145/2964284.2964308 · ACM Conferences · MM '16 Conference Proceedings · Research article

Play and Rewind: Optimizing Binary Representations of Videos by Self-Supervised Temporal Hashing

Published: 01 October 2016

Abstract

We focus on hashing videos into short binary codes for efficient Content-Based Video Retrieval (CBVR), a fundamental technique that supports access to the ever-growing abundance of videos on the Web. Existing video hash functions are built on three isolated stages: frame pooling, relaxed learning, and binarization. These stages have not adequately explored the temporal order of video frames in a joint binary optimization model, resulting in severe information loss. In this paper, we propose a novel unsupervised video hashing framework called Self-Supervised Temporal Hashing (SSTH) that captures the temporal nature of videos in an end-to-end learning-to-hash fashion. Specifically, the hash function of SSTH is an encoder RNN equipped with the proposed Binary LSTM (BLSTM), which generates binary codes for videos. The hash function is learned in a self-supervised fashion, where a decoder RNN reconstructs the original video frames in both forward and reverse orders. For binary code optimization, we develop a backpropagation rule that tackles the non-differentiability of BLSTM and allows efficient deep network training without suffering from binarization loss. Through extensive CBVR experiments on two real-world consumer video datasets, YouTube and Flickr, we show that SSTH consistently outperforms state-of-the-art video hashing methods; e.g., in terms of mAP@20, SSTH using only 128 bits outperforms others using 256 bits by 9% to 15% on both datasets.
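The abstract's central technical claim is a backpropagation rule that copes with the non-differentiable binarization inside BLSTM. The paper's exact rule is not reproduced on this page; as a minimal sketch of the general idea, the straight-through-style estimator below binarizes activations on the forward pass and passes the (clipped) gradient through unchanged on the backward pass. The function names `binarize` and `binarize_grad` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def binarize(h):
    """Forward pass: map real-valued activations to binary codes in {-1, +1}."""
    return np.where(h >= 0, 1.0, -1.0)

def binarize_grad(upstream, h):
    """Backward pass (straight-through style): treat the sign function as
    identity, passing the upstream gradient through where |h| <= 1 and
    zeroing it where the activation has saturated."""
    return upstream * (np.abs(h) <= 1.0).astype(float)

h = np.array([-0.7, 0.2, 1.5, -1.2])        # pre-binarization activations
codes = binarize(h)                          # -> [-1.,  1.,  1., -1.]
grads = binarize_grad(np.ones_like(h), h)    # -> [ 1.,  1.,  0.,  0.]
```

In a full pipeline, retrieval would then rank database videos by Hamming distance between such codes, which is why shorter codes (e.g., the 128 bits quoted in the abstract) translate directly into faster search.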



Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. binary lstm
  2. sequence learning
  3. temporal hashing
  4. video retrieval

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Article Metrics

  • Downloads (last 12 months): 46
  • Downloads (last 6 weeks): 9
Reflects downloads up to 22 Dec 2024


Cited By

  • (2024) Whisper40: A Multi-Person Chinese Whisper Speaker Recognition Dataset Containing Same-Text Neutral Speech. Information 15(4): 184. DOI: 10.3390/info15040184. Online publication date: 28-Mar-2024
  • (2024) Chemical structure recognition method based on attention mechanism and encoder-decoder architecture. Journal of Image and Graphics 29(7): 1960-1969. DOI: 10.11834/jig.230367. Online publication date: 2024
  • (2024) Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia: 5883-5891. DOI: 10.1145/3664647.3681606. Online publication date: 28-Oct-2024
  • (2024) Self-Supervised Temporal Sensitive Hashing for Video Retrieval. IEEE Transactions on Multimedia 26: 9021-9035. DOI: 10.1109/TMM.2024.3385183. Online publication date: 2024
  • (2024) Efficient Unsupervised Video Hashing With Contextual Modeling and Structural Controlling. IEEE Transactions on Multimedia 26: 7438-7450. DOI: 10.1109/TMM.2024.3368924. Online publication date: 2024
  • (2024) Semantic-Preserving Surgical Video Retrieval With Phase and Behavior Coordinated Hashing. IEEE Transactions on Medical Imaging 43(2): 807-819. DOI: 10.1109/TMI.2023.3321382. Online publication date: Feb-2024
  • (2024) XDrain: Effective log parsing in log streams using fixed-depth forest. Information and Software Technology: 107546. DOI: 10.1016/j.infsof.2024.107546. Online publication date: Aug-2024
  • (2023) Forged Video Detection Using Deep Learning. Applied Computational Intelligence and Soft Computing 2023. DOI: 10.1155/2023/6661192. Online publication date: 1-Jan-2023
  • (2023) CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing. Proceedings of the 31st ACM International Conference on Multimedia: 1677-1688. DOI: 10.1145/3581783.3613440. Online publication date: 26-Oct-2023
  • (2023) Contrastive Transformer Hashing for Compact Video Representation. IEEE Transactions on Image Processing 32: 5992-6003. DOI: 10.1109/TIP.2023.3326994. Online publication date: 2023
