Large-Scale Training Framework for Video Annotation

Authors:

Seong Jae Hwang,

Balakrishnan Varadarajan,

Apostol (Paul) Natsev

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 2394 - 2402

https://doi.org/10.1145/3292500.3330653

Published: 25 July 2019

Abstract

Video is one of the richest sources of information available online, but extracting deep insights from video content at internet scale remains an open problem in terms of depth of understanding, breadth of understanding, and scale. Over the last few years, the field of video understanding has made great strides due to the availability of large-scale video datasets and core advances in image, audio, and video modeling architectures. However, architectures that are state of the art on small-scale datasets are frequently impractical at internet scale, both to train on hundreds of millions of videos and to deploy for inference on billions of videos. In this paper, we present a MapReduce-based training framework that exploits both data parallelism and model parallelism to scale the training of complex video models. The proposed framework uses alternating optimization and full-batch fine-tuning, and supports large Mixture-of-Experts classifiers with hundreds of thousands of mixtures. This enables a trade-off between model depth and breadth, and makes it possible to shift model capacity between shared (generalization) layers and per-class (specialization) layers. We demonstrate that the proposed framework reaches state-of-the-art performance on the largest public video datasets, YouTube-8M and Sports-1M, and can scale to datasets 100 times larger.
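As an illustration of the data-parallel, full-batch idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a plain binary logistic classifier rather than a Mixture-of-Experts model, and the names `map_gradient`, `reduce_gradients`, `full_batch_gradient`, and `train` are hypothetical. The map step computes a partial gradient sum per data shard, and the reduce step adds the partial sums, so the result matches a gradient computed over the whole dataset at once.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_gradient(shard, weights):
    """Map step: sum of logistic-loss gradients over one data shard."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x
    return grad, len(shard)


def reduce_gradients(a, b):
    """Reduce step: add two partial (gradient sum, example count) pairs."""
    (grad_a, n_a), (grad_b, n_b) = a, b
    return [x + y for x, y in zip(grad_a, grad_b)], n_a + n_b


def full_batch_gradient(shards, weights):
    # In a real MapReduce job each shard would be handled by a separate
    # worker; here the map calls simply run one after another.
    partials = [map_gradient(shard, weights) for shard in shards]
    total, count = reduce(reduce_gradients, partials)
    return [g / count for g in total]


def train(shards, dim, steps=200, lr=0.5):
    """Full-batch gradient descent driven by the map and reduce steps."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = full_batch_gradient(shards, weights)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Toy data (illustrative only): two shards of (features, label) pairs.
shards = [
    [([1.0, 0.0], 1), ([0.9, 0.2], 1)],
    [([0.0, 1.0], 0), ([0.1, 0.8], 0)],
]
weights = train(shards, dim=2)
```

Because the reduce step is a plain sum, the number of shards changes only how the work is divided, not the gradient itself. This is the property that lets full-batch updates spread across many workers.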

Cited By

Lee H, Lee J, Ng J, Natsev P (2020). Large Scale Video Representation Learning via Relational Graph Clustering. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6806-6815. Online publication date: Jun-2020.
https://doi.org/10.1109/CVPR42600.2020.00684

Index Terms

Large-Scale Training Framework for Video Annotation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. MapReduce algorithms

Published In

KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2019

3305 pages

ISBN: 9781450362016

DOI: 10.1145/3292500

General Chairs: Ankur Teredesai (KenSci), Vipin Kumar (University of Minnesota)

Program Chairs: Ying Li (EV Analysis Corporation), Rómer Rosales (LinkedIn), Evimaria Terzi (Boston University), George Karypis (University of Minnesota)

Copyright © 2019 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

KDD '19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 4 - 8, 2019

Anchorage, AK, USA

Acceptance Rates

KDD '19 Paper Acceptance Rate 110 of 1,200 submissions, 9%

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
