Abstract
Inconsistent distributions and representations of different modalities, such as image, text, and audio, create the “media gap”, which poses a great challenge to processing such heterogeneous data. Current state-of-the-art multimodal approaches focus mainly on the data provided by the target task and neglect extra information from different but related tasks. In this paper, we explore a multimodal representation learning architecture that leverages embedding representations trained on such extra information. Specifically, we integrate fixed model reuse into the architecture, which incorporates helpful information from existing models/features into a new model. Based on the proposed architecture, we study two tasks: multilingual OCR and long-text-based image retrieval. Multilingual OCR is a difficult task that deals with multiple languages on the same page; we exploit the textual embedding layer of an existing text-generation model to improve its accuracy. For long-text-based image retrieval, a cross-modal task, we leverage the intermediate visual embedding layer of an off-the-shelf image-captioning model to enhance retrieval. The experimental results validate the effectiveness of the proposed architecture in narrowing the “media gap” and yield observable improvements on both tasks: it outperforms state-of-the-art approaches by 4.2% in accuracy on multilingual OCR and improves the median rank of retrieval results from 9 to 6 on long-text-based image retrieval.
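To make the core idea concrete, the following is a minimal, illustrative sketch of fixed model reuse, not the authors' implementation. It assumes PyTorch and uses hypothetical names (FixedReuseModel, task_head) to show how a frozen embedding layer taken from an existing model can feed a new, trainable target-task model.

import torch
import torch.nn as nn

class FixedReuseModel(nn.Module):
    """Combine a frozen, reused embedding layer with a new trainable head (illustrative only)."""
    def __init__(self, pretrained_embedding: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.fixed_embedding = pretrained_embedding
        for p in self.fixed_embedding.parameters():
            p.requires_grad = False          # keep the reused layer fixed during training
        self.task_head = nn.Sequential(      # new component trained on the target task
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():                # no gradients flow into the reused embedding
            z = self.fixed_embedding(x)
        return self.task_head(z)

# Example usage: reuse a stand-in 128-d embedding and train only the new head.
embedding = nn.Linear(300, 128)              # placeholder for a pretrained embedding layer
model = FixedReuseModel(embedding, feat_dim=128, num_classes=10)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

In the paper's two applications, the reused layer would correspond to the textual embedding of a text-generation model (multilingual OCR) or the intermediate visual embedding of an image-captioning model (long-text-based image retrieval); only the new model's parameters are updated.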
Acknowledgements
This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xie, Z., Li, L., Zhong, X. et al. Enhancing multimodal deep representation learning by fixed model reuse. Multimed Tools Appl 78, 30769–30791 (2019). https://doi.org/10.1007/s11042-018-6556-6