Abstract
Inconsistent distributions and representations of different modalities, such as image, text, and audio, create the “media gap”, which poses a great challenge to processing such heterogeneous data. Current state-of-the-art multimodal approaches focus mainly on the data provided by the target task and neglect extra information from different but related tasks. In this paper, we explore a multimodal representation learning architecture that leverages embedding representations trained on such extra information. Specifically, we integrate fixed model reuse into the architecture, which incorporates helpful information from existing models/features into a new model. Based on the proposed architecture, we study two tasks: multilingual OCR and long-text-based image retrieval. Multilingual OCR is a difficult task that deals with multiple languages on the same page; we exploit the textual embedding layer of an existing text-generation model to improve its accuracy. For long-text-based image retrieval, a cross-modal task, we leverage the intermediate visual embedding layer of an off-the-shelf image-captioning model to enhance retrieval. The experimental results validate the effectiveness of the proposed architecture in narrowing the “media gap” and yield observable improvements on both tasks: it outperforms state-of-the-art approaches by 4.2% in accuracy on multilingual OCR and improves the median rank of retrieval results from 9 to 6 on long-text-based image retrieval.
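To make the core idea concrete, the following is a minimal, illustrative sketch of fixed model reuse, not the authors' implementation. It assumes PyTorch and uses hypothetical names (FixedReuseModel, task_head) to show how a frozen embedding layer taken from an existing model can feed a new, trainable target-task model.

import torch
import torch.nn as nn

class FixedReuseModel(nn.Module):
    """Combine a frozen, reused embedding layer with a new trainable head (illustrative only)."""
    def __init__(self, pretrained_embedding: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.fixed_embedding = pretrained_embedding
        for p in self.fixed_embedding.parameters():
            p.requires_grad = False          # keep the reused layer fixed during training
        self.task_head = nn.Sequential(      # new component trained on the target task
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():                # no gradients flow into the reused embedding
            z = self.fixed_embedding(x)
        return self.task_head(z)

# Example usage: reuse a stand-in 128-d embedding and train only the new head.
embedding = nn.Linear(300, 128)              # placeholder for a pretrained embedding layer
model = FixedReuseModel(embedding, feat_dim=128, num_classes=10)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

In the paper's two applications, the reused layer would correspond to the textual embedding of a text-generation model (multilingual OCR) or the intermediate visual embedding of an image-captioning model (long-text-based image retrieval); only the new model's parameters are updated.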
Acknowledgements
This work is supported by the National Social Science Foundation of China (Grant No. 15BGL048), the Hubei Province Science and Technology Support Project (Grant No. 2015BAA072), the Hubei Provincial Natural Science Foundation of China (Grant No. 2017CFA012), and the Fundamental Research Funds for the Central Universities (WUT: 2017II39GX).
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Xie, Z., Li, L., Zhong, X. et al. Enhancing multimodal deep representation learning by fixed model reuse. Multimed Tools Appl 78, 30769–30791 (2019). https://doi.org/10.1007/s11042-018-6556-6