Abstract
The inclusion of visually impaired people in daily life is a challenging and active area of research. This work studies how to deliver information about the surroundings as verbal descriptions in Spanish through wearable devices. We use a neural network (DenseCap) both to identify objects and to generate phrases about them. DenseCap runs on a server, describing images fed to it by a smartphone application; its output is the text that the smartphone verbalizes. Our implementation achieves a mean Average Precision (mAP) of 5.0, a metric that jointly measures object localization and caption quality, and takes an average of 7.5 s from the moment a picture is taken until the user receives the verbalization in Spanish.
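Although this excerpt does not include the authors' implementation, a minimal sketch of the client-server loop described above may help. It assumes a Flask HTTP endpoint; the names /describe, run_densecap, and translate_to_spanish are illustrative placeholders for the DenseCap model and for whatever English-to-Spanish step the system uses, not the authors' actual code.

import io

from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

def run_densecap(image):
    # Placeholder for DenseCap inference (Johnson et al., 2016); the real
    # system runs the model on the server's GPU and returns region captions.
    return ["a man wearing a blue shirt", "a dog on the sidewalk"]

def translate_to_spanish(caption):
    # Placeholder for the English-to-Spanish step; the translation backend
    # used by the authors is not specified in this excerpt.
    return caption

@app.route("/describe", methods=["POST"])
def describe():
    # The smartphone application posts the JPEG bytes in the request body.
    image = Image.open(io.BytesIO(request.data)).convert("RGB")
    captions = run_densecap(image)
    spanish = [translate_to_spanish(c) for c in captions]
    # The phone receives plain text and hands it to its text-to-speech engine.
    return jsonify({"descriptions": spanish})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

In this sketch the phone would POST a captured JPEG to /describe and pass each returned string to its text-to-speech engine, which accounts for the end-to-end latency reported in the abstract.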
References
Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y.: From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv:1511.03292v1 (2015)
Atkinson, K.: GNU Aspell. http://aspell.net/. Accessed 08 Jan 2018
Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Workshop on Statistical Machine Translation (2014)
Eco, U.: Tratado de semiótica general. Debolsillo, Madrid (2008)
Eslami, S., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., Hinton, G.: Attend, infer, repeat: fast scene understanding with generative models. arXiv:1603.08575 (2016)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Greene, M., Botros, A., Beck, D., Fei-Fei, L.: What you see is what you expect: rapid scene understanding benefits from prior experience. Atten. Percept. Psychophys. 77(4), 1239–1251 (2015)
Helcl, J., Libovický, J.: CUNI system for the WMT17 multimodal translation task. arXiv:1707.04550 (2017)
Hitschler, J., Schamoni, S., Riezler, S.: Multimodal pivots for image caption translation. arXiv:1601.03916v3 (2016)
Instituto Nacional de Estadística y Geografía: Estadísticas a propósito del día internacional de las personas con discapacidad. http://tinyurl.com/discapacidad. Accessed 15 Dec 2017
Johnson, J., Karpathy, A., Fei-Fei, L.: DenseCap: fully convolutional localization networks for dense captioning. In: IEEE CVPR, pp. 4565–4574 (2016)
Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: IEEE CVPR (2015)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: IEEE CVPR (2015)
Kiros, J., Salakhutdinov, R., Zemel, R.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539v1 (2014)
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D., Bernstein, M., Fei-Fei, L.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. (2016)
Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A., Berg, T.: Baby talk: understanding and generating simple image descriptions. In: IEEE CVPR (2011)
Lan, W., Li, X., Dong, J.: Fluency-guided cross-lingual image captioning. In: Proceedings of the 2017 ACM on Multimedia Conference, pp. 1549–1557 (2017)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
Leo, M., Medioni, G., Trivedi, M., Kanade, T., Farinella, G.: Computer vision for assistive technologies. Comput. Vis. Image Underst. 154, 1–15 (2017)
Li, L.J., Socher, R., Fei-Fei, L.: Towards total scene understanding: classification, annotation and segmentation in an automatic framework. In: IEEE CVPR (2009)
Li, S., Kulkarni, G., Berg, T., Berg, A., Choi, Y.: Composing simple image descriptions using web-scale n-grams. In: Conference on Computational Natural Language Learning (2011)
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (M-RNN). In: ICLR (2015)
Miyazaki, T., Shimizu, N.: Cross-lingual image caption generation. In: Annual Meeting of the Association for Computational Linguistics, pp. 1780–1790 (2016)
Nisbet, R., Elder, J., Miner, G.: Handbook of Statistical Analysis and Data Mining Applications. Elsevier Inc., Amsterdam (2009)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6 (2015)
Tian, Y., Yang, X., Yi, C., Arditi, A.: Toward a computer vision-based wayfinding aid for blind persons to access unfamiliar indoor environments. Mach. Vis. Appl. 24(3), 521–535 (2013)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. arXiv:1411.4555v2 (2014)
Wei, Q., Wang, X., Li, X.: Harvesting deep models for cross-lingual image annotation. In: Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing (2017). http://doi.acm.org/10.1145/3095713.3095751
World Health Organization: Global data on visual impairments 2010. https://tinyurl.com/globaldata2010. Accessed 29 Jan 2018
World Health Organization: Visual impairment and blindness. http://tinyurl.com/impaired. Accessed 08 Dec 2017
Yao, B., Yang, X., Lin, L., Lee, M., Zhu, S.: I2T: image parsing to text description. Proc. IEEE 98, 1485–1508 (2010)
Yoshikawa, Y., Shigeto, Y., Takeuchi, A.: STAIR captions: constructing a large-scale Japanese image caption dataset. arXiv:1705.00823v1 (2017)
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation through the donation of the Tesla K40 GPU used for this research. Rodrigo Carrillo, Miguel Torres, and Luis Sáenz developed the Android application. This work was partially funded by Grant SIP-IPN 20180779 for Joaquín Salas. Bogdan Raducanu is supported by Grant No. TIN2016-79717-R, funded by MINECO, Spain. Alejandro Gomez-Garay is supported by Grant No. 434110/618827, funded by CONACyT.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Gomez-Garay, A., Raducanu, B., Salas, J. (2018). Dense Captioning of Natural Scenes in Spanish. In: Martínez-Trinidad, J., Carrasco-Ochoa, J., Olvera-López, J., Sarkar, S. (eds) Pattern Recognition. MCPR 2018. Lecture Notes in Computer Science, vol. 10880. Springer, Cham. https://doi.org/10.1007/978-3-319-92198-3_15
DOI: https://doi.org/10.1007/978-3-319-92198-3_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92197-6
Online ISBN: 978-3-319-92198-3
eBook Packages: Computer Science (R0)