DOI: 10.1145/3340531.3412783
research-article

MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

Published: 19 October 2020

Abstract

In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset, a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

Supplementary Material

MP4 File (3340531.3412783.mp4)
This video is a presentation of the paper "MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities". MLM is a resource for training and evaluating multitask systems on diverse data. We also present the generation process for the dataset, a set of benchmark evaluation tasks, and a multitask machine learning framework. Please find more information on the resource and project at http://cleopatra.ijs.si/goal-mlm/.


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. machine learning
  2. multilingual data
  3. multimodal data
  4. multitask learning

Qualifiers

  • Research-article

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (last 12 months): 41
  • Downloads (last 6 weeks): 6
Reflects downloads up to 13 Nov 2024.

Cited By

  • (2024) SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10822-10832. DOI: 10.1109/CVPR52733.2024.01029. Online publication date: 16-Jun-2024.
  • (2023) Multimodal Geolocation Estimation of News Photos. Advances in Information Retrieval, 204-220. DOI: 10.1007/978-3-031-28238-6_14. Online publication date: 2-Apr-2023.
  • (2023) MM-Locate-News: Multimodal Focus Location Estimation in News. MultiMedia Modeling, 204-216. DOI: 10.1007/978-3-031-27077-2_16. Online publication date: 9-Jan-2023.
  • (2021) WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2443-2449. DOI: 10.1145/3404835.3463257. Online publication date: 11-Jul-2021.
  • (2021) GeoWINE: Geolocation based Wiki, Image, News and Event Retrieval. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2565-2569. DOI: 10.1145/3404835.3462786. Online publication date: 11-Jul-2021.
