DOI: 10.1145/3404835.3463260
short-paper
Open access

WTR: A Test Collection for Web Table Retrieval

Published: 11 July 2021

Abstract

We describe the development, characteristics, and availability of a test collection for the task of Web table retrieval, built on a large-scale Web table corpus extracted from the Common Crawl. Because a Web table usually comes with rich context, such as the page title and surrounding paragraphs, we provide relevance judgments not only for query-table pairs but also for pairs of a query and a table's context, which previous test collections ignore. To facilitate future research with this benchmark, we detail how the dataset is pre-processed and report baseline results from both traditional and recently proposed table retrieval methods. Our experimental results show that proper use of context labels can benefit previous table retrieval methods.

Supplementary Material

MP4 File (zoom_1.mp4)

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021
2998 pages
ISBN: 9781450380379
DOI: 10.1145/3404835
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset search
  2. datasets
  3. table search

Qualifiers

  • Short-paper

Funding Sources

  • NSF

Conference

SIGIR '21

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2024) A Hierarchical Multi-Task Learning Framework for Semantic Annotation in Tabular Data. Entropy 26:8 (664). DOI: 10.3390/e26080664. Online publication date: 4-Aug-2024
  • (2024) A Large Scale Test Corpus for Semantic Table Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (1142-1151). DOI: 10.1145/3626772.3657877. Online publication date: 10-Jul-2024
  • (2024) A Joint Multi-task Learning Model for Web Table-to-Knowledge Graph Matching. Knowledge Science, Engineering and Management (406-418). DOI: 10.1007/978-981-97-5492-2_31. Online publication date: 26-Jul-2024
  • (2023) Dataset Search and Augmentation. ACM SIGIR Forum 56:1 (1-2). DOI: 10.1145/3582524.3582544. Online publication date: 27-Jan-2023
  • (2023) A Taxonomy of Dataset Search. Advances on Intelligent Computing and Data Science (562-573). DOI: 10.1007/978-3-031-36258-3_50. Online publication date: 17-Aug-2023
  • (2021) MGNETS. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2945-2949). DOI: 10.1145/3459637.3482140. Online publication date: 26-Oct-2021
