Ten years of webtables

Published: 01 August 2018

Abstract

In 2008, we wrote about WebTables, an effort to exploit the large and diverse set of structured databases casually published online in the form of HTML tables. The past decade has seen a flurry of research and commercial activity around the WebTables project itself, as well as around the broad topic of informal online structured data. In this paper, we review the WebTables project and try to place it in the broader context of the decade of work that followed. We also show how the progress of the past ten years sets up an exciting agenda for the future, one that will draw upon many corners of the data management community.

Published In

Proceedings of the VLDB Endowment  Volume 11, Issue 12
August 2018
426 pages
ISSN:2150-8097

Publisher

VLDB Endowment
