DOI:10.1145/3200842.3200849
research-article

SDC: structured data collection by yourself

Published: 16 March 2018

Abstract

In our research, we focus on ways to crawl and scrape structured data embedded in web pages and collect it as a usable data set. To realize large-scale structured data collection, we developed SDC, a crawler specialized for structured data extraction.
SDC integrates the three modules required for extracting structured data: crawling, scraping, and output generation. In particular, we exploit the fact that structured data uses only simple syntax and semantics. As a result, information can be extracted from web pages across multiple domains with a single, simple configuration, without having to consider the hierarchical structure of each website or the DOM structure of its pages. Users can also preview part of the extracted data to check whether it is what they need.
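To illustrate why the DOM structure can be ignored, here is a minimal sketch of the scraping step for JSON-LD only, written with the Python standard library. It is not SDC's implementation; the class name `JsonLdExtractor`, the function `extract_jsonld`, and the two sample pages are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects JSON-LD objects from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the crawl
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_jsonld(html):
    """Return every JSON-LD object found in the page, wherever it is nested."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Two hypothetical pages with different layouts; the same call handles both.
PAGE_A = ('<html><head><script type="application/ld+json">'
          '{"@type": "Book", "name": "A"}</script></head><body></body></html>')
PAGE_B = ('<div><div><script type="application/ld+json">'
          '[{"@type": "Book", "name": "B"}]</script></div></div>')

for page in (PAGE_A, PAGE_B):
    print(extract_jsonld(page))
```

The extractor keys only on the script type, so no per-site selector or path is needed. Microdata would require a similar pass over `itemscope` and `itemprop` attributes, which this sketch does not cover.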
However, even when structured data is used, the way elements are written can differ from domain to domain. For example, one e-book sales website gives each author of a book a separate element, whereas another site describes multiple authors as a single comma-separated string. In addition, some websites implement navigation such as pagination with Ajax, so the corresponding JavaScript must be executed to obtain hyperlinks from them. To cope with these differences between domains, users can quickly specify supplementary data extraction mechanisms for individual websites.
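The author example can be sketched as a small normalization step with a per-site rule. This is an illustration under stated assumptions, not SDC's actual mechanism: the function `normalize_authors`, the `split_on` setting, and the domains in `SITE_RULES` are all hypothetical.

```python
def normalize_authors(value, split_on=None):
    """Return a flat list of author names.

    `split_on` is a hypothetical per-site setting: a delimiter for sites
    that pack several authors into one string. It is off by default,
    because splitting blindly would break names such as "Lovelace, Ada".
    """
    if value is None:
        return []
    values = value if isinstance(value, list) else [value]
    names = []
    for v in values:
        if isinstance(v, dict):
            # schema.org Person objects carry the name in "name".
            v = v.get("name", "")
        if not isinstance(v, str):
            continue
        parts = v.split(split_on) if split_on else [v]
        names.extend(p.strip() for p in parts if p.strip())
    return names


# Hypothetical per-domain rules supplied by the user.
SITE_RULES = {
    "books-a.example": None,  # one element per author
    "books-b.example": ",",   # authors as a comma-separated string
}

site_a = [{"@type": "Person", "name": "Ada Lovelace"},
          {"@type": "Person", "name": "Alan Turing"}]
site_b = "Ada Lovelace, Alan Turing"

print(normalize_authors(site_a, SITE_RULES["books-a.example"]))
print(normalize_authors(site_b, SITE_RULES["books-b.example"]))
```

Both calls yield the same two-element list, so downstream output generation can treat the two sites uniformly.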
In our experiment, we applied our system to 500 top domains with a large number of links. Structured data was present on 254 of these sites, and we successfully extracted it from 243 of them.
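Expressed as rates, these counts work out as follows (the percentages are derived here from the reported counts and are rounded to one decimal place):

```latex
\[
\frac{254}{500} = 50.8\% \quad \text{(sites with structured data)},
\qquad
\frac{243}{254} \approx 95.7\% \quad \text{(successful extraction among those sites)}
\]
```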

Published In

ICIST '18: Proceedings of the 8th International Conference on Information Systems and Technologies
March 2018
84 pages
ISBN:9781450364041
DOI:10.1145/3200842

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. JSON-LD
  2. microdata
  3. schema.org
  4. semantic web
  5. structured data

Qualifiers

  • Research-article

Conference

ICIST '18
