
SDC: structured data collection by yourself

Authors:

Takuya Ohshima,

Motomichi Toyama

ICIST '18: Proceedings of the 8th International Conference on Information Systems and Technologies

Article No.: 3, Pages 1 - 8

https://doi.org/10.1145/3200842.3200849

Published: 16 March 2018

Abstract

In our research, we focus on ways to crawl and scrape structured data embedded in web pages and to collect it as a usable data set. To enable large-scale structured data collection, we developed SDC, a crawler specialized for structured data extraction.

SDC integrates the three modules required for extracting structured data: crawling, scraping, and output generation. In particular, we exploit the fact that structured data uses only simple syntax and semantics. As a result, information can be extracted from web pages across multiple domains with a single, simple configuration, without considering the hierarchical structure of websites or the DOM structure of web pages. Users can also display part of the extracted data as a preview to check whether it is what they need.
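To illustrate why no per-site DOM configuration is needed, the following is a minimal sketch that pulls embedded structured data out of two pages with different layouts using the same code. It assumes the data is embedded as JSON-LD in `<script type="application/ld+json">` blocks; the abstract does not say which syntaxes SDC handles, and the names `JsonLdExtractor` and `extract_structured_data` are hypothetical, not part of SDC.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # skip malformed blocks instead of failing the whole page
            # A block may hold a single object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_structured_data(html):
    """Return all JSON-LD objects found in the page, wherever they sit in the DOM."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Two hypothetical pages with different DOM layouts.
PAGE_A = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Book", "name": "A"}</script></head><body><p>x</p></body></html>'
)
PAGE_B = (
    '<html><body><section><article><script type="application/ld+json">'
    '[{"@type": "Book", "name": "B"}]</script></article></section></body></html>'
)

for page in (PAGE_A, PAGE_B):
    print(extract_structured_data(page))
```

The extractor never refers to the position of the script block in the page, which is the property the paragraph above relies on.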

However, even when structured data is used, the way elements are written can differ from domain to domain. For example, one e-book sales website assigns a separate element to each author of a book, while another describes multiple authors as a single comma-separated string. Some websites also implement navigation such as pagination using Ajax, so the corresponding JavaScript must be executed to obtain hyperlinks from them. To cope with these differences between domains, users can quickly specify supplementary data extraction mechanisms for individual websites.
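The author example can be sketched as a small normalization step. This is only an illustration of the kind of site-specific rule a user might supply, not SDC's actual mechanism; the function name `normalize_authors` and the sample names are hypothetical.

```python
def normalize_authors(value):
    """Return a list of author names from either representation.

    Accepts a list with one entry per author, or a single string in which
    the names are separated by commas (an assumed site-specific convention).
    """
    if value is None:
        return []
    if isinstance(value, str):
        parts = value.split(",")
    else:
        parts = list(value)
    # Trim whitespace and drop empty entries such as a trailing comma.
    return [p.strip() for p in parts if p.strip()]


# Site 1: one element per author. Site 2: one comma-separated string.
site_one = ["Alice Smith", "Bob Jones"]
site_two = "Alice Smith, Bob Jones"

print(normalize_authors(site_one))
print(normalize_authors(site_two))
```

A rule like this would have to be chosen per site: on a site that writes names as "Last, First", splitting on commas would break a single author into two, which is one reason such handling cannot be part of the shared configuration.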

In our experiment, we ran our system on 500 top domains that have a large number of links. Structured data was present on 254 of these sites, and we successfully extracted it from 243 of them.
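For reference, the rates implied by these counts can be worked out directly. The percentages below are derived from the three numbers in the abstract and are not stated in it; the variable names are illustrative.

```python
total_sites = 500        # domains examined
sites_with_data = 254    # sites where structured data was present
sites_extracted = 243    # sites from which extraction succeeded

# Share of examined sites that carry structured data.
coverage = sites_with_data / total_sites
# Share of those sites from which extraction succeeded.
success_rate = sites_extracted / sites_with_data

print(f"Sites with structured data: {coverage:.1%}")    # 50.8%
print(f"Extraction success rate:    {success_rate:.1%}")  # 95.7%
```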



Information & Contributors

Information

Published In


ICIST '18: Proceedings of the 8th International Conference on Information Systems and Technologies

March 2018

84 pages

ISBN:9781450364041

DOI:10.1145/3200842

Conference Chair:
Mohamed Ridda Laouar,
Program Chair:
Ejub Kajan

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ICIST '18

ICIST '18: 8th International Conference on Information Systems and Technologies

March 16 - 18, 2018

Istanbul, Turkey
