DOI: 10.1145/956699.956701
Article

Schema-guided wrapper maintenance for web-data extraction

Published: 07 November 2003

Abstract

Extracting data from Web pages using wrappers is a fundamental problem that arises in a wide variety of applications of great practical interest. Two main issues are relevant to Web-data extraction: wrapper generation and wrapper maintenance. In this paper, we propose a novel schema-guided approach to the problem of automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and hyperlinks of the extracted data items. Our approach uses these preserved features to identify the locations of the desired values in the changed pages, and repairs the wrappers accordingly by inducing semantic blocks from the HTML tree. Our extensive experiments on real Web sites show that the proposed approach can effectively maintain wrappers so that they extract the desired data with high accuracy.
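
To make the idea of relocating values through preserved features more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the scoring scheme, the function names (`collect_nodes`, `relocate`), and the sample page are all assumptions made for illustration, and the induction of semantic blocks is not modelled. Each data item is described by a syntactic pattern (a regular expression), an optional annotation (the label text just before the value), and whether the value was a hyperlink; the sketch looks for the text node in the changed page that best agrees with those features and returns its new tag path.

```python
import re
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text, href) for every non-empty text node."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, each with its attributes
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        path = "/" + "/".join(tag for tag, _ in self.stack)
        # href of the innermost enclosing <a>, if any
        href = next(
            (attrs.get("href") for tag, attrs in reversed(self.stack) if tag == "a"),
            None,
        )
        self.nodes.append((path, text, href))


def collect_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def relocate(nodes, pattern, annotation=None, is_link=False):
    """Return (path, text) of the node that best matches the preserved features.

    Assumed scoring: the syntactic pattern is a hard requirement; a matching
    annotation in the preceding text node and an agreeing hyperlink status
    each add one point.
    """
    best_score, best = 0, None
    for i, (path, text, href) in enumerate(nodes):
        if not re.fullmatch(pattern, text):
            continue
        score = 1
        if annotation and i > 0 and annotation.lower() in nodes[i - 1][1].lower():
            score += 1
        if is_link == (href is not None):
            score += 1
        if score > best_score:
            best_score, best = score, (path, text)
    return best


# Hypothetical changed page: the old wrapper's paths no longer apply.
CHANGED_PAGE = (
    '<h2><a href="/item/42">Blue Widget</a></h2>'
    "<p>List <s>$15.00</s></p>"
    "<div><span>Price:</span><b>$10.49</b></div>"
)

nodes = collect_nodes(CHANGED_PAGE)
price = relocate(nodes, r"\$\d+\.\d{2}", annotation="Price")
title = relocate(nodes, r"[A-Z][\w ]+", is_link=True)
print(price, title)
```

In this toy page two nodes match the price pattern, and the annotation "Price" is what separates the current price from the struck-through list price. A repaired wrapper would then record the returned path in place of the stale one.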

Published In

WIDM '03: Proceedings of the 5th ACM international workshop on Web information and data management
November 2003
164 pages
ISBN: 1581137257
DOI: 10.1145/956699
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. extraction
  2. maintenance
  3. schema
  4. web
  5. wrapper

Qualifiers

  • Article

Conference

CIKM03

Cited By

  • (2021) Engineering Web Augmentation software: A development method for enabling end-user maintenance. Information and Software Technology, 106735. DOI: 10.1016/j.infsof.2021.106735. Online publication date: Oct-2021
  • (2019) What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors. ACM Transactions on the Web 13:2, 1-19. DOI: 10.1145/3316810. Online publication date: 27-Mar-2019
  • (2019) Adaptively Extracting Structured Data from Web Pages. 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), 1524-1525. DOI: 10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00221. Online publication date: Dec-2019
  • (2019) A Modified Approach for Automatic Extraction of Geographic Data. 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), 1-7. DOI: 10.1109/ICOMET.2019.8673430. Online publication date: Jan-2019
  • (2019) Deep Web crawling. World Wide Web 22:4, 1577-1610. DOI: 10.1007/s11280-018-0602-1. Online publication date: 1-Jul-2019
  • (2018) Wrapper Stability. Encyclopedia of Database Systems, 4731-4732. DOI: 10.1007/978-1-4614-8265-9_1169. Online publication date: 7-Dec-2018
  • (2017) Wrapper approaches for web data extraction: A review. 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), 1-6. DOI: 10.1109/ICEEI.2017.8312458. Online publication date: Nov-2017
  • (2016) Automating Web Tasks by Simulating Browser Behaviors. 2016 International Conference on Platform Technology and Service (PlatCon), 1-5. DOI: 10.1109/PlatCon.2016.7456785. Online publication date: Feb-2016
  • (2016) Joint repairs for web wrappers. 2016 IEEE 32nd International Conference on Data Engineering (ICDE), 1146-1157. DOI: 10.1109/ICDE.2016.7498320. Online publication date: May-2016
  • (2016) Wrapper Stability. Encyclopedia of Database Systems, 1-2. DOI: 10.1007/978-1-4899-7993-3_1169-2. Online publication date: 25-Oct-2016
