Article

Querying websites using compact skeletons

Authors:

Anand Rajaraman,

Jeffrey D. Ullman

PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 16 - 27

https://doi.org/10.1145/375551.375556

Published: 01 May 2001

Abstract

Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the scalability of the system.

We introduce a website structure inference mechanism called compact skeletons that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.

Published In

PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

May 2001

301 pages

ISBN: 1581133618

DOI: 10.1145/375551

Chairman:
Peter Buneman
Univ. of Pennsylvania

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS01

Sponsor:

SIGMOD

SIGMOD/PODS01: ACM SIGMOD International Conference on Management of Data

Santa Barbara, California, USA

Acceptance Rates

PODS '01 Paper Acceptance Rate: 26 of 99 submissions, 26%

Overall Acceptance Rate: 642 of 2,707 submissions, 24%
