DOI: 10.1145/375551.375556

Querying websites using compact skeletons

Published: 01 May 2001

Abstract

Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the scalability of the system.
We introduce a website structure inference mechanism called compact skeletons that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.
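To make the phrase "a transformation from websites or other hierarchical data, such as XML documents, to relational tables" concrete, here is a minimal sketch of the kind of hand-written wrapper the abstract describes as the traditional approach. It is not the compact-skeleton algorithm, which this page does not describe. The sample document, its element and attribute names (`jobs`, `location`, `job`, `title`, `dept`), and the function `to_rows` are all assumptions invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: every element and attribute name here is invented.
SAMPLE = """
<jobs>
  <location name="Austin">
    <job><title>Engineer</title><dept>Hardware</dept></job>
    <job><title>Analyst</title><dept>Finance</dept></job>
  </location>
  <location name="Boston">
    <job><title>Designer</title><dept>Web</dept></job>
  </location>
</jobs>
"""


def to_rows(xml_text):
    """Flatten the location -> job nesting into (location, title, dept) tuples.

    The nesting depth and tag names are hard-coded, which is exactly what
    makes a hand-built wrapper specific to a single source.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for location in root.findall("location"):
        for job in location.findall("job"):
            rows.append((
                location.get("name"),   # value inherited from the parent node
                job.findtext("title"),
                job.findtext("dept"),
            ))
    return rows


if __name__ == "__main__":
    for row in to_rows(SAMPLE):
        print(row)
```

Because the structure is written into the code by hand, a second source with a different layout would need its own wrapper, which is the scalability limit the abstract points to.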

Published In

PODS '01: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
May 2001
301 pages
ISBN: 1581133618
DOI: 10.1145/375551
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Article

Conference

SIGMOD/PODS01
Acceptance Rates

PODS '01 paper acceptance rate: 26 of 99 submissions (26%)
Overall acceptance rate: 642 of 2,707 submissions (24%)

Cited By

  • (2005) Web data extraction based on structural similarity. Knowledge and Information Systems 8:4 (438-461). DOI: 10.1007/s10115-004-0188-z. Online publication date: 2-Feb-2005
  • (2004) WICCAP: from semi-structured data to structured data. Proceedings. 11th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems, 2004 (86-93). DOI: 10.1109/ECBS.2004.1316686. Online publication date: 2004
  • (2004) OWDEAH: Online Web Data Extraction Based on Access History. Data Warehousing and Knowledge Discovery (269-278). DOI: 10.1007/978-3-540-30076-2_27. Online publication date: 2004
  • (2003) Efficiently Maintaining Structural Associations of Semistructured Data. Advances in Informatics (118-132). DOI: 10.1007/3-540-38076-0_8. Online publication date: 25-Jun-2003
  • (2002) Domain-specific information extraction structures. Proceedings. 13th International Workshop on Database and Expert Systems Applications (80-84). DOI: 10.1109/DEXA.2002.1045880. Online publication date: 2002
  • (2002) Superimposed Schematics: Introducing E-R Structure for In-Situ Information Selections. Conceptual Modeling — ER 2002 (90-104). DOI: 10.1007/3-540-45816-6_17. Online publication date: 26-Sep-2002
  • (2002) Generating Relations from XML Documents. Database Theory — ICDT 2003 (285-299). DOI: 10.1007/3-540-36285-1_19. Online publication date: 16-Dec-2002
