Data extraction from the web based on pre-defined schema

Meng Xiaofeng¹,
Lu Hongjun²,
Gang Haiyan¹ &
…
Gu Mingzhe¹

Abstract

With the development of the Internet, the World Wide Web has become an invaluable information source for most organizations. However, most documents available on the Web are in HTML, which was originally designed for document formatting with little consideration of content. Effectively extracting data from such documents remains a nontrivial task. In this paper, we present a schema-guided approach to extracting data from HTML pages. Under this approach, the user defines a schema specifying what is to be extracted and provides sample mappings between the schema and the HTML page. The system then induces the mapping rules and generates a wrapper that takes an HTML page as input and produces the required data as XML conforming to the user-defined schema. A prototype system implementing the approach has been developed. Preliminary experiments indicate that the proposed semi-automatic approach is not only easy to use but also produces wrappers that extract the required data from input pages with high accuracy.
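The pipeline described in the abstract (user-defined schema, sample mappings, induced rules, generated wrapper, XML output) can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not give the induction algorithm, so the sketch assumes the simplest possible rule, namely the tag path of the element that holds each sample value. All names (`PathIndexer`, `index_page`, `induce_rules`, `make_wrapper`) and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PathIndexer(HTMLParser):
    """Collects text snippets keyed by tag path, e.g. 'html/body/ul/li/b'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current element
        self.texts = {}   # tag path -> text snippets in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.texts.setdefault("/".join(self.stack), []).append(text)


def index_page(html):
    """Parses an HTML string and returns its tag-path -> texts index."""
    parser = PathIndexer()
    parser.feed(html)
    parser.close()
    return parser.texts


def induce_rules(sample_html, sample_mapping):
    """Toy induction (an assumption, not the paper's algorithm): for each
    schema field, the rule is the tag path that contains the sample value."""
    texts = index_page(sample_html)
    rules = {}
    for field, value in sample_mapping.items():
        for path, snippets in texts.items():
            if value in snippets:
                rules[field] = path
                break
        else:
            raise ValueError(f"sample value for {field!r} not found in page")
    return rules


def make_wrapper(root_name, record_name, rules):
    """Returns a function that maps an HTML page to XML shaped by the schema."""
    def wrapper(html):
        texts = index_page(html)
        columns = {field: texts.get(path, []) for field, path in rules.items()}
        root = ET.Element(root_name)
        # Pair up the i-th value of every field to form the i-th record.
        for values in zip(*columns.values()):
            record = ET.SubElement(root, record_name)
            for field, value in zip(columns, values):
                ET.SubElement(record, field).text = value
        return ET.tostring(root, encoding="unicode")
    return wrapper


# Hypothetical sample page and user-supplied mapping (schema field -> sample value).
SAMPLE = "<html><body><ul><li><b>Dune</b><i>9.99</i></li></ul></body></html>"
rules = induce_rules(SAMPLE, {"title": "Dune", "price": "9.99"})

# The generated wrapper is then applied to a new page with the same layout.
NEW_PAGE = ("<html><body><ul>"
            "<li><b>Emma</b><i>5.00</i></li>"
            "<li><b>Ulysses</b><i>7.50</i></li>"
            "</ul></body></html>")
wrap = make_wrapper("books", "book", rules)
xml_out = wrap(NEW_PAGE)
print(xml_out)
```

Plain tag paths break as soon as page layouts vary, which is presumably why a real system needs a richer rule language and more than one sample mapping; the sketch only shows how the pieces named in the abstract fit together.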

Author information

Authors and Affiliations

School of Information, Renmin University of China, 100872, Beijing, P.R. China
Meng Xiaofeng, Gang Haiyan & Gu Mingzhe
Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, P.R. China
Lu Hongjun

Corresponding author

Correspondence to Meng Xiaofeng.

Additional information

This research is partially supported by the National Natural Science Foundation of China under Grant No.60073014.

MENG Xiaofeng is a professor in the School of Information, Renmin University of China. He received his M.S. degree from Renmin University of China and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences. His recent research covers Web data management and mobile data management. He is also interested in DBMS implementation and natural language interfaces. He has published more than 40 research papers in database-related international journals and conferences. He is a member of the Database Society of CCF, ACM SIGMOD and IEEE CS.

LU Hongjun is a professor in the Department of Computer Science, Hong Kong University of Science and Technology (HKUST). He received his B.Sc. in automatic control from Tsinghua University, China, and his M.Sc. and Ph.D. in computer science from the University of Wisconsin, Madison. Before joining HKUST, Dr. Lu served as a principal research scientist at the Computer Science Center of Honeywell Inc. from 1985 to 1987 and as a professor at the National University of Singapore from 1987 to 2000. His recent research covers data quality, data warehousing and data mining. He is also interested in the development of Internet-based database applications and electronic business systems. He has published more than 100 research papers in database-related international journals and conferences.

WANG Haiyan is a graduate student in the School of Information, Renmin University of China. She received her B.Sc. in computer application from Renmin University of China in 2000. Her research interests include Web data management.

GU Mingzhe is a graduate student in the School of Information, Renmin University of China. She received her B.Sc. in computer application from Renmin University of China in 1999. Her research interests include Web data management.

About this article

Cite this article

Meng, X., Lu, H., Gang, H. et al. Data extraction from the web based on pre-defined schema. J. Comput. Sci. & Technol. 17, 377–388 (2002). https://doi.org/10.1007/BF02943278

Received: 18 June 2001
Revised: 16 November 2001
Issue Date: July 2002
DOI: https://doi.org/10.1007/BF02943278
