Article

Research on Adaptive Wrapper in Deep Web Data Extraction

Authors:

Xin Liu

IOV 2015: Proceedings of the Second International Conference on Internet of Vehicles - Safe and Intelligent Mobility - Volume 9502

Pages 409 - 423

https://doi.org/10.1007/978-3-319-27293-1_36

Published: 19 December 2015

Abstract

With the rapid development of Internet technology, the Deep Web holds vast amounts of data and, as the Web grows rapidly, has become a huge data source. On script-generated websites, many documents share a common HTML tree structure, which allows users to extract information of interest from deep webpages effectively using wrappers. However, because the tree structure evolves over time, wrappers break frequently and need to be re-learned. In this paper, we explore the problem of constructing adaptive wrappers for deep webpages. To keep web extraction robust when webpages change, we propose a minimum cost script edit model based on machine learning techniques. The method considers three edit operations under structural changes: inserting nodes, deleting nodes, and substituting nodes' labels. An extraction model is first obtained for the 51job site; pages randomly sampled from the zhaopin site are then processed with this extraction model to train the new wrapper. The resulting wrapper is highly versatile and achieves adaptive extraction. Experimental results show that the proposed approach can improve the extraction accuracy of target data and effectively address adaptive wrapper construction for massive Deep Web data.
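The three edit operations named in the abstract can be illustrated with a small sketch. This is not the paper's model: it simplifies the HTML tree to a single root-to-node tag path, and it uses uniform unit costs where the paper describes costs obtained with machine learning techniques. The function name `min_edit_cost`, the cost callables, and the example tag paths are all hypothetical.

```python
from typing import Callable, Sequence


def min_edit_cost(
    old: Sequence[str],
    new: Sequence[str],
    ins_cost: Callable[[str], float] = lambda label: 1.0,
    del_cost: Callable[[str], float] = lambda label: 1.0,
    sub_cost: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
) -> float:
    """Minimum total cost of turning the label sequence `old` into `new`.

    Allowed operations: insert a node, delete a node, substitute a label.
    The default unit costs are placeholders (an assumption of this sketch).
    """
    rows, cols = len(old) + 1, len(new) + 1
    # cost[i][j] = cheapest way to turn old[:i] into new[:j]
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = cost[i - 1][0] + del_cost(old[i - 1])
    for j in range(1, cols):
        cost[0][j] = cost[0][j - 1] + ins_cost(new[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            cost[i][j] = min(
                cost[i - 1][j] + del_cost(old[i - 1]),       # delete a node
                cost[i][j - 1] + ins_cost(new[j - 1]),       # insert a node
                cost[i - 1][j - 1] + sub_cost(old[i - 1], new[j - 1]),  # relabel
            )
    return cost[-1][-1]


# Hypothetical tag paths to a target field before and after a redesign.
old_path = ["html", "body", "div", "table", "tr", "td"]
new_path = ["html", "body", "div", "div", "ul", "li"]
print(min_edit_cost(old_path, new_path))  # 3.0 (three label substitutions)
```

Passing different cost callables is one way to express that some changes are more likely than others; how such costs would be learned is outside the scope of this sketch.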

Cited By

Patnaik, S., Babu, C. (2022). A Web Information Extraction Framework with Adaptive and Failure Prediction Feature. Journal of Data and Information Quality 14(2), 1-21. Online publication date: 23-Mar-2022.
https://dl.acm.org/doi/10.1145/3495008

Index Terms

Research on Adaptive Wrapper in Deep Web Data Extraction
1. Information systems
  1. Information retrieval

Published In

IOV 2015: Proceedings of the Second International Conference on Internet of Vehicles - Safe and Intelligent Mobility - Volume 9502

December 2015

464 pages

ISBN:9783319272924

Editors:
Ching-Hsien Hsu, Department of Computer Science, Chung Hua University, Hsinchu, Taiwan
Feng Xia, School of Software, Dalian University of Technology, Dalian, China
Xingang Liu, School of Electronic Engineering, University of Electronic Science and Tec, Chengdu, China
Shangguang Wang, Network Technology Institute, Beijing University of Posts & Telec, Beijing, China

Publisher

Springer-Verlag

Berlin, Heidelberg
