DOI: 10.1007/978-3-319-27293-1_36
Article

Research on Adaptive Wrapper in Deep Web Data Extraction

Published: 19 December 2015

Abstract

With the rapid development of Internet technology, the Deep Web holds vast amounts of data and, as the Web grows, has become a huge data source. On script-generated websites, many documents share a common HTML tree structure, which allows users to extract the information they are interested in from deep webpages effectively by means of wrappers. However, because the tree structure evolves over time, wrappers break frequently and need to be re-learned. In this paper, we explore the problem of constructing adaptive wrappers for deep webpages. To keep web extraction robust when webpages change, we propose a minimum cost script edit model based on machine learning techniques. The method considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting nodes' labels. We first obtain an extraction model for the 51job site, then randomly sample pages from the zhaopin site and use this extraction model to train the new wrapper. The resulting wrapper is also highly versatile, which makes adaptive extraction possible. Experimental results show that the proposed approach improves the extraction accuracy of target data and effectively addresses the adaptive wrapper problem for massive Deep Web data.
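
The three edit operations named in the abstract can be illustrated with a small sketch. It is not the paper's implementation. It assumes that a wrapper target is represented by the sequence of tag labels on its root-to-node path (a simplification of the full HTML tree) and that every operation has a fixed unit cost, whereas the paper describes a model based on machine learning techniques. The function name `min_cost_edit_script`, its parameters, and the sample paths are hypothetical.

```python
from typing import List, Tuple

Edit = Tuple[str, str, str]  # (operation, old label, new label)


def min_cost_edit_script(
    old: List[str],
    new: List[str],
    ins_cost: float = 1.0,  # assumed unit cost, not taken from the paper
    del_cost: float = 1.0,  # assumed unit cost, not taken from the paper
    sub_cost: float = 1.0,  # assumed unit cost, not taken from the paper
) -> Tuple[float, List[Edit]]:
    """Return the minimum total cost and one cheapest edit script
    that turns the label path `old` into the label path `new`."""
    n, m = len(old), len(new)
    # cost[i][j] = cheapest way to turn old[:i] into new[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + del_cost
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1]
            if old[i - 1] != new[j - 1]:
                diag += sub_cost
            cost[i][j] = min(
                cost[i - 1][j] + del_cost,  # delete old[i-1]
                cost[i][j - 1] + ins_cost,  # insert new[j-1]
                diag,                       # keep or substitute
            )

    # Walk back through the table to recover one optimal script.
    script: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        if both and old[i - 1] == new[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1  # labels match, no edit needed
        elif both and old[i - 1] != new[j - 1] and cost[i][j] == cost[i - 1][j - 1] + sub_cost:
            script.append(("substitute", old[i - 1], new[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + del_cost:
            script.append(("delete", old[i - 1], ""))
            i -= 1
        else:
            script.append(("insert", "", new[j - 1]))
            j -= 1
    script.reverse()
    return cost[n][m], script


# Hypothetical example: the page gained a wrapper <div> and the
# target cell changed from <td> to <span>.
old_path = ["html", "body", "div", "table", "tr", "td"]
new_path = ["html", "body", "div", "div", "table", "tr", "span"]
total, edits = min_cost_edit_script(old_path, new_path)
for op, before, after in edits:
    print(op, before, after)
print("total cost:", total)
```

Under these assumptions, a low total cost suggests that the new page is a small structural variation of the old one, so the old extraction path is a plausible starting point for the new wrapper. Replacing the fixed costs with learned ones would bring the sketch closer to the approach the abstract describes.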


Cited By

  • (2022) A Web Information Extraction Framework with Adaptive and Failure Prediction Feature. Journal of Data and Information Quality 14:2 (1-21). DOI: 10.1145/3495008. Online publication date: 23-Mar-2022

    Published In

    IOV 2015: Proceedings of the Second International Conference on Internet of Vehicles - Safe and Intelligent Mobility - Volume 9502
    December 2015
    464 pages
    ISBN: 9783319272924
    Editors: Ching-Hsien Hsu, Feng Xia, Xingang Liu, Shangguang Wang

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Author Tags

    1. Deep web data extraction
    2. Machine learning
    3. Minimum cost script
    4. Wrapper
