Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Published: 01 June 2008 Publication History

Abstract

Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies, attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue suffered by fixed-structured hierarchical models.
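
To make the idea of a joint exponential family distribution over structure and labels concrete, the following is a minimal toy sketch and not the DHMRF model itself. It assumes four page elements, two hypothetical labels ("name", "price"), made-up scores in `LEAF_SCORES`, and invented coupling weights. It also replaces the paper's variational method with brute-force enumeration, which is only feasible at this tiny scale. The structure variable is a set of record boundaries between adjacent elements.

```python
import itertools
import math

LABELS = ("name", "price")

# Hypothetical per-element label scores (log-potentials); not from the paper.
LEAF_SCORES = [
    {"name": 2.0, "price": 0.0},
    {"name": 0.0, "price": 2.0},
    {"name": 0.4, "price": 0.5},  # ambiguous element
    {"name": 0.0, "price": 2.0},
]
N = len(LEAF_SCORES)


def records(boundaries):
    """Group element indices into contiguous records.

    boundaries[i] == 1 places a record boundary between elements i and i + 1.
    """
    recs, cur = [], [0]
    for i, b in enumerate(boundaries):
        if b:
            recs.append(cur)
            cur = []
        cur.append(i + 1)
    recs.append(cur)
    return recs


def score(boundaries, labels):
    """Unnormalized log-probability: a weighted sum of features."""
    total = sum(LEAF_SCORES[i][labels[i]] for i in range(N))
    for rec in records(boundaries):
        rec_labels = [labels[i] for i in rec]
        # Assumed coupling feature: reward a record with one name and one price.
        if sorted(rec_labels) == ["name", "price"]:
            total += 1.5
        # Assumed coupling feature: penalize repeated labels inside a record.
        total -= 1.0 * (len(rec_labels) - len(set(rec_labels)))
    return total


def enumerate_joint():
    """Return every (structure, labels) configuration with its probability."""
    configs = [
        (b, y)
        for b in itertools.product((0, 1), repeat=N - 1)
        for y in itertools.product(LABELS, repeat=N)
    ]
    scores = [score(b, y) for b, y in configs]
    log_z = math.log(sum(math.exp(s) for s in scores))  # log partition function
    probs = [math.exp(s - log_z) for s in scores]
    return configs, probs


configs, probs = enumerate_joint()

# Joint MAP over structure and labels together.
map_boundaries, map_labels = max(zip(configs, probs), key=lambda cp: cp[1])[0]

# Decoupled baseline for the labels: pick the best label per element in isolation.
independent_labels = tuple(
    max(LABELS, key=lambda l: LEAF_SCORES[i][l]) for i in range(N)
)

print("joint MAP boundaries:", map_boundaries)
print("joint MAP labels:    ", map_labels)
print("independent labels:  ", independent_labels)
```

In this toy setting the ambiguous third element is labeled differently by the joint MAP and by the per-element baseline, because the record-level features feed back into the labeling. This only illustrates the general motivation for an integrated model under the stated assumptions.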


    Published In

    The Journal of Machine Learning Research, Volume 9
    6/1/2008
    1964 pages
    ISSN:1532-4435
    EISSN:1533-7928
    Publisher

    JMLR.org
