Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Authors:

Ji-Rong Wen

The Journal of Machine Learning Research, Volume 9

Pages 1583 - 1614

Published: 01 June 2008

Abstract

Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies, attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue suffered by fixed-structured hierarchical models.
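The idea of a joint exponential-family distribution over structure and labels can be illustrated with a toy sketch. This is not the DHMRF model or its variational inference: it enumerates every configuration by brute force, which is only feasible at toy sizes. The label set, the two features ("emit" and "record"), the weights, and all function names are assumptions made up for illustration. A structure here is simply a split of a leaf sequence into contiguous records.

```python
import itertools
import math

LABELS = ("name", "price")  # hypothetical attribute labels

def structures(n):
    """Yield every split of n leaves into contiguous records, as tuples of end indices."""
    for k in range(n):
        for cuts in itertools.combinations(range(1, n), k):
            yield cuts + (n,)

def score(structure, labels, cues, weights):
    """Unnormalized log-score: a weighted sum of (hypothetical) feature values."""
    total = 0.0
    # Label feature: the leaf's label agrees with its noisy local cue.
    for lab, cue in zip(labels, cues):
        total += weights["emit"] * (lab == cue)
    # Structure feature: a record is labeled exactly ("name", "price").
    start = 0
    for end in structure:
        total += weights["record"] * (labels[start:end] == ("name", "price"))
        start = end
    return total

def joint(cues, weights):
    """Exact joint distribution p(structure, labels | cues) by full enumeration."""
    n = len(cues)
    table = {}
    for s in structures(n):
        for y in itertools.product(LABELS, repeat=n):
            table[(s, y)] = math.exp(score(s, y, cues, weights))
    z = sum(table.values())  # partition function
    return {k: v / z for k, v in table.items()}

def map_assignment(dist):
    """Return the most probable (structure, labels) pair."""
    return max(dist, key=dist.get)

# The third leaf's cue is wrong ("price" instead of "name").
cues = ("name", "price", "price", "price")
dist = joint(cues, {"emit": 1.0, "record": 1.5})
best_structure, best_labels = map_assignment(dist)
```

With these assumed weights, labeling each leaf from its cue alone would keep the third leaf as "price", whereas the joint MAP is the structure (2, 4) with labels ("name", "price", "name", "price"): the record feature outweighs the single misleading cue. This mirrors, in miniature, why inferring records and attributes together can help both tasks.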

Information

Published In: The Journal of Machine Learning Research, Volume 9 (6/1/2008), 1964 pages
ISSN: 1532-4435
EISSN: 1533-7928
Publisher: JMLR.org

Cited By

Parthasarathy S, Pattanaik L, Khatry A, Iyer A, Radhakrishna A, Rajamani S, Raza M, Jhala R, Dillig I (2022). Landmarks and regions: a robust approach to data extraction. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 993-1009. Online publication date: 9-Jun-2022.
https://dl.acm.org/doi/10.1145/3519939.3523705
Bing L, Wong T, Lam W (2016). Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews. ACM Transactions on Internet Technology, 16(2), pp. 1-17. Online publication date: 15-Apr-2016.
https://dl.acm.org/doi/10.1145/2857054
Fujiwara Y, Shasha D (2015). Quiet. Proceedings of the 24th International Conference on Artificial Intelligence, pp. 3497-3503. Online publication date: 25-Jul-2015.
https://dl.acm.org/doi/10.5555/2832581.2832736
Zhang W, Ahmed A, Yang J, Josifovski V, Smola A, Cao L, Zhang C, Joachims T, Webb G, Margineantu D, Williams G (2015). Annotating Needles in the Haystack without Looking. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2257-2266. Online publication date: 10-Aug-2015.
https://dl.acm.org/doi/10.1145/2783258.2788580
(2014). Effective Active Learning Strategies for the Use of Large-Margin Classifiers in Semantic Annotation. INFORMS Journal on Computing, 26(3), pp. 461-483. Online publication date: 1-Aug-2014.
https://dl.acm.org/doi/10.5555/2700977.2700981
Wong T (2012). Learning to adapt cross language information extraction wrapper. Applied Intelligence, 36(4), pp. 918-931. Online publication date: 1-Jun-2012.
https://dl.acm.org/doi/10.1007/s10489-011-0305-0
Yu X, King I, Lyu M (2011). Towards a top-down and bottom-up bidirectional approach to joint information extraction. Proceedings of the 20th ACM international conference on Information and knowledge management, pp. 847-856. Online publication date: 24-Oct-2011.
https://dl.acm.org/doi/10.1145/2063576.2063699
Ding Y, Li Q, Dong Y, Peng Z (2010). 2D correlative-chain conditional random fields for semantic annotation of web objects. Journal of Computer Science and Technology, 25(4), pp. 761-770. Online publication date: 1-Jul-2010.
https://dl.acm.org/doi/10.5555/1946459.1946469
Zhu J, Xing E (2009). Maximum Entropy Discrimination Markov Networks. The Journal of Machine Learning Research, 10, pp. 2531-2569. Online publication date: 1-Dec-2009.
https://dl.acm.org/doi/10.5555/1577069.1755871
Zhu J, Xing E, Zhang B, Elder J, Fogelman F, Flach P, Zaki M (2009). Primal sparse Max-margin Markov networks. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1047-1056. Online publication date: 28-Jun-2009.
https://dl.acm.org/doi/10.1145/1557019.1557132
