Research of Improved Tree Path Model in Web Page Clustering

Computer Science, 2015, Vol. 42, Issue (5): 109-113. doi: 10.11896/j.issn.1002-137X.2015.05.022

• 2014 Data Mining Conference •

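WANG Ya-pu, WANG Zhi-jian and YE Feng

College of Computer and Information, Hohai University, Nanjing 211100
College of Computer and Information, Hohai University, Nanjing 211100; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016
College of Computer and Information, Hohai University, Nanjing 211100; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016

Online: 2018-11-14 Published: 2018-11-14
Funding:
This work was supported by the Jiangsu Water Conservancy Science and Technology Project "Research on the 'Smart River' and Its Application in the Management of the Chuhe River in Luhe" (2013025) and by the Fundamental Research Funds for the Central Universities at Hohai University (2009B21614).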

Abstract

Similarity computation is the foundation of text mining and a key step in information extraction. For Web pages with complex structure, similarity measures based on the traditional tree path model are not accurate enough. The traditional model ignores the order in which paths occur and compares paths only by exact matching, so it cannot describe how similar two paths are when they match only partially. Starting from the structural similarity of Web pages, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, the position of each path, and path weights, which compensates for the traditional model's inability to express document structure and hierarchical information. Experimental results show that the model improves the ability to recognize structural similarity between Web pages: it separates pages whose structures differ substantially, reflects the differences among pages generated from the same template, and gives better results in Web page clustering.

Key words: Information extraction, Web page structure, Similarity, Tree path model, Clustering
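The limitation described in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: `bag_similarity` stands in for an exact-match, order-blind path comparison, `path_similarity` shows one possible partial-match score (shared prefix length), and `order_similarity` uses Python's `difflib.SequenceMatcher` as a stand-in for an order-aware comparison. All function names are hypothetical, and none of this is the authors' improved model.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags of the currently open elements
        self.paths = []   # one "a/b/c" string per element encountered

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))
        if tag in VOID_TAGS:
            self.stack.pop()

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: unwind to the nearest matching open tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the list of tag paths for an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def bag_similarity(paths_a, paths_b):
    """Exact-match, order-blind baseline: Jaccard overlap of the path sets."""
    set_a, set_b = set(paths_a), set(paths_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def path_similarity(path_p, path_q):
    """Assumed partial-match score: shared prefix length over the longer path."""
    steps_p, steps_q = path_p.split("/"), path_q.split("/")
    common = 0
    for step_p, step_q in zip(steps_p, steps_q):
        if step_p != step_q:
            break
        common += 1
    return common / max(len(steps_p), len(steps_q))


def order_similarity(paths_a, paths_b):
    """Assumed order-aware score: difflib ratio over the two path sequences."""
    return SequenceMatcher(None, paths_a, paths_b, autojunk=False).ratio()


# Two pages with the same elements, but the sibling blocks are swapped.
page_a = "<html><body><div><p>x</p><p>y</p></div><ul><li>1</li></ul></body></html>"
page_b = "<html><body><ul><li>1</li></ul><div><p>x</p><p>y</p></div></body></html>"

print(bag_similarity(tag_paths(page_a), tag_paths(page_b)))
print(order_similarity(tag_paths(page_a), tag_paths(page_b)))
print(path_similarity("html/body/div/p", "html/body/div/span"))
```

In this sketch the set-based score cannot tell the two pages apart because they contain the same paths, while the sequence-based score drops once sibling order changes. The prefix score gives partial credit to two paths that diverge only at the last step, which exact matching would treat as completely different.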

WANG Ya-pu, WANG Zhi-jian and YE Feng. Research of Improved Tree Path Model in Web Page Clustering[J]. Computer Science, 2015, 42(5): 109-113. https://doi.org/10.11896/j.issn.1002-137X.2015.05.022

