Computer Science (计算机科学), 2015, Vol. 42, Issue (5): 109-113. doi: 10.11896/j.issn.1002-137X.2015.05.022
王亚普,王志坚,叶 枫
WANG Ya-pu, WANG Zhi-jian and YE Feng
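Abstract: Similarity computation is the foundation of text mining and a key step in information extraction. For web pages with complex structure, current similarity measures based on the traditional tree path model are still not accurate enough. The traditional tree path model ignores the order in which paths occur and compares paths by exact matching, so it cannot precisely describe how similar two paths are when they match only partially. Starting from the structural similarity of web pages, this paper therefore proposes an improved tree path model. The model takes into account the relationships between sibling nodes, path position, and path weight, compensating for the traditional model's inability to express document structure and hierarchy information. Experimental results show that the model improves the ability to recognize structural similarity between web pages: it distinguishes pages with large structural differences well, better reflects the differences among pages generated from the same template, and yields better results in web page clustering.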
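The limitation of the traditional model described in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation: the `(tag, children)` tree encoding, the function names, and the use of a Jaccard ratio over path multisets are all assumptions made for illustration. The sketch only shows the baseline behaviour (exact path matching, sibling order ignored), not the improved model.

```python
from collections import Counter

def root_paths(node, prefix=()):
    """Yield the root-to-node tag path for every node of a (tag, children) tree."""
    tag, children = node
    path = prefix + (tag,)
    yield path
    for child in children:
        yield from root_paths(child, path)

def bag_of_paths_similarity(tree_a, tree_b):
    """Jaccard ratio over path multisets: exact matches only, order ignored."""
    bag_a = Counter(root_paths(tree_a))
    bag_b = Counter(root_paths(tree_b))
    shared = sum((bag_a & bag_b).values())  # multiset intersection
    total = sum((bag_a | bag_b).values())   # multiset union
    return shared / total if total else 1.0

# Hypothetical pages: page_b is page_a with every sibling order reversed.
page_a = ("html", [("body", [("div", [("h1", []), ("p", [])]),
                             ("ul", [("li", [])])])])
page_b = ("html", [("body", [("ul", [("li", [])]),
                             ("div", [("p", []), ("h1", [])])])])
page_c = ("html", [("body", [("table", [("tr", [("td", [])])])])])

print(bag_of_paths_similarity(page_a, page_b))  # reordered siblings score as identical
print(bag_of_paths_similarity(page_a, page_c))  # only html and html/body are shared
```

Because the reordered page scores the same as an identical one, a measure of this kind cannot reflect layout differences between pages from the same template, which is the gap the sibling, position, and weight information in the improved model is meant to close.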