Abstract
With the rapid growth of social networks, the Internet has become the most widely used source of information. However, it contains a large number of duplicated web pages, most of which result from reprinting. Broder et al. ran an experiment on a collection of 30,000,000 HTML and text documents and found that nearly 18 % of the pages were exactly the same and 41 % of the pages shared 51 % similarity. These replicated pages place a major burden on search engines and badly degrade their performance, so the elimination of duplicated web pages has become a hot topic in information retrieval in recent years. In this paper, we propose a function word (FW) based approach that uses the concept of the Bloom filter (BF) to eliminate duplicated web pages without extracting the main text of the page. Our approach has three stages. In stage 1, sample text is extracted according to the function-word features of the web page. In stage 2, the feature code is extracted using function words. In stage 3, duplicated web pages are eliminated by calculating the similarity of their Bloom filters.
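The third stage can be illustrated with a minimal sketch. The abstract does not state the filter size, the hash functions, or the similarity measure, so everything concrete here is an assumption: a 1024-bit filter, 4 hash positions derived from salted SHA-256, and Jaccard similarity over the set bits. The names `BloomFilter`, `page_filter`, and `bloom_similarity` are hypothetical, and the feature codes are taken as plain strings because their actual form is not described.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; the sizes are illustrative, not from the paper."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def page_filter(feature_codes, **params):
    """Build one filter per page from its (hypothetical) feature codes."""
    bf = BloomFilter(**params)
    for code in feature_codes:
        bf.add(code)
    return bf


def bloom_similarity(a, b):
    """Jaccard similarity of set bits (an assumed measure): |A & B| / |A | B|."""
    if a.num_bits != b.num_bits or a.num_hashes != b.num_hashes:
        raise ValueError("filters must share num_bits and num_hashes")
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 1.0  # two empty filters are treated as identical
    return bin(a.bits & b.bits).count("1") / union
```

Two pages would then be treated as duplicates when `bloom_similarity(page_filter(codes_a), page_filter(codes_b))` exceeds a chosen threshold. The threshold is a tuning parameter that this sketch leaves open.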
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Ma, L., Xia, Z. (2016). An FW-BF Based Approach on Elimination of Duplicated Web Pages. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2016. IDEAL 2016. Lecture Notes in Computer Science, vol 9937. Springer, Cham. https://doi.org/10.1007/978-3-319-46257-8_20
DOI: https://doi.org/10.1007/978-3-319-46257-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46256-1
Online ISBN: 978-3-319-46257-8
eBook Packages: Computer Science (R0)