An FW-BF Based Approach on Elimination of Duplicated Web Pages

Leiming Ma & Zhengyou Xia

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9937)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

With the rapid growth of social networks, the Internet has become the most widely used information source. However, it contains a large number of duplicated web pages, most of which are reprints. Broder et al. ran an experiment on a collection of 30,000,000 HTML and text documents and found that nearly 18% of the pages were exactly the same and 41% of the pages shared 51% similarity. These replicas place a major burden on search engines and badly degrade their performance, so the elimination of duplicated web pages has become an active topic in information retrieval in recent years. In this paper, we propose a function word (FW) based approach that uses a Bloom filter (BF) to eliminate duplicated web pages without extracting the main text of the page. The approach has three stages. Stage 1 extracts sample text according to the function-word features of the web page. Stage 2 extracts a feature code using function words. Stage 3 eliminates duplicated web pages by calculating the similarity of their Bloom filters.
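The three stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the function-word list, the feature-code construction, the Bloom filter parameters, or the similarity measure. Everything below is therefore an assumption made for illustration: the English `FUNCTION_WORDS` set, the "function word plus following token" feature, the 1024-bit filter with 3 salted SHA-256 hashes, the Jaccard similarity over set bits, and the 0.8 threshold.

```python
import hashlib

# Hypothetical function-word list; the paper's actual list is not given in the abstract.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is"}


def extract_features(text):
    """Stages 1-2 (assumed form): pair each function word with the token after it."""
    tokens = text.lower().split()
    return {
        f"{tok} {nxt}"
        for tok, nxt in zip(tokens, tokens[1:])
        if tok in FUNCTION_WORDS
    }


def bloom_filter(features, num_bits=1024, num_hashes=3):
    """Encode a feature set as a Bloom filter stored in a Python int."""
    bits = 0
    for feature in features:
        for seed in range(num_hashes):
            # Salting the input with the seed simulates independent hash functions.
            digest = hashlib.sha256(f"{seed}:{feature}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % num_bits)
    return bits


def bloom_similarity(a, b):
    """Stage 3 (assumed measure): Jaccard similarity over the set bits of two filters."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # no features on either page, so no evidence of duplication
    return bin(a & b).count("1") / union


def is_duplicate(text_a, text_b, threshold=0.8):
    """Flag two pages as duplicates when their filters are similar enough."""
    filter_a = bloom_filter(extract_features(text_a))
    filter_b = bloom_filter(extract_features(text_b))
    return bloom_similarity(filter_a, filter_b) >= threshold
```

Comparing fixed-size filters instead of full feature sets keeps the per-page footprint constant, at the cost of occasional false matches from hash collisions. A larger `num_bits` lowers that risk.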

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Leiming Ma & Zhengyou Xia

Corresponding author

Correspondence to Zhengyou Xia.

Editor information

Editors and Affiliations

University of Manchester, Manchester, United Kingdom
Hujun Yin
Nanjing University, Nanjing, China
Yang Gao
Yangzhou University, Yangzhou, Jiangsu, China
Bin Li
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Daoqiang Zhang
Nanjing Normal University, Nanjing, China
Ming Yang
Yangzhou University, Yangzhou, Jiangsu, China
Yun Li
Ostfalia University of Applied Sciences, Wolfenbüttel, Germany
Frank Klawonn
University of Seville, Seville, Spain
Antonio J. Tallón-Ballesteros

About this paper

Cite this paper

Ma, L., Xia, Z. (2016). An FW-BF Based Approach on Elimination of Duplicated Web Pages. In: Yin, H., et al. Intelligent Data Engineering and Automated Learning – IDEAL 2016. IDEAL 2016. Lecture Notes in Computer Science, vol 9937. Springer, Cham. https://doi.org/10.1007/978-3-319-46257-8_20

DOI: https://doi.org/10.1007/978-3-319-46257-8_20
Published: 13 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46256-1
Online ISBN: 978-3-319-46257-8
eBook Packages: Computer Science, Computer Science (R0)
