DOI:10.1145/1135777.1135901

Detecting semantic cloaking on the web

Published: 23 May 2006

Abstract

By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
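The two-step structure described in the abstract (a filter that produces candidates, then a classifier that decides among them) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say which features or which classifier are used, so the term-set comparison, the `threshold` rule standing in for a trained classifier, and all function names here are assumptions. The last function shows how precision and recall would be computed against manually labeled data.

```python
import re


def terms(page_text):
    """Lowercased word set of a page copy (a crude, assumed form of term extraction)."""
    return set(re.findall(r"[a-z0-9]+", page_text.lower()))


def is_candidate(crawler_copy, browser_copy):
    """Step 1 (filter): keep pages whose crawler copy has terms the browser copy lacks."""
    return bool(terms(crawler_copy) - terms(browser_copy))


def classify(crawler_copy, browser_copy, threshold=5):
    """Step 2 (stand-in classifier): flag a page when many terms are crawler-only.

    A trained classifier would replace this hypothetical threshold rule.
    """
    return len(terms(crawler_copy) - terms(browser_copy)) >= threshold


def detect(pages, threshold=5):
    """Run the filter, then the classifier. pages maps URL -> (crawler_copy, browser_copy)."""
    candidates = {url: copies for url, copies in pages.items() if is_candidate(*copies)}
    return {url for url, (cr, br) in candidates.items() if classify(cr, br, threshold)}


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a manually labeled set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

The filter is deliberately cheap so that the more expensive second step only runs on pages that already show some crawler/browser difference; in this sketch both steps happen to use the same term-set difference, which a real system would not be limited to.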



    Published In

    WWW '06: Proceedings of the 15th international conference on World Wide Web
    May 2006
    1102 pages
    ISBN:1595933239
    DOI:10.1145/1135777


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. spam
    2. web search engine


    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
