DOI: 10.1145/1531914.1531922

Linked latent Dirichlet allocation in web spam filtering

Published: 21 April 2009

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be used for classification as a form of dimensionality reduction, similarly to latent semantic indexing. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and 8% over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.
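The abstract reports results in terms of AUC and mentions a log-odds based combination of classifiers without spelling out the scheme. The following is a minimal sketch of one plausible reading, assuming an unweighted average of per-classifier log-odds; the function names (`log_odds`, `combine_log_odds`, `auc`) and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def log_odds(p, eps=1e-6):
    """Map a probability to its log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def combine_log_odds(score_lists):
    """Average the log-odds of several classifiers' spam probabilities.

    score_lists holds one list of probabilities per classifier, all in
    the same host order. Unweighted averaging is an assumption here.
    """
    n = len(score_lists)
    return [sum(log_odds(p) for p in scores) / n
            for scores in zip(*score_lists)]


def auc(labels, scores):
    """Probability that a random spam host is scored above a random
    non-spam host, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration: 1 = spam, 0 = non-spam.
labels = [1, 1, 0, 0, 1, 0]
link_scores = [0.9, 0.4, 0.5, 0.2, 0.6, 0.3]      # hypothetical link-based classifier
content_scores = [0.7, 0.8, 0.3, 0.6, 0.4, 0.1]   # hypothetical content-based classifier

combined = combine_log_odds([link_scores, content_scores])

print("link AUC:    ", auc(labels, link_scores))
print("content AUC: ", auc(labels, content_scores))
print("combined AUC:", auc(labels, combined))
```

On this toy data each classifier misranks one spam/non-spam pair on its own, while the combined score ranks every spam host above every non-spam host. This only illustrates why combining in log-odds space can help; it says nothing about the size of the gains reported in the paper.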

References

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[2] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. Manuscript, 2008.
[3] I. Bíró, J. Szabó, and A. A. Benczúr. Very Large Scale Link Based Latent Dirichlet Allocation for Web Document Classification. Manuscript, http://www.ilab.sztaki.hu/~ibiro/linkedLDA/, 2009.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003.
[5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673–2698, 2006.
[6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, 2007.
[7] D. Cohn and T. Hofmann. The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, pages 430–436, 2001.
[8] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.
[9] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, pages 233–240. ACM Press, New York, NY, USA, 2007.
[10] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications, 2004.
[11] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004.
[12] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[13] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228–5235, 2004.
[14] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[15] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
[16] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, 2002.
[17] T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1):177–196, 2001.
[18] Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in Markov random fields. In SDM 07, 2007.
[19] T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In Proceedings of the 29th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 123–130, 2006.
[20] R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint Latent Topic Models for Text and Citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, USA, 2008.
[21] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006.
[22] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
[23] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[24] X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 17:1641–1648, 2005.


Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN: 9781605584386
DOI: 10.1145/1531914
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document classification
  2. feature selection
  3. information retrieval
  4. latent Dirichlet allocation
  5. text analysis
  6. web content spam

Qualifiers

  • Research-article

Conference

AIRWeb '09


