Abstract
A growing number of Internet users search for information on Community Question and Answer (CQA) websites, where interactive communication gives users a rare feeling of trust. End users often look for instant help when they browse CQA websites for the best answers, so they should be warned of any commercial campaigns hidden behind those answers. Existing research focuses on answer quality and does not meet this need. Textual similarity between questions and answers is widely used in previous research, but this feature is no longer effective against paid commercial posters. More context information, such as writing templates and a user’s reputation track, must be combined into a new model to detect potential campaign answers. In this paper, we develop a system that automatically analyzes the hidden patterns of commercial spam and alerts end users as soon as a potential commercial campaign is detected. Our detection method integrates semantic analysis with posters’ track records and exploits features specific to CQA websites that differ substantially from those of other forums such as microblogs or news reports. Our system is adaptive and accommodates new evidence uncovered by the detection algorithms over time. Validated with real-world trace data collected from a popular Chinese CQA website over a period of three months, our system shows great potential for adaptive detection of CQA spam.
Additional information
This research was partially supported by the Natural Sciences and Engineering Research Council of Canada under Grant No. 195819339 and the Globalink Internship of Mathematics of Information Technology and Complex Systems (MITACS) of Canada.
Cite this article
Chen, C., Wu, K., Srinivasan, V. et al. The Best Answers? Think Twice: Identifying Commercial Campaigns in the CQA Forums. J. Comput. Sci. Technol. 30, 810–828 (2015). https://doi.org/10.1007/s11390-015-1562-x