Abstract
In this paper, we employ machine learning methods to address the problem of countering terrorism and extremism by using information from the Internet. This problem involves retrieving electronic messages, documents, and web resources that potentially contain information of a terrorist or extremist nature, identifying the structure of user groups and online communities that disseminate this information, monitoring and modeling information flows in these communities, and assessing threats and predicting risks based on the monitoring results. We propose original language-independent algorithms for pattern-based information retrieval, thematic modeling, and prediction of message flow characteristics. We also propose algorithms for assessing and predicting the potential risk posed by members of online communities, using data on the structure of relations in these communities. This makes it possible to detect potentially dangerous users even without full access to the content they distribute, e.g., through private channels and chat rooms.
ACKNOWLEDGMENTS
This work was supported by the Russian Foundation for Basic Research, project no. 16-29-09555 ofi_m.
Additional information
Translated by Yu. Kornienko
About this article
Cite this article
Mashechkin, I.V., Petrovskiy, M.I., Tsarev, D.V. et al. Machine Learning Methods for Detecting and Monitoring Extremist Information on the Internet. Program Comput Soft 45, 99–115 (2019). https://doi.org/10.1134/S0361768819030058
DOI: https://doi.org/10.1134/S0361768819030058