Abstract
With the development of generative models like GPT-3, it is increasingly more challenging to differentiate generated texts from human-written ones. There is a large number of studies that have demonstrated good results in bot identification. However, the majority of such works depend on supervised learning methods that require labelled data and/or prior knowledge about the bot-model architecture. In this work, we propose a bot identification algorithm that is based on unsupervised learning techniques and does not depend on a large amount of labelled data. By combining findings in semantic analysis by clustering (crisp and fuzzy) and information techniques, we construct a robust model that detects a generated text for different types of bot. We find that the generated texts tend to be more chaotic while literary works are more complex. We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Each algorithm has its advantages—K-Means separates spherical clusters well, whereas Wishart algorithm does not make any assumptions about cluster shapes.
References
Bellegarda, J.R.: Latent semantic mapping: principles & applications. Synthesis Lect. Speech Audio Process. 3(1), 1–101 (2007)
Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy C-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)
Cardaioli, M., Conti, M., Di Sorbo, A., Fabrizio, E., Laudanna, S., Visaggio, C.A.: It’s a matter of style: detecting social bots through writing style consistency. In: 2021 International Conference on Computer Communications and Networks (ICCCN), pp. 1–9. IEEE (2021)
Chakraborty, M., Das, S., Mamidi, R.: Detection of fake users in twitter using network representation and NLP. In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 754–758. IEEE (2022)
Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of twitter accounts: are you a human, bot, or cyborg? IEEE Trans. Dependable Secure Comput. 9(6), 811–824 (2012)
Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on twitter: are humans more opinionated than bots? In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 620–627. IEEE (2014)
Gromov, V.A., Migrina, A.M.: A language as a self-organized critical system. Complexity 2017 (2017)
Heidari, M., James Jr, H., Uzuner, O.: An empirical study of machine learning algorithms for social media bot detection. In: 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5. IEEE (2021)
Kang, A.R., Kim, H.K., Woo, J.: Chatting pattern based game bot detection: do they talk like us? KSII Trans. Internet Inf. Syst. (TIIS) 6(11), 2866–2879 (2012)
Kostenetskiy, P., Chulkevich, R., Kozyrev, V.: HPC resources of the higher school of economics. In: Journal of Physics: Conference Series, vol. 1740, p. 012050. IOP Publishing (2021)
MacQueen, J.: Classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Novák, V., Perfilieva, I., Mockor, J.: Mathematical Principles of Fuzzy Logic, vol. 517. Springer, Heidelberg (2012)
Rosso, O.A., Larrondo, H., Martin, M.T., Plastino, A., Fuentes, M.A.: Distinguishing noise from chaos. Phys. Rev. Lett. 99(15), 154102 (2007)
Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics, pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16
Wishart, D.: Numerical classification method for deriving natural classes. Nature 221(5175), 97–98 (1969)
Xiong, H., Li, Z.: Clustering validation measures. In: Data Clustering, pp. 571–606. Chapman and Hall/CRC (2018)
Acknowledgements
This research was supported in part through computational resources of HPC facilities at HSE University [10]. The authors would also like to thank the HSE AI Center for the support throughout the research process.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gromov, V., Dang, Q.N. (2023). Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques. In: Maji, P., Huang, T., Pal, N.R., Chaudhury, S., De, R.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2023. Lecture Notes in Computer Science, vol 14301. Springer, Cham. https://doi.org/10.1007/978-3-031-45170-6_3
Download citation
DOI: https://doi.org/10.1007/978-3-031-45170-6_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45169-0
Online ISBN: 978-3-031-45170-6
eBook Packages: Computer ScienceComputer Science (R0)