Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14301)

Included in the following conference series:

International Conference on Pattern Recognition and Machine Intelligence


Abstract

With the development of generative models like GPT-3, it is increasingly challenging to differentiate generated texts from human-written ones. A large number of studies have demonstrated good results in bot identification. However, the majority of such works depend on supervised learning methods that require labelled data and/or prior knowledge about the bot-model architecture. In this work, we propose a bot identification algorithm that is based on unsupervised learning techniques and does not depend on a large amount of labelled data. By combining findings from semantic analysis by clustering (crisp and fuzzy) with information-theoretic techniques, we construct a robust model that detects generated text for different types of bots. We find that the generated texts tend to be more chaotic, while literary works are more complex. We also demonstrate that the clustering of human texts results in fuzzier clusters in comparison to the more compact and well-separated clusters of bot-generated texts.
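
As a rough illustration of what "fuzzier" versus "compact and well-separated" clusters can mean numerically, the sketch below runs a plain fuzzy C-means (Bezdek et al., listed in the references) on synthetic 2-D points and reports the partition coefficient of the resulting memberships. It is not the authors' pipeline: the synthetic data, the helper names (`fuzzy_c_means`, `partition_coefficient`, `two_blobs`) and all parameter values are assumptions chosen only for illustration, and the choice of the partition coefficient as the fuzziness measure is likewise an assumption.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=200, seed=0):
    """Plain fuzzy C-means; returns (centers, membership matrix U)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = (dist + 1e-12) ** (-2.0 / (m - 1.0))
        # Standard membership update, normalised per point.
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U


def partition_coefficient(U):
    """Mean of squared memberships: 1 means crisp, 1/c means maximally fuzzy."""
    return float(np.mean(np.sum(U ** 2, axis=1)))


def two_blobs(offset, spread, n=100, seed=1):
    """Hypothetical toy data: two Gaussian blobs in 2-D (not text embeddings)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, spread, size=(n, 2))
    b = rng.normal(offset, spread, size=(n, 2))
    return np.vstack([a, b])


_, U_sep = fuzzy_c_means(two_blobs(offset=10.0, spread=0.5), c=2)
_, U_ovl = fuzzy_c_means(two_blobs(offset=1.0, spread=1.0), c=2)

pc_separated = partition_coefficient(U_sep)
pc_overlapping = partition_coefficient(U_ovl)
print("well-separated blobs:", round(pc_separated, 3))
print("overlapping blobs:   ", round(pc_overlapping, 3))
```

A partition coefficient near 1 indicates that almost every point belongs to a single cluster, while a value closer to 1/c indicates memberships spread across clusters. In the terms of the abstract, bot-generated texts would be expected to sit nearer the first case and human texts nearer the second, although the actual features and validity measures used in the paper are not shown on this page.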

Notes

1. Each algorithm has its advantages: K-Means separates spherical clusters well, whereas the Wishart algorithm makes no assumptions about cluster shapes.

References

1. Bellegarda, J.R.: Latent semantic mapping: principles & applications. Synthesis Lect. Speech Audio Process. 3(1), 1–101 (2007)
2. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy C-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)
3. Cardaioli, M., Conti, M., Di Sorbo, A., Fabrizio, E., Laudanna, S., Visaggio, C.A.: It’s a matter of style: detecting social bots through writing style consistency. In: 2021 International Conference on Computer Communications and Networks (ICCCN), pp. 1–9. IEEE (2021)
4. Chakraborty, M., Das, S., Mamidi, R.: Detection of fake users in twitter using network representation and NLP. In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 754–758. IEEE (2022)
5. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of twitter accounts: are you a human, bot, or cyborg? IEEE Trans. Dependable Secure Comput. 9(6), 811–824 (2012)
6. Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on twitter: are humans more opinionated than bots? In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 620–627. IEEE (2014)
7. Gromov, V.A., Migrina, A.M.: A language as a self-organized critical system. Complexity 2017 (2017)
8. Heidari, M., James Jr, H., Uzuner, O.: An empirical study of machine learning algorithms for social media bot detection. In: 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5. IEEE (2021)
9. Kang, A.R., Kim, H.K., Woo, J.: Chatting pattern based game bot detection: do they talk like us? KSII Trans. Internet Inf. Syst. (TIIS) 6(11), 2866–2879 (2012)
10. Kostenetskiy, P., Chulkevich, R., Kozyrev, V.: HPC resources of the higher school of economics. In: Journal of Physics: Conference Series, vol. 1740, p. 012050. IOP Publishing (2021)
11. MacQueen, J.: Classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
13. Novák, V., Perfilieva, I., Mockor, J.: Mathematical Principles of Fuzzy Logic, vol. 517. Springer, Heidelberg (2012)
14. Rosso, O.A., Larrondo, H., Martin, M.T., Plastino, A., Fuentes, M.A.: Distinguishing noise from chaos. Phys. Rev. Lett. 99(15), 154102 (2007)
15. Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics, pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16
16. Wishart, D.: Numerical classification method for deriving natural classes. Nature 221(5175), 97–98 (1969)
17. Xiong, H., Li, Z.: Clustering validation measures. In: Data Clustering, pp. 571–606. Chapman and Hall/CRC (2018)


Acknowledgements

This research was supported in part through computational resources of HPC facilities at HSE University [10]. The authors would also like to thank the HSE AI Center for the support throughout the research process.

Author information

Authors and Affiliations

National Research University Higher School of Economics, Moscow, Russia
Vasilii Gromov & Quynh Nhu Dang

Authors

Vasilii Gromov
Quynh Nhu Dang

Corresponding author

Correspondence to Quynh Nhu Dang.

Editor information

Editors and Affiliations

Indian Statistical Institute, Kolkata, India
Pradipta Maji
Texas A&M University at Qatar, Doha, Qatar
Tingwen Huang
Indian Statistical Institute, Kolkata, West Bengal, India
Nikhil R. Pal
Indian Institute of Technology Jodhpur, Jodhpur, India
Santanu Chaudhury
Indian Statistical Institute, Kolkata, West Bengal, India
Rajat K. De

About this paper

Cite this paper

Gromov, V., Dang, Q.N. (2023). Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques. In: Maji, P., Huang, T., Pal, N.R., Chaudhury, S., De, R.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2023. Lecture Notes in Computer Science, vol 14301. Springer, Cham. https://doi.org/10.1007/978-3-031-45170-6_3

DOI: https://doi.org/10.1007/978-3-031-45170-6_3
Published: 04 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45169-0
Online ISBN: 978-3-031-45170-6
eBook Packages: Computer Science; Computer Science (R0)
