Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques

  • Conference paper
Pattern Recognition and Machine Intelligence (PReMI 2023)

Abstract

With the development of generative models like GPT-3, it is increasingly challenging to differentiate generated texts from human-written ones. A large number of studies have demonstrated good results in bot identification. However, most of these works depend on supervised learning methods that require labelled data and/or prior knowledge of the bot-model architecture. In this work, we propose a bot identification algorithm that is based on unsupervised learning techniques and does not depend on a large amount of labelled data. By combining findings from semantic analysis by clustering (crisp and fuzzy) with information-theoretic techniques, we construct a robust model that detects generated text for different types of bots. We find that generated texts tend to be more chaotic, while literary works are more complex. We also demonstrate that clustering human texts yields fuzzier clusters, whereas clustering bot-generated texts yields more compact and well-separated clusters.
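The contrast between "fuzzier" and "compact and well-separated" clusters can be made concrete with a small sketch. The code below is not the authors' pipeline: it runs a minimal fuzzy c-means (in the spirit of reference 2) on synthetic 2-D points and scores the result with the partition coefficient, which ranges from 1/c (maximally fuzzy) to 1 (crisp). The synthetic blobs, the choice of the partition coefficient as the fuzziness measure, and all function names are assumptions made for illustration only.

```python
import random

def fuzzy_c_means(points, c, m=2.0, iters=60, seed=0):
    """Minimal fuzzy c-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])
    centers = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u_ik ** m.
        centers = []
        for k in range(c):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)
            ))
        # Membership update: u_ik is proportional to (squared distance) ** (-1 / (m - 1)).
        for i in range(n):
            d2 = [
                sum((points[i][d] - centers[k][d]) ** 2 for d in range(dim)) + 1e-12
                for k in range(c)
            ]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d2]
            total = sum(inv)
            u[i] = [x / total for x in inv]
    return centers, u

def partition_coefficient(u):
    """Mean of squared memberships: 1/c for fully fuzzy, 1 for crisp partitions."""
    return sum(x * x for row in u for x in row) / len(u)

def two_blobs(sigma, n=60, seed=1):
    """Hypothetical stand-in data: two Gaussian blobs with a chosen spread."""
    rng = random.Random(seed)
    a = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n)]
    b = [(rng.gauss(5.0, sigma), rng.gauss(5.0, sigma)) for _ in range(n)]
    return a + b

for sigma in (0.1, 2.5):
    _, u = fuzzy_c_means(two_blobs(sigma), c=2)
    print(f"sigma={sigma}: partition coefficient = {partition_coefficient(u):.3f}")
```

In this toy setting the tight blobs give a partition coefficient close to 1 and the overlapping blobs give a lower value. Whether the same measure was used in the paper is not stated in the abstract.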

Notes

  1. Each algorithm has its advantages: K-Means separates spherical clusters well, whereas the Wishart algorithm makes no assumptions about cluster shapes.
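The K-Means half of this note can be illustrated with a toy experiment. The sketch below is an assumption-laden illustration, not code from the paper: it runs a plain Lloyd-style K-Means with a deterministic farthest-first initialisation on two synthetic data sets (two round blobs and two concentric rings) and reports cluster purity against the known labels. The Wishart algorithm is not implemented here, and the data, the purity score, and all function names are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations with farthest-first initialisation (a choice made for this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def purity(labels, truth):
    """Fraction of points that belong to the majority true class of their cluster."""
    total = 0
    for j in set(labels):
        classes = [t for lab, t in zip(labels, truth) if lab == j]
        total += max(classes.count(c) for c in set(classes))
    return total / len(labels)

def blobs(n, rng):
    """Two round, well-separated Gaussian blobs."""
    a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(n)]
    b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(n)]
    return a + b, [0] * n + [1] * n

def rings(n, rng):
    """Two concentric rings: clusters that are clearly not spherical."""
    pts, truth = [], []
    for label, radius in enumerate((1.0, 5.0)):
        for _ in range(n):
            angle = rng.uniform(0, 2 * math.pi)
            r = radius + rng.gauss(0, 0.05)
            pts.append((r * math.cos(angle), r * math.sin(angle)))
            truth.append(label)
    return pts, truth

rng = random.Random(0)
for name, make in (("blobs", blobs), ("rings", rings)):
    pts, truth = make(100, rng)
    print(name, "purity:", round(purity(kmeans(pts, 2), truth), 2))
```

With two centres, K-Means splits the plane along a straight line, so it cannot separate an inner ring from an outer one, while the round blobs are recovered cleanly. A density-based method that makes no shape assumption would be expected to handle the rings, which is the advantage the note attributes to the Wishart algorithm.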

References

  1. Bellegarda, J.R.: Latent semantic mapping: principles & applications. Synthesis Lect. Speech Audio Process. 3(1), 1–101 (2007)

  2. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy C-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)

  3. Cardaioli, M., Conti, M., Di Sorbo, A., Fabrizio, E., Laudanna, S., Visaggio, C.A.: It’s a matter of style: detecting social bots through writing style consistency. In: 2021 International Conference on Computer Communications and Networks (ICCCN), pp. 1–9. IEEE (2021)

  4. Chakraborty, M., Das, S., Mamidi, R.: Detection of fake users in Twitter using network representation and NLP. In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 754–758. IEEE (2022)

  5. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of Twitter accounts: are you a human, bot, or cyborg? IEEE Trans. Dependable Secure Comput. 9(6), 811–824 (2012)

  6. Dickerson, J.P., Kagan, V., Subrahmanian, V.: Using sentiment to detect bots on Twitter: are humans more opinionated than bots? In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 620–627. IEEE (2014)

  7. Gromov, V.A., Migrina, A.M.: A language as a self-organized critical system. Complexity 2017 (2017)

  8. Heidari, M., James Jr, H., Uzuner, O.: An empirical study of machine learning algorithms for social media bot detection. In: 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5. IEEE (2021)

  9. Kang, A.R., Kim, H.K., Woo, J.: Chatting pattern based game bot detection: do they talk like us? KSII Trans. Internet Inf. Syst. (TIIS) 6(11), 2866–2879 (2012)

  10. Kostenetskiy, P., Chulkevich, R., Kozyrev, V.: HPC resources of the Higher School of Economics. In: Journal of Physics: Conference Series, vol. 1740, p. 012050. IOP Publishing (2021)

  11. MacQueen, J.: Classification and analysis of multivariate observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)

  12. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  13. Novák, V., Perfilieva, I., Mockor, J.: Mathematical Principles of Fuzzy Logic, vol. 517. Springer, Heidelberg (2012)

  14. Rosso, O.A., Larrondo, H., Martin, M.T., Plastino, A., Fuentes, M.A.: Distinguishing noise from chaos. Phys. Rev. Lett. 99(15), 154102 (2007)

  15. Wilcoxon, F.: Individual comparisons by ranking methods. In: Kotz, S., Johnson, N.L. (eds.) Breakthroughs in Statistics. Springer Series in Statistics, pp. 196–202. Springer, New York (1992). https://doi.org/10.1007/978-1-4612-4380-9_16

  16. Wishart, D.: Numerical classification method for deriving natural classes. Nature 221(5175), 97–98 (1969)

  17. Xiong, H., Li, Z.: Clustering validation measures. In: Data Clustering, pp. 571–606. Chapman and Hall/CRC (2018)

Acknowledgements

This research was supported in part through computational resources of HPC facilities at HSE University [10]. The authors would also like to thank the HSE AI Center for the support throughout the research process.

Author information

Corresponding author

Correspondence to Quynh Nhu Dang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gromov, V., Dang, Q.N. (2023). Spot the Bot: Distinguishing Human-Written and Bot-Generated Texts Using Clustering and Information Theory Techniques. In: Maji, P., Huang, T., Pal, N.R., Chaudhury, S., De, R.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2023. Lecture Notes in Computer Science, vol 14301. Springer, Cham. https://doi.org/10.1007/978-3-031-45170-6_3

  • DOI: https://doi.org/10.1007/978-3-031-45170-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45169-0

  • Online ISBN: 978-3-031-45170-6

  • eBook Packages: Computer Science, Computer Science (R0)
