The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation

Georgi Karadzhov,
Tsvetomila Mihaylova,
Yasen Kiprov,
Georgi Georgiev,
Ivan Koychev &
Preslav Nakov

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10456)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

Users posting online expect to remain anonymous unless they have logged in, and this anonymity is often what allows them to discuss various topics freely. Preserving the anonymity of a text’s writer can also be important in other contexts, e.g., in witness protection or anonymity programs. However, each person has his/her own style of writing, which can be analyzed using stylometry; as a result, the true identity of the author of a piece of text can be revealed even if s/he has tried to hide it. Thus, it could be helpful to design automatic tools that help a person obfuscate his/her identity when writing text. In particular, here we propose an approach that changes the text so that it is pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative. The approach consists of three main steps: first, we calculate the values for some popular stylometric metrics that can indicate authorship; then, we apply various transformations to the text so that these metrics are adjusted towards the average level, while preserving the semantics and the soundness of the text; and finally, we add random noise. This approach turned out to be very efficient, and yielded the best performance on the Author Obfuscation task at the PAN-2016 competition.
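The three steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which metrics, transformations, or noise are used, so the single metric (average sentence length), the target value, the two naive transformations, and the substitution table below are all assumptions chosen for illustration.

```python
import random
import re

# Hypothetical target; the abstract only says metrics are pushed towards "average values".
TARGET_AVG_SENTENCE_LEN = 12.0

# Hypothetical substitution table, used only to illustrate the noise step.
NOISE_SWAPS = {"big": "large", "quick": "fast", "start": "begin"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def avg_sentence_length(text):
    """Step 1: compute one stylometric metric, the mean words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def shorten(text):
    """Split each sentence at its first ', and ' to lower the average length."""
    out = []
    for s in split_sentences(text):
        head, sep, tail = s.partition(", and ")
        if sep and tail:
            out.append(head + ".")
            out.append(tail[0].upper() + tail[1:])
        else:
            out.append(s)
    return " ".join(out)


def lengthen(text):
    """Join consecutive sentence pairs with ', and ' to raise the average length.

    Lowercasing the second sentence's first letter is naive (it would break
    proper nouns and the pronoun "I"); a real system would need to handle that.
    """
    sents = split_sentences(text)
    out = []
    for i in range(0, len(sents), 2):
        pair = sents[i:i + 2]
        if len(pair) == 2 and pair[0].endswith("."):
            second = pair[1][0].lower() + pair[1][1:]
            out.append(pair[0][:-1] + ", and " + second)
        else:
            out.extend(pair)
    return " ".join(out)


def push_towards_average(text, target=TARGET_AVG_SENTENCE_LEN):
    """Step 2: transform the text, keeping the result only if the metric moves closer to target."""
    current = avg_sentence_length(text)
    candidate = shorten(text) if current > target else lengthen(text)
    if abs(avg_sentence_length(candidate) - target) < abs(current - target):
        return candidate
    return text


def add_noise(text, rng, rate=0.5):
    """Step 3: randomly replace some words using the hypothetical table."""
    words = []
    for w in text.split():
        if w in NOISE_SWAPS and rng.random() < rate:
            words.append(NOISE_SWAPS[w])
        else:
            words.append(w)
    return " ".join(words)


def obfuscate(text, seed=0):
    """Run steps 1 to 3 in order with a seeded random generator."""
    rng = random.Random(seed)
    return add_noise(push_towards_average(text), rng)
```

Calling `push_towards_average("It rained. We left.")` merges the two short sentences, because their average length of 2 words is below the assumed target. A text whose average is above the target is split instead. The check inside `push_towards_average` discards a transformation that would move the metric further from the target.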

Notes

1. http://pan.webis.de.
2. We used the following books: The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle; History of the United States by Charles A. Beard and Mary R. Beard; Manual of Surgery, Volume First: General Surgery (Sixth Edition) by Alexis Thomson and Alexander Miles; and War and Peace by Leo Tolstoy.
3. We have released our code, including all our lexicons, in the following repository: https://bitbucket.org/pan2016authorobfuscation/authorobfuscation/.
4. http://pan.webis.de/clef16/pan16-web/author-obfuscation.html.

Acknowledgments

We thank the anonymous reviewers for their constructive comments, which have helped us improve the quality of the present paper.

This research was performed by a team of students from MSc programs in Computer Science at Sofia University “St. Kliment Ohridski”. The work is supported by the NSF of Bulgaria under Grant No. DN 02/11/2016 - ITDGate.

Author information

Authors and Affiliations

Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria
Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev & Ivan Koychev
Qatar Computing Research Institute, HBKU, Doha, Qatar
Preslav Nakov

Corresponding author

Correspondence to Tsvetomila Mihaylova.

Editor information

Editors and Affiliations

Dublin City University, Dublin, Ireland
Gareth J.F. Jones
Trinity College Dublin, Dublin, Ireland
Séamus Lawless
National University of Distance Education, Madrid, Spain
Julio Gonzalo
Dublin City University, Dublin, Ireland
Liadh Kelly
Université Grenoble Alpes, Grenoble, France
Lorraine Goeuriot
University of Hildesheim, Hildesheim, Germany
Thomas Mandl
University of Padua, Padua, Italy
Linda Cappellato
University of Padua, Padua, Italy
Nicola Ferro

About this paper

Cite this paper

Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., Koychev, I., Nakov, P. (2017). The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_18

DOI: https://doi.org/10.1007/978-3-319-65813-1_18
Published: 17 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65812-4
Online ISBN: 978-3-319-65813-1
eBook Packages: Computer Science, Computer Science (R0)
