Abstract
Users posting online often expect to remain anonymous unless they have logged in, and this anonymity is frequently what allows them to discuss various topics freely. Preserving the anonymity of a text’s writer can also be important in other contexts, e.g., in witness protection or anonymity programs. However, each person has his/her own style of writing, which can be analyzed using stylometry, and as a result the true identity of the author of a piece of text can be revealed even if s/he has tried to hide it. Thus, it could be helpful to design automatic tools that help a person obfuscate his/her identity when writing text. Here we propose an approach that changes the text so that it is pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative. The approach consists of three main steps: first, we calculate the values of some popular stylometric metrics that can indicate authorship; then, we apply various transformations to the text so that these metrics are adjusted towards the average level, while preserving the semantics and the soundness of the text; and finally, we add random noise. This approach turned out to be very effective, and it yielded the best performance on the Author Obfuscation task at the PAN-2016 competition.
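To make the first step more concrete, the following is a minimal sketch of measuring a text and deciding in which direction each measurement would need to move to approach an average. It is not the released implementation: the two metrics, the target values in `TARGETS`, the tolerance, and the function names `metrics` and `plan` are all assumptions chosen for illustration. The transformations themselves and the final noise step are not shown.

```python
import re

# Hypothetical "average" targets. These numbers are made up for illustration;
# they are not the averages used in the paper.
TARGETS = {"avg_sentence_len": 20.0, "avg_word_len": 4.5}


def metrics(text):
    """Compute two simple stylometric measurements for a text."""
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenization: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }


def plan(text, targets=TARGETS, tolerance=0.1):
    """For each metric, report whether it should move 'up', 'down', or 'keep'.

    A metric is left alone when it is already within `tolerance`
    (as a fraction of the target) of its target value.
    """
    result = {}
    for name, value in metrics(text).items():
        target = targets[name]
        if value > target * (1 + tolerance):
            result[name] = "down"
        elif value < target * (1 - tolerance):
            result[name] = "up"
        else:
            result[name] = "keep"
    return result


if __name__ == "__main__":
    sample = "I go. You run. We sit."
    print(metrics(sample))
    print(plan(sample))
```

A plan of this kind only says which way each metric should move; a real system would then pick text transformations (for example, splitting or merging sentences, or substituting words) that move the metric in that direction without damaging the meaning.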
Notes
- 1.
- 2. We used the following books: The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle; History of the United States by Charles A. Beard and Mary R. Beard; Manual of Surgery, Volume First: General Surgery (sixth edition) by Alexis Thomson and Alexander Miles; and War and Peace by Leo Tolstoy.
- 3. We have released our code, including all our lexicons, in the following repository: https://bitbucket.org/pan2016authorobfuscation/authorobfuscation/.
- 4.
Acknowledgments
We thank the anonymous reviewers for their constructive comments, which have helped us improve the quality of the present paper.
This research was performed by a team of students from the MSc programs in Computer Science at Sofia University “St. Kliment Ohridski”. The work is supported by the NSF of Bulgaria under Grant No. DN 02/11/2016 - ITDGate.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., Koychev, I., Nakov, P. (2017). The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_18
DOI: https://doi.org/10.1007/978-3-319-65813-1_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65812-4
Online ISBN: 978-3-319-65813-1
eBook Packages: Computer Science (R0)