The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation

(Best of the Labs Track at CLEF-2017)

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2017)

Abstract

Users who post online often expect to remain anonymous unless they have logged in, and this anonymity is often what allows them to discuss various topics freely. Preserving the anonymity of a text’s writer can also be important in other contexts, e.g., in witness protection or anonymity programs. However, each person has his/her own style of writing, which can be analyzed using stylometry; as a result, the true identity of the author of a piece of text can be revealed even if s/he has tried to hide it. Thus, it could be helpful to design automatic tools that help a person obfuscate his/her identity when writing text. Here we propose an approach that changes the text so that it is pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative. The approach consists of three main steps: first, we calculate the values of some popular stylometric metrics that can indicate authorship; then, we apply various transformations to the text, so that these metrics are adjusted towards the average level while preserving the semantics and the soundness of the text; and finally, we add random noise. This approach turned out to be very efficient, and it yielded the best performance on the Author Obfuscation task at the PAN-2016 competition.
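The three steps named in the abstract (measure, push towards the average, add noise) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two metrics, the target values in `TARGETS`, the sentence-merging transformation, and the substitution table `NOISE_SWAPS` are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import random
import re

# Placeholder targets (assumption): real averages would be estimated
# from a reference corpus, not hard-coded.
TARGETS = {"avg_sentence_len": 20.0, "type_token_ratio": 0.5}

# Tiny illustrative substitution table for the noise step (assumption).
NOISE_SWAPS = {"big": "large", "help": "assist", "show": "demonstrate"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def measure(text):
    """Step 1: compute two simple stylometric metrics."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = split_sentences(text)
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }


def merge_sentence_pairs(text):
    """Join adjacent sentences with ', and' to raise the average length.

    Limitation: the second sentence's first letter is lowercased blindly,
    which is wrong for proper nouns and the pronoun 'I'.
    """
    sentences = split_sentences(text)
    merged = []
    for i in range(0, len(sentences), 2):
        pair = sentences[i:i + 2]
        if len(pair) == 2:
            first = pair[0].rstrip(".!?")
            second = pair[1][0].lower() + pair[1][1:]
            merged.append(f"{first}, and {second}")
        else:
            merged.append(pair[0])
    return " ".join(merged)


def push_towards_average(text, targets=TARGETS):
    """Step 2: transform only when the text is below the target average."""
    if measure(text)["avg_sentence_len"] < targets["avg_sentence_len"]:
        return merge_sentence_pairs(text)
    return text


def add_noise(text, rng, rate=0.5):
    """Step 3: apply each available substitution with probability `rate`."""
    def maybe_swap(match):
        word = match.group(0)
        swap = NOISE_SWAPS.get(word.lower())
        if swap and rng.random() < rate:
            return swap
        return word
    return re.sub(r"[A-Za-z']+", maybe_swap, text)


def obfuscate(text, seed=0):
    """Run steps 2 and 3 in order; step 1 is used inside step 2."""
    rng = random.Random(seed)
    return add_noise(push_towards_average(text), rng)
```

For example, `push_towards_average("The dog is big. It can help.")` merges the two short sentences into one, which moves the average sentence length from 3.5 words towards the placeholder target of 20. A fuller version would also need the opposite transformation (splitting long sentences) and a check that meaning is preserved.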

Notes

  1. http://pan.webis.de.

  2. We used the following books: The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle; History of the United States by Charles A. Beard and Mary R. Beard; Manual of Surgery, Volume First: General Surgery (sixth edition) by Alexis Thomson and Alexander Miles; and War and Peace by Leo Tolstoy.

  3. We have released our code, including all our lexicons, in the following repository: https://bitbucket.org/pan2016authorobfuscation/authorobfuscation/.

  4. http://pan.webis.de/clef16/pan16-web/author-obfuscation.html.

Acknowledgments

We thank the anonymous reviewers for their constructive comments, which have helped us improve the quality of the present paper.

This research was performed by a team of students from MSc programs in Computer Science at Sofia University “St. Kliment Ohridski”. The work is supported by the NSF of Bulgaria under Grant No.: DN 02/11/2016 - ITDGate.

Author information

Corresponding author

Correspondence to Tsvetomila Mihaylova.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Karadzhov, G., Mihaylova, T., Kiprov, Y., Georgiev, G., Koychev, I., Nakov, P. (2017). The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_18

  • DOI: https://doi.org/10.1007/978-3-319-65813-1_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65812-4

  • Online ISBN: 978-3-319-65813-1

  • eBook Packages: Computer Science, Computer Science (R0)
