Abstract
We report how blogging style varies with the gender and age group of the author. The results are based on two mutually independent features. The first is the use of slang words, which we propose as a new feature for the stylistic study of bloggers. The second is the average sentence length, whose variation we analyze across age groups and genders. These two features are combined with features reported in earlier work on stylistic analysis. The combined feature list markedly improves accuracy in predicting age and gender. The machine learning experiments were carried out on two separate demographically tagged blog corpora. Over data spread across all ages, gender determination is more accurate than age group detection, but the accuracy of age prediction increases when the sampled age groups are widely separated.
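As an illustration only, the two features named in the abstract could be computed along the following lines. This is a minimal sketch, not the authors' implementation: the slang lexicon `SLANG`, the regex-based tokenizer and sentence splitter, and the function names are all assumptions made for the example, and the abstract does not say how slang words were identified or how sentences were segmented.

```python
import re

# Hypothetical slang lexicon, used purely for illustration;
# the paper's actual slang list is not given in the abstract.
SLANG = {"lol", "omg", "gonna", "wanna", "brb"}

def split_sentences(text):
    # Naive splitter (an assumption): break on runs of . ! ? that are
    # followed by whitespace or the end of the text.
    parts = re.split(r"[.!?]+(?:\s+|$)", text)
    return [p for p in parts if p.strip()]

def tokenize(text):
    # Lowercase word tokens made of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def stylistic_features(text, slang=SLANG):
    tokens = tokenize(text)
    sentences = split_sentences(text)
    slang_count = sum(1 for t in tokens if t in slang)
    # Fraction of tokens that are slang words.
    slang_rate = slang_count / len(tokens) if tokens else 0.0
    # Average sentence length, measured in tokens per sentence.
    avg_len = len(tokens) / len(sentences) if sentences else 0.0
    return {"slang_rate": slang_rate, "avg_sentence_length": avg_len}

features = stylistic_features("omg I am gonna be late. See you soon!")
```

In an experiment of the kind the abstract describes, per-blog values like these would be appended to an existing stylistic feature vector before a classifier is trained on it.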
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rustagi, M., Prasath, R.R., Goswami, S., Sarkar, S. (2009). Learning Age and Gender of Blogger from Stylistic Variation. In: Chaudhury, S., Mitra, S., Murthy, C.A., Sastry, P.S., Pal, S.K. (eds) Pattern Recognition and Machine Intelligence. PReMI 2009. Lecture Notes in Computer Science, vol 5909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11164-8_33
DOI: https://doi.org/10.1007/978-3-642-11164-8_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11163-1
Online ISBN: 978-3-642-11164-8
eBook Packages: Computer Science, Computer Science (R0)