DOI: 10.1145/2736277.2741627

Statistically Significant Detection of Linguistic Change

Published: 18 May 2015

Abstract

We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.
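As a rough illustration of the final step described above, the sketch below builds a distance-based time series for a single word. It assumes the per-period embeddings have already been trained and aligned into one coordinate system, and it uses cosine distance from the first period as the displacement measure. The function names, the toy vectors, and the choice of cosine distance are assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def displacement_series(vectors_by_period):
    """Distance of a word's vector in each period from its first-period vector.

    Assumes the vectors are ordered by time and already live in one
    shared (aligned) coordinate system.
    """
    base = vectors_by_period[0]
    return [cosine_distance(base, v) for v in vectors_by_period]


# Hypothetical 2-d embeddings of one word across four time periods.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
series = displacement_series(toy_vectors)
print(series)
```

In this toy input the word's vector stays put for two periods and then rotates to an orthogonal direction, so the series jumps from zero to a large distance at the third period.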
We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium.
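The abstract does not spell out which change point detection algorithm is applied to a word's time series. The sketch below shows one generic possibility, a mean-shift statistic with a permutation test for significance, and may differ from the procedure actually used in the paper. All function names and parameter values here are hypothetical.

```python
import random


def mean_shift_scores(series):
    """For each split j, return mean(series[j:]) - mean(series[:j])."""
    scores = []
    for j in range(1, len(series)):
        before = series[:j]
        after = series[j:]
        scores.append(sum(after) / len(after) - sum(before) / len(before))
    return scores


def detect_change_point(series, n_permutations=1000, seed=0):
    """Return (index where the new regime starts, permutation p-value).

    One-sided: only upward shifts in the mean are scored. The p-value is
    the fraction of shuffled series whose largest mean shift is at least
    as large as the largest shift observed in the original ordering.
    """
    rng = random.Random(seed)
    observed = mean_shift_scores(series)
    best = max(range(len(observed)), key=lambda j: observed[j])
    best_score = observed[best]

    exceed = 0
    shuffled = list(series)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if max(mean_shift_scores(shuffled)) >= best_score:
            exceed += 1
    # scores[j] corresponds to a split before series index j + 1
    return best + 1, exceed / n_permutations


# Hypothetical displacement series with an abrupt shift halfway through.
toy_series = [0.0] * 5 + [1.0] * 5
change_index, p_value = detect_change_point(toy_series)
print(change_index, p_value)
```

Shuffling destroys any temporal ordering, so a shift that survives comparison against many shuffles is unlikely to be an artifact of noise. A word would be reported as changed only when the p-value falls below a chosen threshold.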


    Published In

    WWW '15: Proceedings of the 24th International Conference on World Wide Web
    May 2015
    1460 pages
    ISBN:9781450334693

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    International World Wide Web Conferences Steering Committee

    Republic and Canton of Geneva, Switzerland

    Author Tags

    1. computational linguistics
    2. web mining

    Qualifiers

    • Research-article

    Funding Sources

    • NSF
    • Google Faculty Research Award
    • Renaissance Technologies Fellowship
    • Institute for Computational Science at Stony Brook University

    Acceptance Rates

    WWW '15 Paper Acceptance Rate 131 of 929 submissions, 14%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
