DOI: 10.5555/645328.650004

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

Published: 05 September 2001

Abstract

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
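
The idea of ranking candidate synonyms by Pointwise Mutual Information over search-engine hit counts can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the scoring formula or query syntax, so the sketch uses the textbook definition of PMI, and the names `pmi`, `pick_synonym`, `toy_hits`, `TOY_COUNTS`, and `TOY_TOTAL` are hypothetical. The counts are invented for the example and stand in for what a real search engine would return.

```python
import math
from typing import Callable, Dict, FrozenSet, Sequence

# Hypothetical interface: given a set of words, return how many documents
# contain all of them. A real system would query a search engine here.
HitCounter = Callable[[FrozenSet[str]], int]


def pmi(joint: int, count_a: int, count_b: int, total: int) -> float:
    """Textbook PMI in bits: log2( p(a, b) / (p(a) * p(b)) )."""
    if min(joint, count_a, count_b) == 0:
        return float("-inf")  # no evidence of association
    p_joint = joint / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_joint / (p_a * p_b))


def pick_synonym(problem: str, choices: Sequence[str],
                 hits: HitCounter, total: int) -> str:
    """Return the choice whose co-occurrence with `problem` has the highest PMI."""
    def score(choice: str) -> float:
        return pmi(hits(frozenset({problem, choice})),
                   hits(frozenset({problem})),
                   hits(frozenset({choice})),
                   total)
    return max(choices, key=score)


# Invented counts for illustration only; these are NOT real search results.
TOY_TOTAL = 100_000
TOY_COUNTS: Dict[FrozenSet[str], int] = {
    frozenset({"big"}): 1000,
    frozenset({"large"}): 800,
    frozenset({"small"}): 900,
    frozenset({"red"}): 700,
    frozenset({"big", "large"}): 200,
    frozenset({"big", "small"}): 90,
    frozenset({"big", "red"}): 35,
}


def toy_hits(words: FrozenSet[str]) -> int:
    return TOY_COUNTS.get(words, 0)


best = pick_synonym("big", ["large", "small", "red"], toy_hits, TOY_TOTAL)
print(best)  # prints: large
```

Because the problem word's own frequency is the same for every choice, ranking by PMI here is equivalent to ranking by the ratio of joint hits to the choice's hits; the full PMI form is kept only to make the definition explicit.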


Published In

ECML '01: Proceedings of the 12th European Conference on Machine Learning
September 2001
611 pages
ISBN: 3540425365

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

  • Article

Cited By

  • (2020) Effective Identification of Distinctive Wordmarks. Companion Proceedings of the Web Conference 2020, 471-477. DOI: 10.1145/3366424.3386200. Online publication date: 20-Apr-2020
  • (2019) From word to sense embeddings. Journal of Artificial Intelligence Research, 63:1, 743-788. DOI: 10.1613/jair.1.11259. Online publication date: 17-Apr-2019
  • (2018) Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement. ACM Transactions on Asian and Low-Resource Language Information Processing, 17:3, 1-21. DOI: 10.1145/3182622. Online publication date: 2-Apr-2018
  • (2017) A path-based model for emotion abstraction on facebook using sentiment analysis and taxonomy knowledge. Proceedings of the International Conference on Web Intelligence, 947-952. DOI: 10.1145/3106426.3109420. Online publication date: 23-Aug-2017
  • (2017) SEMO. Proceedings of the International Conference on Web Intelligence, 953-958. DOI: 10.1145/3106426.3109417. Online publication date: 23-Aug-2017
  • (2017) Corpus-Based Translation Induction in Indian Languages Using Auxiliary Language Corpora from Wikipedia. ACM Transactions on Asian and Low-Resource Language Information Processing, 16:3, 1-25. DOI: 10.1145/3038295. Online publication date: 17-Mar-2017
  • (2017) Inquire. Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, 1562-1575. DOI: 10.1145/2998181.2998363. Online publication date: 25-Feb-2017
  • (2017) Extracting Product Features for Opinion Mining Using Public Conversations in Twitter. Procedia Computer Science, 112:C, 927-935. DOI: 10.1016/j.procs.2017.08.122. Online publication date: 1-Sep-2017
  • (2016) Word representation using a deep neural network. Proceedings of the 26th Annual International Conference on Computer Science and Software Engineering, 268-279. DOI: 10.5555/3049877.3049906. Online publication date: 31-Oct-2016
  • (2016) Web-based similarity for emotion recognition in web objects. Proceedings of the 9th International Conference on Utility and Cloud Computing, 327-332. DOI: 10.1145/2996890.3007883. Online publication date: 6-Dec-2016
