Article

Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

Author:

Peter D. Turney

ECML '01: Proceedings of the 12th European Conference on Machine Learning

Pages 491 - 502

Published: 05 September 2001

Abstract

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
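
As a rough illustration of the idea in the abstract, the following sketch scores each candidate synonym by its pointwise mutual information with the problem word and picks the highest-scoring candidate. The abstract does not give the exact scoring formula or query syntax, so everything here is an assumption: the `HITS` table, its counts, `TOTAL_DOCS`, the example words, and the function names `pmi` and `best_synonym` are hypothetical stand-ins for counts that would come from search engine queries.

```python
import math

# Hypothetical hit counts standing in for search engine results.
# String keys are single-word counts; tuple keys are co-occurrence counts.
HITS = {
    "levied": 1000,
    "imposed": 5000,
    "believed": 20000,
    "requested": 15000,
    "correlated": 3000,
    ("levied", "imposed"): 400,
    ("levied", "believed"): 100,
    ("levied", "requested"): 90,
    ("levied", "correlated"): 6,
}
TOTAL_DOCS = 1_000_000  # assumed size of the indexed collection


def pmi(problem, choice, hits=HITS, total=TOTAL_DOCS):
    """Pointwise mutual information: log2(p(a, b) / (p(a) * p(b)))."""
    joint = hits.get((problem, choice), 0)
    if joint == 0:
        # The words never co-occur in the (made-up) counts.
        return float("-inf")
    p_joint = joint / total
    p_problem = hits[problem] / total
    p_choice = hits[choice] / total
    return math.log2(p_joint / (p_problem * p_choice))


def best_synonym(problem, choices):
    """Return the candidate with the highest PMI against the problem word."""
    return max(choices, key=lambda c: pmi(problem, c))


if __name__ == "__main__":
    choices = ["imposed", "believed", "requested", "correlated"]
    for c in choices:
        print(f"{c}: {pmi('levied', c):.2f}")
    print("best:", best_synonym("levied", choices))
```

Because the problem word is the same for every candidate, its probability only shifts all scores by a constant; ranking by the ratio of co-occurrence hits to the candidate's own hits would select the same answer.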




Published In

ECML '01: Proceedings of the 12th European Conference on Machine Learning

September 2001

611 pages

ISBN:3540425365

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 05 September 2001


