Abstract
We present an unsupervised phrase relatedness function (f) that has been applied in a Semantic Textual Similarity system (TrWP) at SemEval-2015. The best run of TrWP ranked 33rd among 73 runs. f measures the relatedness strength between two phrases using overlapping bi-gram context extracted from the Google-n-gram corpus. Relatedness strength is the strength of association that captures how similar or dissimilar two phrases are. To compute it, f applies a sum-ratio (SR) technique based on the statistics of the overlapping n-grams associated with the two input phrases. Experimental results show that f improves over existing phrase relatedness methods on two standard datasets of 216 phrase-pairs. f does not require any human-annotated resource and is independent of the syntactic structure of phrases.
Notes
1. We use ‘relatedness’ and ‘similarity’ interchangeably in this paper, although ‘similarity’ is a special case, or subset, of ‘relatedness’.
2. We use the term Sum-Ratio to denote the weighted mean of two numbers.
3. Pruning the bi-gram contexts implies pruning the Google-n-grams from which those contexts are extracted.
4. We prefer Pearson’s r to Spearman’s \(\rho \) because Agirre et al. [28] state that Pearson’s r is more informative than Spearman’s \(\rho \): Spearman’s \(\rho \) considers rank differences, while Pearson’s r takes value differences into account. Moreover, SemEval-2013 [28] used Pearson’s r for its evaluation.
5. Pearson’s r is not computed for Mitchell and Lapata’s [7] system because their individual phrase-pair scores are unavailable. Moreover, in an attempt to reproduce Mitchell and Lapata’s [7] method, Hartung and Frank [6] obtained Spearman’s \(\rho = 0.34\) instead of \(\rho = 0.46\) on 108 adjective-noun pairs.
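The distinction drawn in the note on Pearson’s r and Spearman’s \(\rho \) can be illustrated with a minimal sketch. The score lists below are invented toy values, not data from the paper, and the helper names (`pearson_r`, `average_ranks`, `spearman_rho`) are hypothetical. The system scores preserve the human ranking perfectly, so \(\rho \) is 1, while r falls noticeably below 1 because the values themselves are far from linearly related.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r: covariance of the raw values over the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks instead of the values."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Toy scores (assumed for illustration only): same ordering, very different spacing.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
system = [0.10, 0.11, 0.12, 0.13, 0.90]

rho = spearman_rho(human, system)  # ranks agree exactly
r = pearson_r(human, system)       # penalised for the non-linear spacing

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

In this sense r carries information about score magnitudes that \(\rho \) discards, which is consistent with the preference stated in the note.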
References
Zamir, O., Etzioni, O.: Grouper: a dynamic clustering interface to web search results. In: Proceedings of the Eighth International Conference on World Wide Web, WWW 1999, New York, USA, pp. 1361–1374 (1999)
Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Trans. Knowl. Data Eng. 20(9), 1217–1229 (2008)
Charniak, E.: Statistical Language Learning. MIT Press, Cambridge (1993)
Hammouda, K., Kamel, M.: Efficient phrase-based document indexing for web document clustering. IEEE Trans. Knowl. Data Eng. 16(10), 1279–1296 (2004)
Pera, M.S., Ng, Y.K.: SpamED: a spam e-mail detection approach based on phrase similarity. J. Am. Soc. Inf. Sci. Technol. 60(2), 393–409 (2009)
Hartung, M., Frank, A.: Assessing interpretable, attribute-related meaning representations for adjective-noun phrases in a similarity prediction task. In: Proceedings of the GEMS 2011 Workshop, Stroudsburg, PA, USA, pp. 52–61 (2011)
Mitchell, J., Lapata, M.: Composition in distributional models of semantics. Cogn. Sci. 34(8), 1388–1429 (2010)
Baroni, M.: Composition in distributional semantics. Lang. Linguist. Compass 7(10), 511–522 (2013)
Annesi, P., Storch, V., Basili, R.: Space projections as distributional models for semantic composition. In: Gelbukh, A. (ed.) CICLing 2012, Part I. LNCS, vol. 7181, pp. 323–335. Springer, Heidelberg (2012)
Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J.: UMBC_EBIQUITY-CORE: semantic textual similarity systems. In: Proceedings of the Second Joint Conference on Lexical and Computational Semantics, June 2013
Tsatsaronis, G., Varlamis, I., Vazirgiannis, M., Nørvåg, K.: Omiotis: a thesaurus-based measure of text relatedness. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009, Part II. LNCS, vol. 5782, pp. 742–745. Springer, Heidelberg (2009)
Bollegala, D., Matsuo, Y., Ishizuka, M.: A web search engine-based approach to measure semantic similarity between words. IEEE Trans. Knowl. Data Eng. 23(7), 977–990 (2011)
Cilibrasi, R.L., Vitanyi, P.M.B.: The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19(3), 370–383 (2007)
Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the Fifteenth ICML, ICML 1998, San Francisco, CA, USA, pp. 296–304 (1998)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc., New York (1986)
Turney, P.D.: Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, p. 491. Springer, Heidelberg (2001)
Rakib, M.R.H., Islam, A., Milios, E.: TrWP: text relatedness using word and phrase relatedness. In: Proceedings of the SemEval 2015, Colorado, pp. 90–95 (2015)
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Int. Res. 37(1), 141–188 (2010)
Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, ACL 1998, pp. 768–774 (1998)
Brants, T., Franz, A.: Web 1T 5-gram corpus version 1.1. Linguistic Data Consortium (2006)
Reddy, S., Klapaftis, I., McCarthy, D., Manandhar, S.: Dynamic and static prototype vectors for semantic composition. In: Proceedings of the 5th International Joint Conference on Natural Language Processing, Thailand, pp. 705–713, November 2011
Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behav. Res. Methods Instrum. Comput. 28(2), 203–208 (1996)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Vilares, M., Ribadas, F.J., Vilares, J.: Phrase similarity through the edit distance. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds.) DEXA 2004. LNCS, vol. 3180, pp. 306–317. Springer, Heidelberg (2004)
Islam, A., Milios, E., Kešelj, V.: Comparing word relatedness measures based on Google n-grams. In: COLING (Posters), pp. 495–506 (2012)
Gracia, J., Trillo, R., Espinoza, M., Mena, E.: Querying the web: a multiontology disambiguation method. In: Proceedings of the 6th International Conference on Web Engineering, ICWE 2006, pp. 241–248. ACM, New York (2006)
Bohm, G., Zech, G.: Introduction to statistics and data analysis for physicists. DESY (2010)
Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W.: *SEM 2013 shared task: semantic textual similarity. In: Second Joint Conference on Lexical and Computational Semantics, Atlanta, Georgia, USA, pp. 32–43, June 2013
Zou, G.Y.: Toward using confidence intervals to compare correlations. Psychol. Methods 12(4), 399–413 (2007)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Rakib, M.R.H., Islam, A., Milios, E. (2016). f: Phrase Relatedness Function Using Overlapping Bi-gram Context. In: Khoury, R., Drummond, C. (eds) Advances in Artificial Intelligence. Canadian AI 2016. Lecture Notes in Computer Science(), vol 9673. Springer, Cham. https://doi.org/10.1007/978-3-319-34111-8_19
DOI: https://doi.org/10.1007/978-3-319-34111-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34110-1
Online ISBN: 978-3-319-34111-8
eBook Packages: Computer Science (R0)