Abstract
In this paper, we address the task of sentence paraphrase detection, which is to decide whether two sentences stand in a paraphrase relationship. We describe a supervised learning strategy in which a sentence pair is classified to decide the paraphrase relationship, using only lexical features computed over n-grams as the classification features. Gradient Boosting, K-Nearest Neighbor, Decision Tree and Support Vector Machine are chosen as the classifiers. The performance of the classifiers is compared, and the features are analyzed to determine which of them are most important for paraphrase detection. Evaluation is performed on the corpus of the 2016 Detecting Paraphrases in Indian Languages task proposed by the Forum for Information Retrieval Evaluation (DPIL-FIRE2016). The experimental results show that Gradient Boosting achieves the highest Overall Score. Using the learned classifier, we obtained the highest F1 measure for both Task1 and Task2 on Malayalam and Tamil, and the highest F1 measure for Task2 on Punjabi in DPIL-FIRE2016.
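The strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact lexical features, so the Jaccard overlap of word n-grams used here is an assumption, as are the function names `word_ngrams`, `overlap_features` and `knn_predict`, the whitespace tokenization, the labels `P` and `NP`, and the toy English sentence pairs. A small nearest-neighbor vote stands in for the classifiers named above; it is not the authors' implementation.

```python
from collections import Counter
from math import dist


def word_ngrams(sentence, n):
    """Return the multiset of word n-grams in a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_features(s1, s2, max_n=3):
    """One feature per n-gram order: Jaccard overlap of the two n-gram multisets."""
    features = []
    for n in range(1, max_n + 1):
        a, b = word_ngrams(s1, n), word_ngrams(s2, n)
        shared = sum((a & b).values())  # multiset intersection size
        total = sum((a | b).values())   # multiset union size
        features.append(shared / total if total else 0.0)
    return features


def knn_predict(train, query, k=3):
    """Label a feature vector by majority vote among its k nearest training vectors."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, hand-made sentence pairs (hypothetical; not taken from the DPIL corpus).
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat today", "P"),
    ("he bought a new car", "he bought a new car yesterday", "P"),
    ("the cat sat on the mat", "stock prices fell sharply", "NP"),
    ("he bought a new car", "rain is expected tomorrow", "NP"),
]
train = [(overlap_features(a, b), label) for a, b, label in pairs]
query = overlap_features("she reads a book", "she reads a book daily")
prediction = knn_predict(train, query, k=1)
print(prediction)
```

The same feature vectors could be passed to any other classifier, for instance a gradient boosting implementation from a machine learning library, which is the model the abstract reports as performing best.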
Acknowledgments
This work is supported by the Social Science Fund of Heilongjiang Province (No. 16XWB02).
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Tian, L., Ning, H., Kong, L., Chen, K., Qi, H., Han, Z. (2018). Sentence Paraphrase Detection Using Classification Models. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_13
DOI: https://doi.org/10.1007/978-3-319-73606-8_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science (R0)