SIDEAS - Detecting the Semantic Similarity of Discourses
Abstract
Texts posted in abundance on digital platforms today may exhibit semantic similarities whose automatic detection is essential for applications such as plagiarism identification and the analysis of social movements. However, detecting semantic similarity between discourses, which can convey analogous ideas through different lexical and syntactic constructions, remains an underexplored challenge. The main goal of this work is to compare approaches for measuring and classifying the semantic similarity of discourses in short texts. It first investigates the use of traditional and contextualized embeddings of corresponding structural components of the discourses. It then explores the use of language models to measure and classify similarities directly on the raw texts. The effectiveness of these approaches was evaluated in experiments using three distinct corpora. The experimental results show that appropriate prompting of GPT yields better performance than using word embeddings to compare discourse components, thereby establishing a comparative baseline for future studies in this area.
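As a rough illustration of the embedding-based comparison of corresponding discourse components, the sketch below averages word vectors per component and takes the mean cosine similarity over the components two discourses share. It is not the authors' implementation: the component labels (`agent`, `event`, `place`), the toy three-dimensional vectors, and the function names are all assumptions made for this example, and a real system would use pretrained embeddings instead.

```python
import math

# Toy 3-dimensional word vectors; hypothetical values for illustration only.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "runs": [0.1, 0.9, 0.0],
    "sprints": [0.2, 0.8, 0.1],
    "park": [0.0, 0.1, 0.9],
    "garden": [0.1, 0.2, 0.8],
}


def mean_vector(words):
    """Average the vectors of the known words in one component."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def discourse_similarity(components_a, components_b):
    """Mean cosine similarity over the component labels both discourses share."""
    scores = []
    for label in components_a.keys() & components_b.keys():
        u = mean_vector(components_a[label])
        v = mean_vector(components_b[label])
        if u is not None and v is not None:
            scores.append(cosine(u, v))
    return sum(scores) / len(scores) if scores else 0.0


# Two short discourses split into hypothetical structural components.
discourse_a = {"agent": ["dog"], "event": ["runs"], "place": ["park"]}
discourse_b = {"agent": ["puppy"], "event": ["sprints"], "place": ["garden"]}
score = discourse_similarity(discourse_a, discourse_b)
print(f"similarity: {score:.3f}")
```

Comparing component against component, instead of whole text against whole text, lets two discourses with different wording still score highly when their corresponding parts are close in the embedding space.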
Keywords:
natural language processing, discourse similarity, embeddings
Published
14 October 2024
How to Cite
COSTA, Rita C. A. B.; BRAZ JÚNIOR, Osmar O.; FILETO, Renato. SIDEAS - Detectando a Similaridade Semântica de Discursos. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 471-484. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240261.