Optimizing Botanical Data Integrity: A Comparative Study of Text Similarity Methods

  • Luma G. R. Cerqueira, Universidade Federal de Santa Catarina (UFSC)
  • Carina F. Dorneles, Universidade Federal de Santa Catarina (UFSC)
  • Simone S. Werner, Universidade Federal de Santa Catarina (UFSC)

Abstract

In this study, we address the challenges of managing authorship nomenclature, as dictated by the International Code of Nomenclature for algae, fungi, and plants (ICN), in databases of the Begoniaceae and Bignoniaceae families. Our goal was to evaluate how effectively various text similarity algorithms deduplicate botanical data while preserving accuracy in authorship and synonymy. In our results, Smith-Waterman achieved the best balance of precision, recall, and F1 score, which suggests it could serve as a robust solution for improving database integrity. The study also shows that these algorithms must be fine-tuned to handle the particular challenges of botanical data management, underlining the need for specialized approaches in this field.

Keywords: Short Text Similarity, Botanical Databases, Similarity Function
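
To illustrate the kind of similarity function the abstract refers to, the following is a minimal sketch of a Smith-Waterman local alignment score normalized to the range [0, 1]. It is not the authors' implementation: the function names `smith_waterman_score` and `similarity`, the scoring parameters (match 2, mismatch -1, gap -1), the case folding, and the normalization by the shorter string are all assumptions made for illustration.

```python
# Hypothetical sketch; names, scoring parameters, and normalization are
# illustrative assumptions, not taken from the paper.

def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Scores are floored at 0 so an alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def similarity(a: str, b: str, match: int = 2) -> float:
    """Normalize the score by the best score the shorter string could reach."""
    a, b = a.casefold(), b.casefold()
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shortest)


if __name__ == "__main__":
    for left, right in [("Kunth", "kunth"), ("L.", "L. f."), ("abc", "xyz")]:
        print(left, "|", right, "|", similarity(left, right))
```

Because the alignment is local, a string that is fully contained in another (such as "L." in "L. f.") receives the maximum similarity under this normalization. This is one reason a decision threshold, or the normalization itself, would likely need tuning before such a function is used to deduplicate author strings.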

Published
14/10/2024
CERQUEIRA, Luma G. R.; DORNELES, Carina F.; WERNER, Simone S. Optimizing Botanical Data Integrity: A Comparative Study of Text Similarity Methods. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 406-417. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240254.