Abstract
During the COVID-19 pandemic, a concentrated effort was made to collate published literature on SARS-CoV-2 and other coronaviruses for the benefit of the medical community. One such initiative is the COVID-19 Open Research Dataset (CORD-19), which contains over 400,000 published research articles. To give health workers and researchers faster access to relevant sources, effective information retrieval (IR) and information extraction systems are vital. This article presents an IR approach that uses transformer-based models to enable question answering and abstractive summarization. Keyword-based and neural-network-based models are evaluated and combined to reduce the search space and identify relevant sentences in the corpus for ranked retrieval. For abstractive summarization, candidate sentences are selected using a combination of standard scoring metrics. Finally, the summary and the user query are used together to support question answering. The proposed model is evaluated with standard metrics on the CovidQA dataset for both natural language and keyword queries. The proposed approach achieved promising performance for both query classes and outperformed several unsupervised baselines.
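As a rough illustration of the keyword-based stage that narrows the search space before any neural model is applied, the sketch below ranks a handful of documents against a query with Okapi BM25. The abstract does not say which keyword scorer is used, so BM25 is an assumption here, as are the function names `tokenize` and `bm25_rank`, the whitespace tokenizer, and the parameter defaults `k1=1.5` and `b=0.75`.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (score, doc_index) pairs sorted from best to worst match.

    Hypothetical helper: scores each document against the query with the
    Okapi BM25 formula, using a smoothed, always-positive IDF.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = tokenize(query)
    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises documents longer than average.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, idx))

    # Highest score first; ties broken by original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

In a pipeline of the kind the abstract outlines, only the top-ranked documents from a cheap scorer like this would typically be passed on to the more expensive neural ranking, summarization, and question-answering stages.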
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shenoy, N., Nayak, P., Jain, S., Sowmya Kamath, S., Sugumaran, V. (2023). Effective Information Retrieval, Question Answering and Abstractive Summarization on Large-Scale Biomedical Document Corpora. In: Métais, E., Meziane, F., Sugumaran, V., Manning, W., Reiff-Marganiec, S. (eds) Natural Language Processing and Information Systems. NLDB 2023. Lecture Notes in Computer Science, vol 13913. Springer, Cham. https://doi.org/10.1007/978-3-031-35320-8_29
DOI: https://doi.org/10.1007/978-3-031-35320-8_29
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35319-2
Online ISBN: 978-3-031-35320-8
eBook Packages: Computer Science, Computer Science (R0)