Abstract
This letter discusses the limitations of using filters to improve the accuracy of parenthetic abbreviation extraction from scholarly publications. It proposes two alternatives: the parentheses level count algorithm, to efficiently extract entities between parentheses from raw texts, and machine learning-based supervised classification, to identify biomedical abbreviations and thereby significantly reduce the removal of acronyms that include disallowed punctuation marks.
Keywords: data mining, information retrieval, acronym extraction, parenthetic acronym
We recently read with great interest the research work of Lu et al, published in the Journal of the American Medical Informatics Association, about the development of the UMLS SPECIALIST lexicon and lexical tools for natural language processing of biomedical texts.1 We were highly impressed by the efficiency of n-grams in largely expanding biomedical semantic resources with unsupported acronyms or synonyms of biomedical concepts, and by the notable precision of the CUI-endword matcher model in performing this task.1 Despite these important outcomes, we found that the method used to extract parenthetic abbreviations depends on a number of filters that certainly improve the precision of the process (71.09% precision for this research paper) but harm the recall and speed of the extraction procedure.2
On the one hand, eliminating entities that include banned symbols (disallowed punctuation marks are {, }, _, !, @, #, *, \, ;, ”, ?, ∼, =, |, <, >, $, `, and ^) with the Disallowed Punctuation Filter can cause several acronyms that contain such punctuation marks to be overlooked.2 These abbreviations include the names of several institutions (eg, European Association of Urology [@Uroweb] and European Network for the Study of Adrenal Tumors [ENS@T]), medical tests and systems (eg, electronic prescription on Catalonia [Rec@t], memory alteration test [M@T], and DichroMatch at the protein circular dichroism data bank [DM@PCDDB]), and even biological processes and compounds (eg, immunoglobulin heavy chain [IGH@]). Here, machine learning can be used to classify the entities affected by this filter as abbreviations or non-abbreviations. Techniques suitable for this purpose range from random forests, gradient-boosted trees, and support vector machines to deep learning.3 These methods can significantly enhance the coverage of biomedical abbreviations that include symbols and consequently improve the recall of parenthetic acronym retrieval while preserving the overall accuracy of the extraction.4
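As a rough illustration of the classification idea above, the following sketch turns a candidate entity into a small feature dictionary that a supervised classifier could consume. The function name `candidate_features` and every feature in it are assumptions made for this example; the letter does not specify a feature set, and no classifier is trained here.

```python
import string


def candidate_features(entity: str) -> dict:
    """Hypothetical surface features for a parenthetic candidate."""
    n = len(entity)
    # Count ASCII punctuation marks such as "@", which the filter would ban.
    n_symbols = sum(ch in string.punctuation for ch in entity)
    return {
        "length": n,
        "n_tokens": len(entity.split()),
        "upper_ratio": sum(ch.isupper() for ch in entity) / n if n else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in entity) / n if n else 0.0,
        "n_symbols": n_symbols,
        "starts_with_symbol": bool(entity) and entity[0] in string.punctuation,
    }


# Two abbreviations quoted in the letter that contain "@".
for entity in ["ENS@T", "@Uroweb"]:
    print(entity, candidate_features(entity))
```

Feature dictionaries of this kind could then be vectorized and passed to any of the classifier families named above, trained on labeled abbreviations and non-abbreviations.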
On the other hand, the use of an incomplete pattern filter to remove entities with unclosed parentheses or brackets indicates that the extraction method retrieves character strings regardless of whether the closing parenthesis matches the opening parenthesis (eg, chi-squared test [chi(2)]). Such behavior leads to the useless and time-consuming extraction and subsequent removal of many textual strings, particularly because many biomedical abbreviations currently use parentheses to transcribe numeric subscripts (eg, corrected cardio-ankle vascular index [CAVI(0)], extracorporeal carbon dioxide removal [ECCO(2)R], 50% cytotoxic concentration [CC(50)], cisplatin-cyclophosphamide-doxorubicin [CISCA(II)], and vinblastine-bleomycin [VB(IV)]), numbers (eg, corticotropin-releasing factor-1 [CRF(1)]), trademark symbols (eg, COPD assessment test [CAT(TM)]), letter superscripts (eg, mass spectrometry [MS(E)]), or lowercase letters (eg, anthrax edema factor [EF(N)] and fractional exhaled nitric oxide [F(E)NO]). A letter can also appear between parentheses when the word it stands for is itself parenthesized (eg, [modified] Edinburgh-2 Coma Scale [E2CS(M)]). The retrieval of incomplete parenthetic entities can be easily prevented by applying an appropriate and nonrandom method for extracting the abbreviations between parentheses, as shown in Algorithm 1.
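The failure mode described above can be reproduced with a deliberately naive extractor. The sketch below assumes a regular expression that stops at the first closing parenthesis; this is a hypothetical stand-in chosen for illustration, not the method of the cited work. The helper `is_balanced` plays the role of an incomplete pattern check.

```python
import re

# Hypothetical naive extractor: capture everything up to the first ")".
NAIVE_PATTERN = re.compile(r"\(([^)]*)\)")


def naive_extract(text: str) -> list:
    """Return the captured spans; nested parentheses get truncated."""
    return NAIVE_PATTERN.findall(text)


def is_balanced(entity: str) -> bool:
    """True when every ")" closes an earlier "(" and none stay open."""
    depth = 0
    for ch in entity:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


candidates = naive_extract("corrected cardio-ankle vascular index (CAVI(0))")
print(candidates)                            # ['CAVI(0']
print([is_balanced(c) for c in candidates])  # [False]
```

The truncated string is extracted first and only then rejected by the balance check, which is the wasted work the paragraph above refers to.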
Algorithm 1: Proposed algorithm for extracting entities between parentheses from raw texts
Input: s: text
Output: t: list of the entities between parentheses
Algorithm PLA
Variable
    i, n, l: integer
    abb: string
Begin
    l ← 0
    abb ← ""
    n ← 0
    For i = 1 to length(s) do
        if (s[i] = "(") then l ← l + 1
        if (s[i] = ")") then l ← l - 1
        if (l ≠ 0) then abb ← abb + s[i]
        if (l = 0) and (abb ≠ "") then begin
            n ← n + 1
            t[n] ← abb
            abb ← ""
        End if
    End for
End
This algorithm is called parentheses level count and has been used for years by scientific calculators to handle parentheses in mathematical equations.5 It extracts a character string only when the opening parenthesis matches the closing parenthesis and when the pair is not nested inside another pair of parentheses. The algorithm represents entities that have a missing parenthesis or that correspond to explanatory excerpts as long texts, which can be easily identified, automatically adjusted, and then processed to retrieve more biomedical abbreviations.
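A minimal Python sketch of the parentheses level count, transcribed step by step from Algorithm 1, might look as follows. The function name `parentheses_level_count` is chosen for this example. One detail is an assumption: as transcribed, the accumulated string still begins with the outer opening parenthesis, so the sketch drops that first character before storing the entity. Unbalanced input is not handled specially.

```python
from typing import List


def parentheses_level_count(text: str) -> List[str]:
    """Collect top-level parenthesized spans, following Algorithm 1."""
    entities: List[str] = []
    level = 0  # current nesting depth (l in Algorithm 1)
    abb = ""   # span being accumulated (abb in Algorithm 1)
    for ch in text:
        if ch == "(":
            level += 1
        if ch == ")":
            level -= 1
        if level != 0:
            abb += ch
        if level == 0 and abb:
            # abb starts with the outer "("; removing it is an
            # assumption about the intended output, not part of Algorithm 1.
            entities.append(abb[1:])
            abb = ""
    return entities


# Illustrative string built from two abbreviations quoted in the letter.
sample = ("corrected cardio-ankle vascular index (CAVI(0)) "
          "and memory alteration test (M@T)")
print(parentheses_level_count(sample))  # ['CAVI(0)', 'M@T']
```

Because the entity is stored only when the depth returns to zero, the inner pair in CAVI(0) stays inside the extracted string instead of cutting it short.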
Thus, using filters to improve the precision of the automatic extraction of parenthetic biomedical abbreviations is not always the best option, particularly because these filters degrade the runtime and recall of the whole process. We therefore invite computer scientists to enhance parenthetic text extraction methods in the coming years by including successful algorithms from the wide field of information retrieval and processing, such as the parentheses level count and the recent applications of machine and deep learning.
FUNDING
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
All coauthors have equally contributed to the development of this research output.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
An exhaustive dataset of biomedical abbreviations was generated by applying the algorithm highlighted in this letter to the titles of PubMed-indexed scholarly publications. This resource was used to identify the examples highlighted in this letter and can be found in the Database of Parenthetic Biomedical Abbreviations at https://dx.doi.org/10.5281/zenodo.4281524.
REFERENCES
- 1. Lu CJ, Payne A, Mork JG. The unified medical language system SPECIALIST lexicon and lexical tools: development and applications. J Am Med Inform Assoc 2020; 27 (10): 1600–5.
- 2. Lu CJ, Tormey D, McCreedy L, et al. Generating a distilled N-gram set: effective lexical multiword building in the SPECIALIST lexicon. Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies 2017; 5: 77–87. doi: 10.5220/0006142000770087
- 3. Bakken S. Innovation is key for advancing the science of biomedical and health informatics and for publishing in JAMIA. J Am Med Inform Assoc 2020; 27 (3): 341–2.
- 4. Okazaki N, Ananiadou S. Building an abbreviation dictionary using a term recognition approach. Bioinformatics 2006; 22 (24): 3089–95.
- 5. Harrison HW. Calculator with algebraic operating system. U.S. Patent No. 4,208,720. 1980.