Abstract
The rapid growth of online research publications has prompted the research community to extract valuable information from scientific resources by exploring online digital libraries and publishers’ websites. Metadata stored in a machine-readable form enables precise search: semantic queries over a document’s metadata and structural elements can list the most relevant articles. Online search engines and digital libraries, however, offer only keyword-based search over the full text, which returns an excessive number of results. In recent years, the research community has adopted different approaches to extract structural information from research documents. We divide the content of an article into two levels, logical layout and metadata. This strategy gives our technique an advantage over the state of the art (SOTA) when extracting metadata from articles with diverse publication styles. Experimental results show that the proposed approach achieves a significant performance gain of 20.26% to 27.14%.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Comparison with Cermine and PDFX on GOLD-standard
XQuery for Cermine-generated XML files
We used the following XQueries to retrieve the desired metadata from the XML output of the Cermine system.
Journal name: XQuery to find the name of the journal.
    <journaltitle>
      {data($articles/front/journal-meta/journal-title-group/journal-title)}
    </journaltitle>
Title: XQuery to find the title of the article.
    <title>
      {data($articles/front/article-meta/title-group/article-title)}
    </title>
DOI: XQuery to find the DOI of the article.
    <doi>
      {data($articles/front/article-meta/article-id)}
    </doi>
Year: XQuery to find the year of the article.
    <pubyear>
      {data($articles/front/article-meta/pub-date/year)}
    </pubyear>
Volume: XQuery to find the volume of the article.
    <volume>
      {data($articles/front/article-meta/volume)}
    </volume>
Issue: XQuery to find the issue number of the article.
    <issue>
      {data($articles/front/article-meta/issue)}
    </issue>
Pages: XQuery to find the pages of the article.
    <firstpage>
      {data($articles/front/article-meta/fpage)}
    </firstpage>
    <lastpage>
      {data($articles/front/article-meta/lpage)}
    </lastpage>
Keywords: XQuery to find the keywords of the article.
    <keywords>
      {
        for $keyword at $j in
          $articles/front/article-meta/kwd-group/kwd
        return <keyword>{data($keyword)}</keyword>
      }
    </keywords>
Authors: XQuery to find the authors of the article.
    <authors>
      {
        for $authors at $j in
          $articles/front/article-meta/contrib-group/contrib/string-name
        return <author>{data($authors)}</author>
      }
    </authors>
Affiliations: XQuery to find the author affiliations of the article.
    <affiliations>
      {
        for $affiliation at $j in
          $articles/front/article-meta/contrib-group/aff
        return <institution>{data($affiliation/institution)}</institution>
      }
      {
        for $affiliation at $j in
          $articles/front/article-meta/contrib-group/aff
        return <country>{data($affiliation/country)}</country>
      }
    </affiliations>
H1: XQuery to find the level-1 headings of the article.
    <section1>
      {
        for $h1 at $j in $articles/body/sec
        return <h1>{data($h1/title)}</h1>
      }
    </section1>
H2: XQuery to find the level-2 headings of the article.
    <section2>
      {
        for $h2 at $j in $articles/body/sec/sec
        return <h2>{data($h2/title)}</h2>
      }
    </section2>
H3: XQuery to find the level-3 headings of the article.
    <section3>
      {
        for $h3 at $j in $articles/body/sec/sec/sec
        return <h3>{data($h3/title)}</h3>
      }
    </section3>
References: XQuery to find the number of references in the article.
    <refcnt>
      {
        for $ref at $j in $articles/back/ref-list
        return <ref>{count($ref/ref)}</ref>
      }
    </refcnt>
Abstract: XQuery to find the abstract of the article; it returns the number of abstract elements found.
    <abstract>
      {count($articles/front/article-meta/abstract)}
    </abstract>
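For readers without an XQuery processor, the same path lookups can be sketched in Python with the standard library. This is only an illustration, not the authors' tooling: the sample document, the function names `texts` and `extract_cermine`, and all field values are hypothetical, and the XML is a hand-written fragment that merely follows the element layout the Cermine queries above assume.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample that follows the element layout
# the Cermine queries above assume. It is not real Cermine output.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Example Journal</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id>10.0000/example</article-id>
      <title-group><article-title>An Example Title</article-title></title-group>
      <contrib-group>
        <contrib><string-name>A. Author</string-name></contrib>
        <contrib><string-name>B. Author</string-name></contrib>
      </contrib-group>
      <pub-date><year>2020</year></pub-date>
      <kwd-group><kwd>metadata</kwd><kwd>PDF</kwd></kwd-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><sec><title>Background</title></sec></sec>
    <sec><title>Method</title></sec>
  </body>
  <back><ref-list><ref/><ref/><ref/></ref-list></back>
</article>"""


def texts(root, path):
    """Return the text content of every element matching `path`."""
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


def extract_cermine(xml_text):
    """Collect the fields targeted by the queries above into a dict."""
    root = ET.fromstring(xml_text)  # plays the role of $articles
    meta = "front/article-meta/"
    return {
        "journaltitle": texts(root, "front/journal-meta/journal-title-group/journal-title"),
        "title": texts(root, meta + "title-group/article-title"),
        "doi": texts(root, meta + "article-id"),
        "pubyear": texts(root, meta + "pub-date/year"),
        "keywords": texts(root, meta + "kwd-group/kwd"),
        "authors": texts(root, meta + "contrib-group/contrib/string-name"),
        # Unlike the XQuery loops, sections without a title are skipped here.
        "h1": texts(root, "body/sec/title"),
        "h2": texts(root, "body/sec/sec/title"),
        "refcnt": len(root.findall("back/ref-list/ref")),
    }


result = extract_cermine(SAMPLE)
print(result)
```

Each dictionary key mirrors the wrapper element used in the corresponding query, so the output can be compared field by field against a gold standard.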
XQuery for PDFX-generated XML files
We used the following XQueries to retrieve the desired metadata from the XML output of the PDFX system.
Title: XQuery to find the title of the article.
    <title>
      {data($articles/article/front/title-group/article-title)}
    </title>
DOI: XQuery to find the DOI of the article.
    <doi>{data($articles/meta/doi)}</doi>
Abstract: XQuery to find the abstract of the article; it returns the number of abstract elements found.
    <abstract>
      {count($articles/article/front/abstract)}
    </abstract>
Authors: XQuery to find the authors of the article; the query reads the e-mail elements in the front matter.
    <authors>
      {
        for $authors at $j in
          $articles/article/front/region/email
        return <email>{data($authors)}</email>
      }
    </authors>
H1: XQuery to find the level-1 headings of the article.
    <section1>
      {
        for $h1 at $j in $articles/article/body/.//h1
        return <h1>{data($h1)}</h1>
      }
    </section1>
H2: XQuery to find the level-2 headings of the article.
    <section2>
      {
        for $h2 at $j in $articles/article/body/.//h2
        return <h2>{data($h2)}</h2>
      }
    </section2>
H3: XQuery to find the level-3 headings of the article.
    <section3>
      {
        for $h3 at $j in $articles/article/body/.//h3
        return <h3>{data($h3)}</h3>
      }
    </section3>
References: XQuery to find the number of references in the article.
    <ref>
      {count($articles/article/body/section/.//ref-list/ref)}
    </ref>
Table caption: XQuery to find the table captions in the article.
    <tables>
      {
        for $captions at $j in
          $articles/.//region[contains(@class, 'DoCO:TableBox')]
        return <table>{data($captions/caption)}</table>
      }
    </tables>
Figure caption: XQuery to find the figure captions in the article.
    <figures>
      {
        for $captions at $j in
          $articles/.//region[contains(@class, 'DoCO:FigureBox')]
        return <fig>{data($captions/caption)}</fig>
      }
    </figures>
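The attribute-based caption queries and the descendant heading queries for PDFX can be sketched in Python as well. Again, this is a rough equivalent under stated assumptions rather than the authors' code: the sample XML, its `pdfx` root element, and the helper names `headings` and `captions` are hypothetical, chosen only to match the paths used in the PDFX queries above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample shaped after the paths in the
# PDFX queries above. It is not real PDFX output.
SAMPLE = """<pdfx>
  <meta><doi>10.0000/example</doi></meta>
  <article>
    <front>
      <title-group><article-title>An Example Title</article-title></title-group>
    </front>
    <body>
      <section>
        <h1>1 Introduction</h1>
        <region class="DoCO:FigureBox"><caption>Fig. 1 Overview</caption></region>
        <section><h2>1.1 Scope</h2></section>
      </section>
      <section>
        <h1>2 Results</h1>
        <region class="DoCO:TableBox"><caption>Table 1 Scores</caption></region>
      </section>
    </body>
  </article>
</pdfx>"""


def headings(root, tag):
    """Mirror $articles/article/body/.//h1 (or h2, h3): all descendants of body."""
    body = root.find("article/body")
    if body is None:
        return []
    return ["".join(el.itertext()).strip() for el in body.iter(tag)]


def captions(root, box_class):
    """Mirror region[contains(@class, ...)]: substring match on the class attribute."""
    found = []
    for region in root.iter("region"):
        if box_class in region.get("class", ""):
            caption = region.find("caption")
            # A region without a caption yields an empty string,
            # as data() would on an empty sequence.
            found.append("" if caption is None else "".join(caption.itertext()).strip())
    return found


root = ET.fromstring(SAMPLE)  # plays the role of $articles
doi = root.findtext("meta/doi")
h1 = headings(root, "h1")
tables = captions(root, "DoCO:TableBox")
figures = captions(root, "DoCO:FigureBox")
print(doi, h1, tables, figures)
```

The substring test on the `class` attribute is used instead of an equality test because the XQuery relies on `contains`, which also matches a region that carries additional class tokens.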
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Waqas, M., Anjum, N. & Afzal, M.T. A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics. Scientometrics 128, 4349–4382 (2023). https://doi.org/10.1007/s11192-023-04774-7
DOI: https://doi.org/10.1007/s11192-023-04774-7