Retrieval and Ranking of Combining Ontology and Content Attributes for Scientific Document
Figure 1. Scientific document retrieval and ranking model.
Figure 2. Example of a mathematical expression and its context. This document is from the open source dataset Ntcir-Mathir-Wikipedia-Corpus (http://research.nii.ac.jp/ntcir/permission/ntcir-12/perm-en-MathIR.html, accessed on 1 May 2022).
Figure 3. Scientific document index.
Figure 4. Sentence similarity calculation model.
Figure 5. NDCG@5 and NDCG@10 values. (a) Chinese dataset. (b) English dataset.
Figure 6. Comparison of NDCG@10 values of other methods with those of our method.
Abstract
1. Introduction
1.1. Related Work
1.2. Contributions
2. Materials and Methods
2.1. Establish Scientific Document Indices
2.2. Mathematical Expression Similarity Calculation
2.2.1. Related Theories
- Hesitant fuzzy sets
- Hesitant fuzzy measure
2.2.2. Construct Hesitant Fuzzy Sets of Mathematical Expressions
2.2.3. Mathematical Expression Matching
Algorithm 1 Mathematical expression-matching algorithm
Input: Query expression and Result expression
Output: Mathematical expression similarity
= calculatelenei(, ); // Evaluate the original expression length membership
= calculatelenec(, );
= calculatenumsub(, );
for in // Evaluate the membership of each sub-expression of the query expression
  = calculatesub(, );
  = ;
for in
  = calculatesub(, );
  = ;
for in // Reset sub-expression memberships according to matching relationships
  if ( and )
    ; ;
  else if ( and )
    ; ;
  else if ( and )
    ; ;
  else { break; }
; // Build the hesitant fuzzy set
;
;
return ; // Return the mathematical expression similarity
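The final steps of Algorithm 1 (build a hesitant fuzzy set from attribute memberships, then reduce it to one similarity score) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `length_membership` and `hesitant_similarity`, the ratio-based length membership, and the choice of scoring by a normalized Hamming-style distance to an ideal membership of 1.0 are all assumptions made here, since the original symbols and formulas are not recoverable from this page.

```python
from typing import Dict, List


def length_membership(query_len: int, result_len: int) -> float:
    """Hypothetical length membership: ratio of shorter to longer length, in [0, 1]."""
    if query_len <= 0 or result_len <= 0:
        return 0.0
    return min(query_len, result_len) / max(query_len, result_len)


def hesitant_similarity(hfs: Dict[str, List[float]]) -> float:
    """Score a hesitant fuzzy set given as attribute -> list of memberships.

    Assumption: similarity is 1 minus the mean normalized Hamming-style
    distance between each attribute's memberships and the ideal value 1.0.
    """
    if not hfs:
        return 0.0
    distances = []
    for memberships in hfs.values():
        if not memberships:
            # An attribute with no evidence is treated as maximally distant.
            distances.append(1.0)
            continue
        distances.append(sum(1.0 - m for m in memberships) / len(memberships))
    return 1.0 - sum(distances) / len(distances)


if __name__ == "__main__":
    hfs = {
        "original_length": [length_membership(4, 8)],
        "sub_expressions": [1.0, 0.5],
    }
    print(hesitant_similarity(hfs))
```

The attribute keys in the usage example are placeholders; any set of attributes whose memberships lie in [0, 1] can be scored the same way.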
2.3. Text Similarity Calculation
2.3.1. Related Theories
2.3.2. Sentence Similarity Calculation
- Input layer
- Word embedding layer
- Siamese and BiLSTM feature extraction layer
- Attention layer
- Similarity calculation layer
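The last two layers listed above can be illustrated with a small dependency-free sketch. It is not the model from the paper: the BiLSTM encoder is omitted, and the dot-product attention, the `context` vector (standing in for a learned parameter), and the cosine similarity are assumptions chosen for illustration. The Siamese idea appears only in that both sentences are pooled by the same function with the same `context`.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: Sequence[Vector], context: Vector) -> Vector:
    """Weight each token vector by its dot product with `context`, then sum."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    context = [0.5, 0.5]  # hypothetical shared attention parameter
    sentence_a = [[1.0, 0.0], [0.8, 0.2]]  # stand-ins for encoder outputs
    sentence_b = [[0.9, 0.1], [0.7, 0.3]]
    pooled_a = attention_pool(sentence_a, context)
    pooled_b = attention_pool(sentence_b, context)
    print(cosine_similarity(pooled_a, pooled_b))
```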
2.4. Rank the Retrieval Results
Algorithm 2 Scientific document ranking algorithm
Input: Scientific document retrieval results
Output: Scientific document ranking results
// Define a category dictionary (the key is a category, and the value is a list of scientific documents)
Dictionary<string, List<string>> = new Dictionary<string, List<string>>();
for in // Traverse the retrieval results of scientific documents
{
  if (!.ContainsKey(.category)) // Add the category if it does not exist in the dictionary
    .Add(.category, );
  else // Otherwise, add the scientific document to the existing category
  {
    List<string> Tvalue = [category]; // Store the dictionary value temporarily
    Tvalue.Add(doc);
    [category] = Tvalue;
  }
}
// Rank scientific documents in descending order of category popularity
= Sorting by .Values.Count DESC;
// Sort the scientific documents in each category in descending order of post date and return the results
for (var in )
{
  return Sorting by doc.postdate DESC;
}
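The ranking idea in Algorithm 2 (group results by category, order categories by how many results they contain, then order each group by post date, newest first) can be sketched in runnable form as follows. The `Doc` class, its `title` field, and the function name `rank_documents` are hypothetical; only `category` and `postdate` appear in the pseudocode. Unlike the pseudocode, which returns inside the loop, this sketch concatenates all groups before returning, which is an interpretation rather than something the source states.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Doc:
    title: str       # hypothetical field, used only for display
    category: str
    postdate: date


def rank_documents(results: List[Doc]) -> List[Doc]:
    """Order documents by category popularity, then by post date (newest first)."""
    by_category: Dict[str, List[Doc]] = defaultdict(list)
    for doc in results:
        by_category[doc.category].append(doc)

    # Categories with more retrieved documents come first; ties keep
    # first-seen order because sorted() is stable.
    categories = sorted(by_category, key=lambda c: len(by_category[c]), reverse=True)

    ranked: List[Doc] = []
    for category in categories:
        ranked.extend(sorted(by_category[category], key=lambda d: d.postdate, reverse=True))
    return ranked


if __name__ == "__main__":
    results = [
        Doc("a", "math", date(2020, 1, 1)),
        Doc("b", "physics", date(2021, 1, 1)),
        Doc("c", "math", date(2022, 1, 1)),
    ]
    print([d.title for d in rank_documents(results)])
```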
3. Results and Discussion
3.1. Mathematical Expression Matching Results
3.2. Ranking Results of Scientific Documents
4. Conclusions
- For retrieval of scientific documents by mathematical expressions, we will continue to explore methods for extracting the related text, to strengthen the connection between expressions and their surrounding text.
- We will consider the ontological characteristics of scientific documents from multiple angles and extract more ontological information from documents to make the ranking of scientific documents more reasonable.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
ID / Membership | Original Expression Length | Parsing Expression Length | Sub-Expression | The Number of Sub-Expressions |
---|---|---|---|---|
1 | | | | |
2 | | | | |
k | | | | |
nR | | | | |
ID | Query Expressions | Query Statements |
---|---|---|
1 | | Given n samples data, the sample mean is |
2 | | Two solutions of any quadratic polynomial can be expressed as follows |
3 | | The basic relationship between sines and cosines is called the Pythagorean theorem |
4 | | means that ƒ(x) can be made as close as desired to L by making x close enough but not equal to a |
5 | | The definite integral ƒ(x) over an interval [a, b] can be written as |
6 | | The quadratic polynomial of ƒ(x) can be expressed as follows |
7 | | The relationship between mass and energy in special relativity is as follows |
8 | $O(n^2)$ | The time complexity of the algorithm is |
9 | | G denotes Newton’s gravitational constant, m denotes the mass of the electron, and c denotes the speed of light |
10 | | The kinetic energy formula is expressed as follows |
Dataset Name | MAP_5 | MAP_10 | MAP_15 |
---|---|---|---|
English dataset | 0.831 | 0.815 | 0.765 |
Chinese dataset | 0.823 | 0.802 | 0.712 |
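For readers unfamiliar with the MAP_k columns, the metric can be sketched as below. This is a generic illustration and not the evaluation script used for the table: it assumes binary relevance labels and normalizes by the number of relevant documents found in the top k, whereas other common variants divide by the total number of relevant documents. The source does not say which variant was used, and the function names are hypothetical.

```python
from typing import Sequence


def average_precision_at_k(relevance: Sequence[int], k: int) -> float:
    """AP@k for one ranked list of binary relevance labels (1 = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(runs: Sequence[Sequence[int]], k: int) -> float:
    """MAP@k: mean of AP@k over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision_at_k(r, k) for r in runs) / len(runs)


if __name__ == "__main__":
    # Two made-up queries with relevance labels for their top-5 results.
    runs = [[1, 0, 1, 0, 0], [0, 1, 1, 0, 1]]
    print(mean_average_precision_at_k(runs, 5))
```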
Method | NDCG@5 (English) | NDCG@5 (Chinese) | NDCG@10 (English) | NDCG@10 (Chinese) | MAP_5 (English) | MAP_5 (Chinese) | MAP_10 (English) | MAP_10 (Chinese) |
---|---|---|---|---|---|---|---|---|
Math expressions | 0.830 | 0.851 | 0.707 | 0.754 | 0.823 | 0.796 | 0.703 | 0.685 |
Text | 0.750 | 0.799 | 0.685 | 0.727 | 0.803 | 0.763 | 0.690 | 0.703 |
Content | 0.892 | 0.887 | 0.816 | 0.799 | 0.850 | 0.830 | 0.800 | 0.754 |
Content and ontology | 0.913 | 0.923 | 0.900 | 0.906 | 0.892 | 0.874 | 0.875 | 0.854 |
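The NDCG@k columns can be read against the standard definition, sketched below. This is a generic illustration under stated assumptions: it uses linear gains with a log2 rank discount and computes the ideal ordering from the same list of judged results. Whether the paper used this gain form, or graded versus binary relevance, is not stated on this page, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (rank starts at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG@k divided by the DCG@k of the same gains sorted best-first."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal


if __name__ == "__main__":
    # Made-up graded relevance labels for one ranked result list.
    print(ndcg_at_k([3, 2, 3, 0, 1], 5))
```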
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jiang, X.; Tian, B.; Tian, X. Retrieval and Ranking of Combining Ontology and Content Attributes for Scientific Document. Entropy 2022, 24, 810. https://doi.org/10.3390/e24060810