DWSA: An Intelligent Document Structural Analysis Model for Information Extraction and Data Mining
Figure 1. The framework of Dynamic Weighting Structural Analysis (DWSA). The framework comprises three parts: data pre-processing, feature engineering, and structural classification.

Figure 2. The file format in the data pre-processing step. We modify the Pdfplumber framework for PDF document conversion so that more features can be extracted from the document content, mainly including size, font family, top, and company. The features named top, left, width, and height denote the distance from the page top, the distance from the page left, and the total width and height of the sentence, respectively.

Figure 3. The context feature fusion algorithm. Feature engineering extracts additional context feature information, which is mapped into vectors for fusion.

Figure 4. Structured document output. Document information is extracted for structural comparison. (The insurance documents are in Chinese; the structured output is translated into English for readability. See Figure A3 in the appendix for the Chinese output.)

Figure 5. The result of the algorithm comparison experiment on the insurance data set. Accuracy and Macro F1-score are used to evaluate the algorithms. The proposed DWSA achieves an accuracy of 94.68% (a 3.36% absolute improvement over Adaboost) and a Macro F1-score of 88.29% (a 5.1% absolute improvement over Adaboost).

Figure 6. The result of the algorithm comparison experiment on the public data sets Baike2018qa and Iris.

Figure 7. Comparison of the DWSA and Adaboost algorithms on the insurance data set. The improved model raises the F1-score of the level-1 heading (1) from 71.7% to 82.42%; for the level-3 heading (3), the F1-score increases by 6.27%. The accuracy of the level-2 heading (2) improves by 5.49%, reaching 94.99%. Although the weights of useless content (-1) and body content (0) are relatively reduced, their accuracy still increases by 0.32% and 4%, reaching 92.8% and 96.5%, respectively.

Figure 8. The Macro-average ROC curves of the Adaboost and DWSA algorithms on the insurance data set: (a) Adaboost; (b) DWSA.

Figure 9. The feature importance of the data. F0-F16 represent the features size, count, content, font-family, top, left, width, page, height, company, pre-left, pre-size, part, pre-font, total-part, pre-top, and pre-behind, respectively. The importance of 'page', 'part', and 'total-part' is too low to show in the figure. 'Size' is the font size; 'count' is the word and punctuation count; 'content' is the text content; 'font-family' is the font; 'company' is the insurance company; 'width' and 'height' are the total width and height of the sentence; 'page' is the page number; 'part' is the order of the sentence within its paragraph; 'total-part' is the number of sentences in the paragraph; 'top' and 'left' are the coordinates of the sentence on the page. 'Pre-left', 'pre-size', 'pre-font', and 'pre-top' are the corresponding features of the previous sentence ('pre-left', 'pre-top', and so on are simply the names we give these features), and 'pre-behind' is the previous sentence's distance from the page bottom. These previous-sentence features are incorporated by context feature fusion.

Figure A1. A typical original Chinese insurance document in PDF format.

Figure A2. The file format in the data pre-processing step (in Chinese). As in Figure 2, we modify the Pdfplumber framework so that more features can be extracted from the document content, mainly including size, font family, top, and company. The features named top, left, width, and height encode the relative location of each sentence.

Figure A3. Structured document output. Document information is extracted for structural comparison (Chinese output).
Abstract
1. Introduction
1.1. Related Studies and the Current Contribution
- This paper proposes an intelligent information extraction framework for insurance documents. We use the techniques of data pre-processing, feature engineering, and structural classification to process the documents. The proposed model can provide a convenient technology platform for insurance practitioners or related researchers.
- A dynamic sample weighting algorithm is proposed to address the sample imbalance problem and to improve the structural classification part of the DWSA framework. The comprehensive accuracy reaches 94.68% (a 3.36% absolute improvement), and the Macro F1-score reaches 88.29% (a 5.1% absolute improvement).
1.2. DWSA Framework
2. Approach
2.1. Data Pre-Processing
2.2. Feature Engineering
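As an illustration of the context feature fusion named in the figure captions, the sketch below fuses each sentence's own layout features with the corresponding pre-* features of its previous sentence. The function name, dictionary layout, and all values are hypothetical; the paper only names the features (size, font-family, top, left, height, pre-size, pre-font, pre-left, pre-top, pre-behind).

```python
def fuse_context_features(sentences):
    """Append previous-sentence ('pre-*') features to each sentence's own
    layout features, mirroring the context feature fusion step."""
    fused = []
    prev = None
    for sent in sentences:
        row = dict(sent)  # own features: size, top, left, height, ...
        if prev is None:
            prev = sent  # first sentence: fall back to its own values
        row["pre-size"] = prev["size"]
        row["pre-font"] = prev["font-family"]
        row["pre-left"] = prev["left"]
        row["pre-top"] = prev["top"]
        # 'pre-behind': previous sentence's distance from the page bottom.
        row["pre-behind"] = prev["page-height"] - (prev["top"] + prev["height"])
        fused.append(row)
        prev = sent
    return fused

# Toy example with invented values (units: points).
sentences = [
    {"size": 16, "font-family": "SimHei", "top": 72, "left": 90,
     "height": 18, "page-height": 842},
    {"size": 10, "font-family": "SimSun", "top": 110, "left": 72,
     "height": 12, "page-height": 842},
]
rows = fuse_context_features(sentences)
print(rows[1]["pre-size"], rows[1]["pre-behind"])  # 16 752
```

The fused rows can then be mapped into vectors for the classifier; how the real pipeline encodes categorical fields such as 'font-family' and 'company' is not shown in this excerpt.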
2.3. Dynamic Weighting Algorithm
- An initial base learner is trained on the training set, with every sample given equal weight.
- The sample weights of the training set are adjusted according to the predictive performance of the learner in the previous round: the algorithm increases the weights of misclassified samples so that they receive more attention in the next round of training. A base learner with a low error rate is given a high weight, while one with a high error rate is given a low weight. A new base learner is then trained.
- The second step is repeated until M base learners are obtained; the final result is an ensemble of the M base learners (M being the number of learners).
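The three steps above describe a boosting loop in the AdaBoost family. The sketch below implements one round of the classic multi-class SAMME-style update as an illustration of the mechanism, not the authors' exact dynamic weighting formula: misclassified samples gain weight, and the learner's vote weight alpha grows as its weighted error shrinks.

```python
import math

def update_sample_weights(weights, y_true, y_pred, n_classes):
    """One boosting round: compute the learner's weighted error, its vote
    weight alpha, and the re-normalized sample weights."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = max(min(err, 1 - 1e-10), 1e-10)  # keep log() well-defined
    # SAMME: alpha > 0 as long as the learner beats random guessing.
    alpha = math.log((1 - err) / err) + math.log(n_classes - 1)
    new_w = [w * math.exp(alpha) if t != p else w
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return [w / total for w in new_w], alpha

# Toy round: 5 samples, equal initial weights, one misclassified.
weights = [0.2] * 5
y_true = [1, 0, 2, 1, 0]
y_pred = [1, 0, 2, 0, 0]   # sample 3 is wrong -> its weight grows
weights, alpha = update_sample_weights(weights, y_true, y_pred, n_classes=3)
print(round(weights[3], 3), round(alpha, 3))  # 0.667 2.079
```

Repeating this update for M rounds, each time fitting a new base learner to the reweighted samples and summing the learners' alpha-weighted votes, yields the ensemble described in the third step.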
Algorithm 1: The Dynamic Weighting Algorithm
3. Results and Discussion
3.1. Results of Some Existing Algorithms
3.2. The DWSA Experimental Result
3.3. Experimental Summary
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| Abbreviation | Full Name |
|---|---|
| DWSA | Dynamic Weighting Structural Analysis |
| SVM | Support Vector Machines |
| ROC | Receiver Operating Characteristic |
| AUC | Area Under Curve |
Appendix A
| Class | Content | Number |
|---|---|---|
| Useless(-1) | Useless content in the document, such as footer, special symbols, catalogue, etc. | 2469 |
| Content(0) | The main and useful content of the document | 6704 |
| Level-1 Heading(1) | The level-1 heading in the document | 469 |
| Level-2 Heading(2) | The level-2 heading in the document | 1575 |
| Level-3 Heading(3) | The level-3 heading in the document | 83 |
Comparison | Bonferroni Correction | P-Value |
---|---|---|
SVM-DWSA | - | 0.001 |
Xgboost-TextCNN | - | 0.001 |
LightGBM-DWSA | - | 0.001 |
TextCNN-Adaboost | - | 0.001 |
TextCNN-DWSA | - | 0.001 |
Adaboost-DWSA | - | 0.001 |
LightGBM-TextCNN | - | 0.002 |
Xgboost-DWSA | 0.0033 | 0.003 |
SVM-Xgboost | - | 0.015 |
SVM-Adaboost | - | 0.034 |
SVM-LightGBM | - | 0.087 |
SVM-TextCNN | - | 0.144 |
Xgboost-LightGBM | - | 0.474 |
LightGBM-Adaboost | - | 0.678 |
Xgboost-Adaboost | - | 0.764 |
| Algorithm | Metric | Useless(-1) | Content(0) | Level-1 Heading(1) | Level-2 Heading(2) | Level-3 Heading(3) |
|---|---|---|---|---|---|---|
| SVM | Precision | 0.91 | 0.93 | 0.65 | 0.82 | 0.75 |
| SVM | Recall | 0.88 | 0.95 | 0.62 | 0.89 | 0.52 |
| SVM | F1-score | 0.90 | 0.94 | 0.63 | 0.86 | 0.61 |
| Xgboost | Precision | 0.96 | 0.92 | 0.88 | 0.87 | 0.89 |
| Xgboost | Recall | 0.86 | 0.96 | 0.69 | 0.95 | 0.52 |
| Xgboost | F1-score | 0.91 | 0.94 | 0.77 | 0.91 | 0.66 |
| LightGBM | Precision | 0.94 | 0.90 | 0.97 | 0.95 | 0.86 |
| LightGBM | Recall | 0.90 | 0.98 | 0.58 | 0.91 | 0.47 |
| LightGBM | F1-score | 0.92 | 0.94 | 0.73 | 0.93 | 0.61 |
| TextCNN | Precision | 0.90 | 0.92 | 0.63 | 0.87 | 0.80 |
| TextCNN | Recall | 0.83 | 0.96 | 0.58 | 0.87 | 0.39 |
| TextCNN | F1-score | 0.86 | 0.94 | 0.60 | 0.87 | 0.52 |
| Adaboost | Precision | 0.95 | 0.93 | 0.85 | 0.90 | 0.98 |
| Adaboost | Recall | 0.91 | 0.92 | 0.62 | 0.89 | 0.53 |
| Adaboost | F1-score | 0.92 | 0.92 | 0.72 | 0.89 | 0.69 |
| DWSA | Precision | 0.93 | 0.96 | 0.85 | 0.94 | 0.87 |
| DWSA | Recall | 0.92 | 0.97 | 0.80 | 0.96 | 0.66 |
| DWSA | F1-score | 0.93 | 0.97 | 0.82 | 0.95 | 0.75 |
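As a quick consistency check on the reported numbers, the Macro F1-score is the unweighted mean of the per-class F1 scores. Averaging DWSA's rounded per-class values from the table gives:

```python
# Macro F1 is the unweighted mean of the per-class F1 scores.
dwsa_f1 = {"Useless(-1)": 0.93, "Content(0)": 0.97,
           "Level-1(1)": 0.82, "Level-2(2)": 0.95, "Level-3(3)": 0.75}
macro_f1 = sum(dwsa_f1.values()) / len(dwsa_f1)
print(f"{macro_f1:.3f}")  # 0.884
```

The result, 0.884, agrees with the reported Macro F1-score of 88.29% up to the rounding of the per-class scores in the table.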
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yue, T.; Li, Y.; Hu, Z. DWSA: An Intelligent Document Structural Analysis Model for Information Extraction and Data Mining. Electronics 2021, 10, 2443. https://doi.org/10.3390/electronics10192443