An Improved Bi-LSTM-Based Missing Value Imputation Approach for Pregnancy Examination Data
Figure 1. Distribution of physical examination times during 0–70 weeks of pregnancy.
Figure 2. The data processing flow path.
Figure 3. Ratio of data not missing by pregnancy weeks.
Figure 4. Missing data status on features.
Figure 5. The process of feature extraction.
Figure 6. Distribution of physical examination times during pregnancy after preprocessing.
Figure 7. The theory of KNN algorithm.
Figure 8. Unidirectional LSTM unit structure.
Figure 9. Bidirectional LSTM structure.
Figure 10. Prediction effect of different pregnancy weeks.
Abstract
1. Introduction
1.1. Background
1.2. Related Work
1.3. Contributions
2. Materials and Methods
2.1. Data
2.2. Methods
2.3. Traditional Machine Learning
2.3.1. Relationship Analysis of the Features
2.3.2. Outlier Detection
2.3.3. Data Filtering
2.3.4. Feature Extraction
2.3.5. Balancing the Dataset
2.3.6. Processed Data
2.4. Strategy to Fill Missing Values—Bi-LSTM
1. Cubic spline interpolation
2. KNN filling
3. ST-MVL
4. LSTM Model
5. Bidirectional LSTM Model
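Of the baselines listed above, KNN filling is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation: the function name `knn_impute`, the distance definition (root-mean-square difference over the features observed in both records), and the toy records are all assumptions made for illustration.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is the root-mean-square difference over features observed in
    both rows. This convention is an assumption, not the paper's stated choice.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common observed feature: not comparable
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]  # copy so the input is left unchanged
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Neighbours must have feature j observed and be comparable to row i.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if c[0] != math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Hypothetical toy records: (systolic pressure, diastolic pressure, weight).
records = [
    [130, 90, 63],
    [126, 100, 71],
    [128, None, 65],
    [110, 70, 55],
]
completed = knn_impute(records, k=2)
print(completed[2])
```

In this toy example the missing diastolic pressure is filled from the two records closest in systolic pressure and weight. A cell stays empty if no other record has that feature observed, so a fallback such as a column mean would still be needed in practice.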
3. Results and Analysis
3.1. LSTM Model Hyperparameters
3.2. Analysis of Experimental Results
4. Conclusions and Future Work
- Some factors, such as region and age, are not yet considered in data processing.
- The current method of filling missing values is relatively simple; a future research direction is to combine multiple algorithms to handle different features.
- Data preprocessing is currently largely manual, so standardizing the process and building automated ETL (extract-transform-load) tools would save considerable time and effort.
- After prediction, screening suitable drug treatment programs [33] is also important for doctors and patients, and building a complete intelligent medical system would be valuable.
- The medical field holds a very large amount of research data, but these data are often disorganized. Diagnostic data formats differ across medical institutions, which makes the data difficult to use directly for research. Constructing a complete, standardized medical research database would be of great significance for disease prevention, drug development, and other research.
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Routine Urine (UR) | Biochemical D | Routine Blood |
---|---|---|
UR_White Blood Cell(centrifuged) | Biochemical D_Calcium | Blood Routine_Lymphocyte Percentage |
UR_Red Blood Cell(centrifuged) | Biochemical D_Globulin | Blood Routine_Eosinophil Absolute Value |
UR_Epithelial Cell(centrifuged) | Biochemical D_Albumin: Globulin | Blood Routine_Platelet Count |
UR_Average RBC Hemoglobin Amount | Biochemical D_Total Bilirubin | Blood Routine_RBC Distribution Width |
UR_Average Platelet Volume | Biochemical D_Alanine Aminotransferase | Blood Routine_Monocyte Percentage |
UR_Fungus | Biochemical D_Iron | Blood Routine_Red Blood Cell |
UR_Granular Cast | Biochemical D_Direct Bilirubin | Blood Routine_Eosinophil Percentage |
UR_Mucus Strand | Biochemical D_Total Bile Acid | Blood Routine_Monocyte Absolute Value |
UR_Abnormal RBC | Biochemical D_Phosphorus | Blood Routine_Average RBC Hemoglobin Amount |
UR_Red Blood Cell | Biochemical D_Aspartate Aminotransferase | Blood Routine_White Blood Cell |
UR_Hyaline Cast | Biochemical D_Total Protein | Blood Routine_Basophil Percentage |
UR_Urate Crystal | Biochemical D_Glutamic-pyruvic Aminotransferase | Blood Routine_Average RBC Volume |
UR_Sulfa Crystal | Biochemical D_Glutamic-pyruvic:Glutamic-oxalacetic | Blood Routine_RBC Distribution Width-SD |
UR_Crystal | Biochemical D_Glutamic-oxalacetic Aminotransferase | Blood Routine_Large Platelet Ratio |
UR_Epithelial Cell | Biochemical D_Indirect Bilirubin | Blood Routine_Average RBC Hemoglobin Concentration |
UR_RBC Cast | Biochemical D_Total Cholesterol | RBC Distribution Width-CV |
UR_Normal RBC | Biochemical D_Lactic Dehydrogenase | Glycosylated Hemoglobin |
UR_Phosphate Crystal | Biochemical D_Kalium | Blood Routine_Large Platelet Count |
UR_Pyocyte | Biochemical D_Carbon Dioxide Concentration | Blood Routine_Urobilinogen |
UR_Trichomonad | Biochemical D_Triglyceride | Blood Routine_PH |
UR_Inorganic Salt Crystal | Biochemical D_Sodium | Blood Routine_Specific Gravity |
UR_WBC Cast | Biochemical D_Creatine Kinase | Blood Routine_Irregular Antibody Screening(3 cells) |
UR_Cast | Biochemical D_Creatinine | Blood Routine_Fibrinogen |
UR_Waxy Cast | Biochemical D_Uric Acid | Blood Routine_Thrombin Time |
UR_Oxalate Crystal | Biochemical D_Chlorine | Blood Routine_Prothrombin Time |
UR_Average RBC Volume | Biochemical D_γ-glutamyl Transpeptidase | Blood Routine_Activated Partial Thromboplastin Time |
UR_Progesterone | Biochemical D_Alkaline Phosphatase | Blood Routine_PT International Standardized Ratio |
UR_RBC Distribution Width-SD | Biochemical D_Magnesium | Blood Routine_Sugar Shaker |
UR_Hemoglobin | Biochemical D_Aspartate:Alanine | Blood Routine_Platelet Distribution Width |
UR_Lymphocyte Absolute Value | Biochemical D_Creatinine(enzymic method) | Blood Routine_Hematocrit |
UR_Neutrophil Percentage | Biochemical D_Serum Phosphorus | Blood Routine_Basophil Absolute Value |
UR_Hematocrit | Biochemical D_Glycated Albumin Ratio | Blood Routine_Average Platelet Volume |
UR_Lymphocyte Percentage | Biochemical D_PH | Blood Routine_Lymphocyte Absolute Value |
UR_Average RBC Hemoglobin Concentration | Biochemical D_Specific Gravity | |
UR_Intermediate Cell Percentage | Biochemical D_Serum Thyrotropin | |
UR_Intermediate Cell Absolute Value | Biochemical D_Low Density Lipoprotein Cholesterin | |
UR_Large Platelet Ratio | Biochemical D_High Density Lipoprotein Cholesterin | |
UR_Platelet Count | Biochemical D_Serum Free T4 | |
UR_Platelet Distribution Width | Biochemical D_Thyroid Peroxidase Antibody | |
UR_Neutrophil Absolute Value | Biochemical D_Creatine Kinase Isoenzyme | |
| Biochemical D_Lipoprotein(a) | |
| Biochemical D_Apolipoprotein B | |
| Biochemical D_Apolipoprotein A | |
References
- Vest, A.R.; Cho, L.S. Hypertension in pregnancy. Curr. Atheroscler. Rep. 2014, 16, 1–11.
- Riise, H.K.R.; Sulo, G.; Tell, G.S.; Igland, J.; Nygård, O.; Iversen, A.C.; Daltveit, A.K. Association between gestational hypertension and risk of cardiovascular disease among 617,589 Norwegian women. J. Am. Heart Assoc. 2018, 7, e008337.
- Wu, J.; Chang, L.; Yu, G. Effective data decision-making and transmission system based on mobile health for chronic disease management in the elderly. IEEE Syst. J. 2020, 15, 5537–5548.
- Yu, G.; Wu, J. Efficacy prediction based on attribute and multi-source data collaborative for auxiliary medical system in developing countries. Neural Comput. Appl. 2022, 34, 5497–5512.
- Ohkuchi, A.; Hirashima, C.; Takahashi, K.; Suzuki, H.; Matsubara, S. Prediction and prevention of hypertensive disorders of pregnancy. Hypertens. Res. 2017, 40, 5–14.
- Ukah, U.V.; De Silva, D.A.; Payne, B.; Magee, L.A.; Hutcheon, J.A.; Brown, H.; Ansermino, J.M.; Lee, T.; von Dadelszen, P. Prediction of adverse maternal outcomes from pre-eclampsia and other hypertensive disorders of pregnancy: A systematic review. Pregnancy Hypertens. 2018, 11, 115–123.
- Hasija, A.; Balyan, K.; Debnath, E.; Ravi, V.; Kumar, M. Prediction of hypertension in pregnancy in high risk women using maternal factors and serial placental profile in second and third trimester. Placenta 2021, 104, 236–242.
- Kassam, S.A. Robust hypothesis testing and robust time series interpolation and regression. J. Time Ser. Anal. 1982, 3, 185–194.
- Kramer, O. K-nearest neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–23.
- Candes, E.J.; Plan, Y. Matrix completion with noise. Proc. IEEE 2010, 98, 925–936.
- Ma, J.; Cheng, J.C.; Jiang, F.; Chen, W.; Wang, M.; Zhai, C. A bi-directional missing data imputation scheme based on LSTM and transfer learning for building energy data. Energy Build. 2020, 216, 109941.
- Chen, Z.; Xu, H.; Jiang, P.; Yu, S.; Lin, G.; Bychkov, I.; Hmelnov, A.; Ruzhnikov, G.; Zhu, N.; Liu, Z. A transfer Learning-Based LSTM strategy for imputing Large-Scale consecutive missing data and its application in a water quality prediction system. J. Hydrol. 2021, 602, 126573.
- Ma, J.; Cheng, J.C.; Ding, Y.; Lin, C.; Jiang, F.; Wang, M.; Zhai, C. Transfer learning for long-interval consecutive missing values imputation without external features in air pollution time series. Adv. Eng. Inform. 2020, 44, 101092.
- Zhou, Y.; Wang, S.; Wu, T.; Feng, L.; Wu, W.; Luo, J.; Zhang, X.; Yan, N. For-backward LSTM-based missing data reconstruction for time-series Landsat images. GISci. Remote Sens. 2022, 59, 410–430.
- Sowmya, V.; Kayarvizhy, N. An Efficient Missing Data Imputation Model on Numerical Data. In Proceedings of the 2021 2nd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 1–3 October 2021; pp. 1–8.
- Tzoumpas, K.; Estrada, A.; Miraglio, P.; Zambelli, P. A data filling methodology for time series based on CNN and (Bi) LSTM neural networks. arXiv 2022, arXiv:2204.09994.
- Jiao, Y.; Qi, H.; Wu, J. Capsule network assisted electrocardiogram classification model for smart healthcare. Biocybern. Biomed. Eng. 2022, 42, 543–555.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, Y.; Miao, D.; Zhang, H. Neighborhood outlier detection. Expert Syst. Appl. 2010, 37, 8745–8749.
- Jiang, S.Y.; An, Q.B. Clustering-based outlier detection method. In Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Jinan, China, 18–20 October 2008; Volume 2, pp. 429–433.
- Liu, Z.; Pi, D.; Jiang, J. Density-based trajectory outlier detection algorithm. J. Syst. Eng. Electron. 2013, 24, 335–340.
- Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
- Syms, C. Principal Components Analysis. In Encyclopedia of Ecology, 2nd ed.; Fath, B., Ed.; Elsevier: Oxford, UK, 2019; pp. 566–573.
- Omuya, E.O.; Okeyo, G.O.; Kimwele, M.W. Feature selection for classification using principal component analysis and information gain. Expert Syst. Appl. 2021, 174, 114765.
- Li, K.; Zhang, W.; Lu, Q.; Fang, X. An improved SMOTE imbalanced data classification method based on support degree. In Proceedings of the 2014 International Conference on Identification, Information and Knowledge in the Internet of Things, Beijing, China, 17–18 October 2014; pp. 34–38.
- Kalton, G.; Kish, L. Some efficient random imputation methods. Commun. Stat. Theory Methods 1984, 13, 1919–1939.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
- Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67.
- McKinley, S.; Levine, M. Cubic spline interpolation. Coll. Redw. 1998, 45, 1049–1060.
- Yi, X.; Zheng, Y.; Zhang, J.; Li, T. ST-MVL: Filling Missing Values in Geo-Sensory Time Series Data. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017.
- Chang, L.; Wu, J.; Moustafa, N.; Bashir, A.K.; Yu, K. AI-driven synthetic biology for non-small cell lung cancer drug effectiveness-cost analysis in intelligent assisted medical systems. IEEE J. Biomed. Health Inform. 2021, 26, 5055–5066.
Name | Type | Example |
---|---|---|
Pregnogram_Fetal Position | Text | “Cephalic”/“Unclear” |
Pregnogram_Fetal Heart | Integer | 140/150 |
Pregnogram_Urine Protein | Text | “++” |
Pregnogram_Diastolic Pressure | Integer | 90/100 |
Pregnogram_Systolic Pressure | Integer | 130/126 |
Pregnogram_Fundal Height | Integer | 31/18 |
Pregnogram_Abdominal Circumference | Integer | 103/80 |
Pregnogram_Weight | Integer | 63/71 |
Pregnogram_Head-pelvic Relationship | Text | “Floating in”/“Unclear” |
Pregnogram_Edema | Text | “++” |
Feature | Relation Degree |
---|---|
Pregnogram_Weight | 0.179742 |
Pregnogram_Diastolic Pressure | 0.164976 |
Pregnogram_Abdominal Circumference | 0.139653 |
Pregnogram_Systolic Pressure | 0.138667 |
Pregnogram_Fundal Height | 0.110129 |
Pregnogram_Fetal Position | 0.042483 |
Pregnogram_Head-pelvic Relationship | 0.031845 |
Blood Routine_Platelet Count | 0.013081 |
Blood Routine_Lymphocyte Percentage | 0.011491 |
Blood Routine_White Blood Cell | 0.010346 |
Feature | Relation Degree |
---|---|
Biochemical D_Total Protein | 0.002450 |
Biochemical D_Total Bilirubin | 0.002394 |
Biochemical D_Phosphorus | 0.002390 |
Biochemical D_Calcium | 0.002237 |
Biochemical D_Alanine Aminotransferase | 0.002205 |
Biochemical D_Aspartate Aminotransferase | 0.002165 |
Biochemical D_Globulin | 0.001950 |
Biochemical D_Albumin: Globulin | 0.001868 |
Blood Routine_Basophil Absolute Value | 0.001654 |
Pregnogram_Urine Protein | 0.001415 |
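Common Dimension Reduction Methods | Drawbacks |
---|---|
Missing rate ratio | Removing features directly, harming data richness |
Low variance filtering | Removing features directly, harming data richness |
High relation filtering | Removing features directly, harming data richness |
Random forest | Removing features directly, harming data richness |
Reverse feature removal | Consuming too much time |
Forward feature construction | Consuming too much time |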
Data | Original Data | Preprocessed Data |
---|---|---|
Number of examination records | 120,396 | 53,272 |
Number of features | 141 | 36 |
x | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 0.011 | 0.085 | 0.076 | 0.105 | 0.062 | 0.176 | 0.127 | 0.142 | 0.102 | 0.113 |
Optimizer | Adam method |
Loss function | Cross-entropy and L2 regularization = |
Parameter initialization | Set value to zero. |
Dimension of word vectors | 32 |
Dimension of position vectors | 5 |
Batch | 128 |
Adam learning rate |
Filling Methods | Cubic Spline Interpolation | KNN Filling | LSTM Model | ST-MVL | Improved Bidirectional LSTM Model |
---|---|---|---|---|---|
SMAPE (%) | 8.745 | 6.746 | 8.796 | 6.734 | 6.569 |
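The table above reports SMAPE (symmetric mean absolute percentage error) for each filling method. Several SMAPE variants are in use and the exact formula applied here is not shown, so the following is a minimal sketch assuming the common form that divides the absolute error by the mean of the absolute actual and predicted values. The function name `smape` and the sample values are hypothetical.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, returned in percent.

    Assumed form: mean over samples of |p - a| / ((|a| + |p|) / 2) * 100.
    Samples where both values are zero contribute zero error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical held-out values and their imputed estimates.
true_values = [100, 200]
imputed_values = [110, 180]
score = smape(true_values, imputed_values)
print(round(score, 3))
```

To score an imputation method this way, known values are hidden, filled by the method, and compared against the originals; a lower SMAPE means the filled values are closer to the hidden ones.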
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lu, X.; Yuan, L.; Li, R.; Xing, Z.; Yao, N.; Yu, Y. An Improved Bi-LSTM-Based Missing Value Imputation Approach for Pregnancy Examination Data. Algorithms 2023, 16, 12. https://doi.org/10.3390/a16010012