Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation
<p>Inconsistent data detection and reparation framework.</p> "> Figure 2
<p>Pointer merging processes of the <span class="html-italic">WS</span> attribute.</p> "> Figure 3
<p>The effect of different proportions on detection precision (the (<b>a</b>–<b>c</b>) are all conducted in the instance, <span class="html-italic">I</span>, with 1000 data tuples).</p> "> Figure 4
<p>Time cost experimental results (the experiment is conducted in the instance, <span class="html-italic">I</span>, the (<b>a</b>) with data-tuple = 1000 and the (<b>b</b>) with proportion = 0.03).</p> "> Figure 5
<p>Experimental results of two algorithms in <span class="html-italic">I</span> (the (<b>a</b>–<b>c</b>) are all conducted with data-tuple = 1000 and the (<b>d</b>) with proportion = 0.03).</p> "> Figure 6
<p>Experimental results of two algorithms in HPART instance (the (<b>a</b>–<b>c</b>) are all conducted in HPART instance with 1460 tuples and 81 attributes).</p> ">
Abstract
:1. Introduction
1.1. Inconsistent Data Cleaning Status
1.2. Purpose and Structure of This Paper
- (1)
- We propose a heuristic algorithm for inconsistent data detection and reparation, which can repair inconsistent data violating the CFDPs in data sets under unsupervised circumstances;
- (2)
- For inconsistent data detection, the maximum dependency set is used to detect them. By finding recessive dependencies from original dependency sets, we improve the detection accuracy and apply the algorithm to continuous attributes; and
- (3)
- For inconsistent data reparation, we use unsupervised machine learning to learn the correlation among attributes in data sets, and integrate the minimum cost idea and information theory to repair, which makes the repair results most relevant to the initial values with minimum repair times.
2. Materials and Methods
2.1. Problem Description
staff (id, name, age, excellent_staff, work_seniority, work_place, marry_status, monthly_salary, department)
2.2. Inconsistent Data Cleaning Framework
2.2.1. Inconsistent Data Detection Module
Acquiring Related Dependencies
Acquiring the MDS
Location and Representation of Inconsistent Data
2.2.2. Inconsistent Data Reparation Module
Candidate Repair Data Priority Queue
Attribute Correlation Computing
Attribute Values Reparation
2.3. Inconsistent Data Detection Algorithm
2.3.1. Design of Detection Algorithm
Algorithm 1. The dependency lifting algorithm (DLA) flow. |
Input: data instance I, cfdp Output: IDS |
01. get allattr from I; 02. //loop all attributes 03. for each do 04. start_attr attr; 05. cfdp realted cfdps; 06. //get vio(cfdp) of every cfdp 07. for each cfdp do 08. //normalization of every cfdp by setting pointers. 09. nor_cfdp normalization (cfdp); 10. //reverse right part of nor_cfdp to get vio(cfdp) 11. vio(cfdp) reverse RHS of nor_cfdp; 12. end for; 13. for each viocfdp do 14. if ( && ==start_attr) 15. result union(); 16. else if ( && !=start_attr) 17. result intersection(); 18. end if; 19. end for; 20. MDS MDS + cfdp; 21. if (!MDS.contains(result) 22. MDS MDS + result; 23. end if; 24. end for; 25. //detect inconsistent data by MDS from I 26. ins_data.detect();27. //express inconsistent data by quadruple in formula (2) 28. IDS.express(ins_data); 29. return IDS; |
Algorithm 2. The implementation of the function union. |
Input: viocfdp Output: result of pointer-merging |
01. // is start_attr in viocfdp 02. for each viocfdp do 03. buffer every in viocfdp; 04. end for; 05. if (buffer only L, Not U) 06. choose max; 07. result max+”L” 08. else if (buffer only U, Not L) 09. choose min; 10. result min+”U”; 11. else if (buffer exists L and U) 12. judge the location L and U; 13. do set operation in axis; 14. get result; 15. end if; 16. return result; |
2.3.2. The Convergence and Complexity Analysis of DLA
2.4. Inconsistent Data Reparation Algorithm
2.4.1. Design of Reparation Algorithm
Algorithm 3. The C-Repair algorithm flow. |
Input: data instance I, MDS, IDS Output: repaired result I’ |
01. for IDS do 02. for () IDS do 03. //calculate the count by Formula (4) 04. α ; 05. PQ descending order of α; 06. end for; 07. // β is the data element firstly repaired 08. β first element of PQ; 09. conflict-free data tuples in I; 10. //each attribute r.attr in 11. for r.attr do 12. //calculate the correlation by Formula (9) 13. SU(β.attr, r.attr); 14. end for; 15. //each data tuple in 16. for do 17. //calculate the relative distance by Formula (11) 18. RDis(); 19. //calculate the weighted distance by Formula (10) 20. WDis(); 21. end for; 22. buffer ascending order of WDis; 23. class_tuple choose n(n+1)/2 tuples of buffer; 24. I’ repair I by class_tuple; 25. flag β; 26. IDS recalculate IDS by MDS; 27. IDS IDS – flag; 28. end for; 29. return I’; |
2.4.2. The Convergence and Complexity Analysis of C-Repair
3. Results
3.1. Experimental Configuration
3.1.1. Experimental Environment
3.1.2. Experimental Data Instances
3.1.3. Experimental CFDPs
3.2. Experimental Results of the DLA
3.2.1. Acquiring the MDS
3.2.2. Evaluation Indexes
3.2.3. Analysis of the Experimental Results
3.3. Experimental Results of the C-Repair
3.3.1. Evaluation Indexes
3.3.2. Analysis of the Experimental Results
4. Discussion
- (1)
- (2)
- exploring whether we can reduce the time complexity of two algorithms while keeping the detection and reparation ability, which means they can be applied to big data and improves the ability of DLA algorithm to LHS attribute errors; and
- (3)
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Fan, W. Data Quality: From Theory to Practice. ACM SIGMOD Rec. 2015, 44, 7–18. [Google Scholar] [CrossRef]
- Tu, S.; Huang, M. Scalable Functional Dependencies Discovery from Big Data. In Proceedings of the IEEE Second International Conference on Multimedia Big Data, Taipei, Taiwan, 20–22 April 2016; pp. 426–431. [Google Scholar]
- Zhou, J.; Diao, X.; Cao, J.; Pan, Z. An Optimization Strategy for CFDMiner: An Algorithm of Discovering Constant Conditional Functional Dependencies. IEICE Trans. Inf. Syst. 2016, E99-D, 537–540. [Google Scholar] [CrossRef]
- Ma, S.; Duan, L.; Fan, W.; Hu, C.; Chen, W. Extending Conditional Dependencies with Built-in Predicates. IEEE Trans. Knowl. Data Eng. 2015, 27, 3274–3288. [Google Scholar] [CrossRef]
- Liu, X.-L.; Li, J.-Z. Consistent Estimation of Query Result in Inconsistent Data. Chin. J. Comput. 2015, 38, 1727–1738. [Google Scholar]
- Parisi, F.; Grant., J. On repairing and querying inconsistent probabilistic spatio-temporal databases. Int. J. Approx. Reason. 2017, 84, 41–74. [Google Scholar] [CrossRef]
- Bohannon, P.; Fan, W.; Flaster, M.; Rastogi, R. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14–16 June 2005; pp. 143–154. [Google Scholar]
- Brisaboa, N.R.; Rodríguez, M.A.; Seco, D.; Troncoso, R.A. Rank-based strategies for cleaning inconsistent spatial databases. Int. J. Geogr. Inf. Sci. 2015, 29, 280–304. [Google Scholar] [CrossRef]
- Bai, L.; Shao, Z.; Lin, Z.; Cheng, S. Fixing inconsistencies of fuzzy spatiotemporal XML data. Appl. Intell. 2017, 1–19. [Google Scholar] [CrossRef]
- Liu, B.; C, M.; Zhou, X. Study on Data Repair and Consistency Query Processing. Comput. Sci. 2016, 43, 232–236, 241. [Google Scholar]
- Arieli, O.; Zamansky, A. A Graded Approach to Database Repair by Context-Aware Distance Semantics; Elsevier North-Holland, Inc.: Amsterdam, The Netherlands, 2016. [Google Scholar] [CrossRef]
- Xu, Y.; Li, Z.; Chen, Q.; Zhong, P. Repairing Inconsistent Relational Data Based on Possible Word Model. J. Softw. 2016, 27, 1685–1699. [Google Scholar]
- Liu, B.; Liu, H. Data consistency repair method for enterprise information integration. Comput. Integr. Manuf. Syst. 2013, 10, 2664–2669. [Google Scholar]
- Kim, J.; Jang, G.J.; Lee, M. Investigation of the Efficiency of Unsupervised Learning for Multi-task Classification in Convolutional Neural Network. In Neural Information Processing; Springer International Publishing: New York, NY, USA, 2016; pp. 547–554. ISSN 0302-9743. [Google Scholar]
- Scholtus, S. A generalized Fellegi-Holt paradigm for automatic error localization. Surv. Methodol. 2016, 42, 1–18. [Google Scholar]
- Zhang, C.; Diao, Y. Conditional functional dependency discovery and data repair based on decision tree. In Proceedings of the International Conference on Fuzzy Systems and Knowledge Discovery, Changsha, China, 13–15 August 2016; pp. 864–868. [Google Scholar]
- Cao, J.; Diao, X. Introduction to Data Quality; National Defense Industry Press: Beijing, China, 2017; pp. 69–71. ISBN 9787118114058. [Google Scholar]
- Le, H.T.; Urruty, T.; Gbèhounou, S.; Lecellier, F.; Martinet, J.; Fernandez-Maloigne, C. Improving retrieval framework using information gain models. Signal Image Video Process. 2017, 11, 1–8. [Google Scholar] [CrossRef]
- Ye, M.; Gao, L.; Wu, C.; Wan, C. Informative Gene Selection Method Based on Symmetric Uncertainly and SVM Recursive Feature Elimination. Pattern Recognit. Artif. Intell. 2017, 30, 429–438. [Google Scholar]
- Ahmad, H.A.; Wang, H. An effective weighted rule-based method for entity resolution. Distrib. Parallel Databases 2018, 36, 593–612. [Google Scholar] [CrossRef]
- Martin, D.; Rosete, A.; Alcala-Fdez, J.; Herrera, F. A New Multiobjective Evolutionary Algorithm for Mining a Reduced Set of Interesting Positive and Negative Quantitative Association Rules. IEEE Trans. Evol. Comput. 2014, 18, 54–69. [Google Scholar] [CrossRef]
- Zhang, X.J.; Wang, M.; Meng, X.F. An Accurate Method for Mining top-k Frequent Pattern Under Differential Privacy. J. Comput. Res. Dev. 2014, 104–114. [Google Scholar]
- Pedersen, A.B.; Mikkelsen, E.M.; Croninfenton, D.; Kristensen, N.R.; Pham, T.M.; Pedersen, L.; Petersen, I. Missing data and multiple imputation in clinical epidemiological research. Clin. Epidemiol. 2017, 9, 157–166. [Google Scholar] [CrossRef] [PubMed]
- Diao, Y.; Liu, K.Y.; Meng, X.; Ye, X.; He, K. A Big Data Online Cleaning Algorithm Based on Dynamic Outlier Detection. In Proceedings of the International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Nanjing, China, 12–14 October 2015; pp. 230–234. [Google Scholar]
Attributes | Meanings | Value Types | Abbreviations |
---|---|---|---|
id | staff number | numeric | ID |
name | staff name | text | NM |
age | staff age | numeric | AGE |
excellent_staff | excellent staff status | boolean | ES |
work_seniority | years of work | numeric | WS |
work_place | branch office | text | WP |
marry_status | marry status | boolean | MSs |
monthly_salary | monthly salary | numeric | MSy |
department | work department | text | DM |
ID | NM | AGE | ES | WS | WP | MSs | MSy | DM |
---|---|---|---|---|---|---|---|---|
1 | Kitty | 25 | T | 4 | branch#1 | T | 3825 | security |
2 | Ming | 20 | F | 2 | branch#7 | F | 4718 | manager |
3 | Wong | 18 | T | 0.5 | branch#5 | F | 3264 | personnel |
4 | Lity | 36 | F | 3.5 | branch#1 | F | 7560 | manager |
5 | John | 41 | T | 5 | branch#3 | T | 5300 | communication |
6 | Li Lei | 25 | F | 4 | branch#4 | T | 3750 | sale |
7 | Lucy | 36 | T | 3 | branch#2 | F | 4103 | market |
8 | Xie Feng | 26 | F | 2 | branch#3 | T | 3520 | sale |
9 | Jack | 19 | F | 1.5 | branch#2 | F | 5030 | culture |
10 | Kai | 41 | T | 5 | branch#1 | T | 7263 | manager |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, P.; Dai, C.; Wang, W. Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation. Symmetry 2018, 10, 516. https://doi.org/10.3390/sym10100516
Li P, Dai C, Wang W. Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation. Symmetry. 2018; 10(10):516. https://doi.org/10.3390/sym10100516
Chicago/Turabian StyleLi, Pei, Chaofan Dai, and Wenqian Wang. 2018. "Inconsistent Data Cleaning Based on the Maximum Dependency Set and Attribute Correlation" Symmetry 10, no. 10: 516. https://doi.org/10.3390/sym10100516