DOI: 10.1145/3561801.3561814
research-article

A Data Cleaning Method for Industrial Data Flow Based on Multistage Combinational Optimization of Rule Set

Published: 10 October 2022

Abstract

As big data has grown, data quality has become an increasing concern, and improving it is now an active research topic. In this paper, we propose a data cleaning method for industrial data flow based on multistage combinational optimization of a rule set. The method selects a suitable cleaning algorithm according to the characteristics of the data, evaluates the data, and updates the cleaning rules. In the first step, feature detection is carried out on the data, and high-quality data is selected as training samples and matched with the optimal data cleaning algorithm. In the second step, the model uses a random forest algorithm to learn the relationship between data features and data cleaning algorithms, and constructs multi-level filtering rules. In the third step, the data is cleaned iteratively so that the rules are updated automatically. The resulting model can then clean data automatically with a good cleaning effect. The results show that the method cleans real industrial data sets automatically and that its cleaning accuracy can reach 99%. The method effectively addresses the problem of automatic data cleaning and can be used in real industrial data systems.
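The first two steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two window features, the two cleaners (`fill_missing`, `clip_outliers`), the training samples, and every function name are hypothetical, and a small bagged ensemble of one-split decision stumps stands in for a full random forest so that the example needs only the standard library.

```python
import random
from collections import Counter
from statistics import median

def robust_spread(values):
    """Median and median absolute deviation (MAD); MAD falls back to 1.0 if zero."""
    med = median(values)
    return med, (median(abs(v - med) for v in values) or 1.0)

def window_features(window):
    """Step 1 (assumed features): share of missing values and share of outliers."""
    present = [v for v in window if v is not None]
    med, mad = robust_spread(present)
    outliers = sum(abs(v - med) > 5 * mad for v in present)
    return [1 - len(present) / len(window), outliers / len(present)]

def fill_missing(window):
    """Hypothetical cleaner: forward-fill gaps, seeding with the first real value."""
    last = next(v for v in window if v is not None)
    cleaned = []
    for v in window:
        last = last if v is None else v
        cleaned.append(last)
    return cleaned

def clip_outliers(window):
    """Hypothetical cleaner: replace values far from the median with the median."""
    med, mad = robust_spread([v for v in window if v is not None])
    return [med if v is not None and abs(v - med) > 5 * mad else v for v in window]

CLEANERS = {"fill_missing": fill_missing, "clip_outliers": clip_outliers}

def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty or pure list)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values()) if n else 0.0

def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(samples, rng):
    """Fit a one-split tree on one randomly chosen feature (lowest weighted Gini)."""
    f = rng.randrange(len(samples[0][0]))
    best = None
    for t in sorted({x[f] for x, _ in samples}):
        left = [y for x, y in samples if x[f] <= t]
        right = [y for x, y in samples if x[f] > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best is None or score < best[0]:
            best = (score, t, majority(left), majority(right or left))
    return (f,) + best[1:]

def fit_forest(samples, n_trees=25, seed=0):
    """Step 2 (stand-in for a random forest): bagged stumps on bootstrap samples."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(samples) for _ in samples], rng)
            for _ in range(n_trees)]

def choose_cleaner(forest, features):
    """Majority vote of all stumps on which cleaner suits these features."""
    return majority([left_label if features[f] <= t else right_label
                     for f, t, left_label, right_label in forest])

# Hypothetical training samples: (features, best cleaner found for that window).
TRAINING = [([0.20, 0.0], "fill_missing"), ([0.30, 0.0], "fill_missing"),
            ([0.40, 0.0], "fill_missing"), ([0.25, 0.0], "fill_missing"),
            ([0.0, 0.10], "clip_outliers"), ([0.0, 0.20], "clip_outliers"),
            ([0.0, 0.15], "clip_outliers"), ([0.0, 0.30], "clip_outliers")]

forest = fit_forest(TRAINING)
gappy = [10.0, None, 10.2, None, 10.1, 10.3, None, 10.2, 10.0, 10.1]
spiky = [10.0, 10.1, 55.0, 10.2, 10.1, 10.0, 10.2, 10.1, -30.0, 10.1]
for window in (gappy, spiky):
    name = choose_cleaner(forest, window_features(window))
    print(name, CLEANERS[name](window))
```

In this sketch the learned mapping from features to cleaner plays the role of the rule set. The third step of the abstract would correspond to appending newly cleaned, validated windows to `TRAINING` and refitting, which is omitted here because the abstract does not describe how the update is carried out.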

Published In

BDIOT '22: Proceedings of the 2022 5th International Conference on Big Data and Internet of Things
August 2022
95 pages
ISBN:9781450390361
DOI:10.1145/3561801

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  • Data cleaning
  • Big data
  • Industrial data flow
  • Cleaning model
  • Multi-dimensional

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

BDIOT 2022

Acceptance Rates

Overall Acceptance Rate 75 of 136 submissions, 55%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 58
  • Downloads (Last 12 months): 9
  • Downloads (Last 6 weeks): 1

Reflects downloads up to 14 Nov 2024
