Abstract
Mining real-time streaming data is more challenging than mining static data because the streams are continuous, unstructured, and large. Privacy becomes a concern whenever streaming data includes sensitive information. In recent years, research on the anonymization of static data has progressed significantly; generalization and suppression are the two typical strategies for anonymizing quasi-identifiers. However, the highly dynamic and potentially infinite nature of streaming data makes anonymization a challenging task. To this end, we propose a novel Efficient Approximation and Privacy Preservation Algorithms (EAPPA) framework that performs efficient pre-processing of live streaming data and preserves its privacy with minimum Information Loss (IL) and computational requirements. Because existing privacy preservation solutions for streaming data suffer from redundant data, we first propose an efficient technique for data approximation with data pre-processing. We design a Flajolet-Martin (FM) algorithm for robust and efficient approximation of unique elements in the data stream, combined with a data cleaning mechanism. We feed the periodically approximated and pre-processed streaming data to the anonymization algorithm. Using adaptive clustering, we propose an innovative approach that applies the k-anonymity and l-diversity privacy principles to data streams. The proposed approach scans a stream to detect and reuse clusters that fulfill the k-anonymity and l-diversity criteria, which reduces anonymization time and IL. The experimental results reveal the efficiency of the EAPPA framework compared to state-of-the-art methods.
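To make the two building blocks named in the abstract concrete, the following is a minimal Python sketch, not the authors' EAPPA implementation. It shows a textbook Flajolet-Martin distinct-count estimator and a plain check of the k-anonymity and l-diversity criteria on already generalized records. The names `FMSketch`, `satisfies_k_l`, and `num_hashes`, the use of salted BLAKE2b as the hash family, and the choice to average the bitmap index over several hash functions are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

PHI = 0.77351  # correction factor from Flajolet and Martin's analysis


def _hash64(item, seed):
    """Hypothetical hash family: 64-bit BLAKE2b digest salted with a seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def _rho(x, width=64):
    """Index of the least significant 1 bit of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


class FMSketch:
    """Illustrative Flajolet-Martin sketch with one bitmap per hash function."""

    def __init__(self, num_hashes=64):
        self.bitmaps = [0] * num_hashes

    def add(self, item):
        # Duplicates set the same bits again, so they never change the sketch.
        for seed in range(len(self.bitmaps)):
            self.bitmaps[seed] |= 1 << _rho(_hash64(item, seed))

    def estimate(self):
        if not any(self.bitmaps):
            return 0.0  # nothing has been added yet
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # find the lowest unset bit
                r += 1
            total += r
        mean_r = total / len(self.bitmaps)
        return (2 ** mean_r) / PHI


def satisfies_k_l(records, quasi_identifiers, sensitive, k, l):
    """True if every group of records sharing the same quasi-identifier values
    has at least k records and at least l distinct sensitive values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    return all(
        len(values) >= k and len(set(values)) >= l
        for values in groups.values()
    )
```

In this sketch the memory used by `FMSketch` depends only on `num_hashes`, not on the stream length, which is the property that makes this family of estimators attractive for unbounded streams. The `satisfies_k_l` check assumes the records have already been generalized; how clusters are formed and reused is outside its scope.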
Data availability
We conduct experiments on one real-world dataset: the Adult dataset from the UCI repository, which is available at https://archive.ics.uci.edu/ml/datasets/Adult (2020).
Funding
This Declaration is not applicable.
Author information
Contributions
The research work presented in this paper is part of the Ph.D. research of Research Scholar Rahul A. Patil, carried out under the guidance and supervision of Dr. Pramod D. Patil.
Ethics declarations
Ethical approval
This Declaration is not applicable.
Competing interests
This Declaration is not applicable as there are no competing interests.
This article belongs to the Topical Collection: Special Issue on Privacy and Security in Machine Learning
Guest Editors: Jin Li, Francesco Palmieri and Changyu Dong
About this article
Cite this article
Patil, R.A., Patil, P.D. Efficient approximation and privacy preservation algorithms for real time online evolving data streams. World Wide Web 27, 5 (2024). https://doi.org/10.1007/s11280-024-01244-9
DOI: https://doi.org/10.1007/s11280-024-01244-9