Abstract
In the current information era, data mining is widely applied across many fields to discover knowledge from large volumes of data. It is desirable to design a system in which sensitive data can be exchanged between the data owner and the client during the mining process without leaking information. Privacy-preserving mechanisms are essential enablers for protecting data against increasingly sophisticated cyber-attacks. A suitable approach is to secure the data while mining knowledge, so that neither the data owner nor the client can learn anything from each other's data. Hence, we target preserving data privacy by mining on encrypted data. In data mining, similarity is an important factor: it measures how much alike two data objects are, and can be described as a distance whose dimensions represent features of the objects. In this work, we first propose a privacy-preserving approach for computing the Jaccard similarity in a cloud-assisted data mining scheme. In particular, our design focuses on k-nearest neighbors classification, a conventional data mining algorithm. Exploiting the efficiency of the Jaccard similarity measure, we present a solution that employs cloud services to compute the Jaccard similarity between the owner's data and the client's queries in a secure manner. We also implemented the proposed scheme on a workstation and observed that its computational efficiency is well suited to data mining applications.
References
Björkegren, D., & Grissen, D. (2017). Behavior revealed in mobile phone usage predicts loan repayment. CoRR arXiv:1712.05840.
Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bulletin de la Société Vaudoise des Sciences Naturelles, 44, 223–270.
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11, 37–50.
Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23(5), 1–18.
Güvenir, H. A., Demiröz, G., & Ilter, N. (1998). Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, 13(3), 147–165.
Jiang, L., Wang, D., Cai, Z., & Yan, X. (2007). Survey of improving naive bayes for classification. In Proceedings of the 3rd international conference on advanced data mining and applications, ADMA ’07 (pp. 134–145).
Han, E.-H., Karypis, G., & Kumar, V. (2001). Text categorization using weight adjusted k-nearest neighbor classification. In Proceedings of the 5th Pacific-Asia conference on knowledge discovery and data mining, PAKDD ’01 (pp. 53–65).
Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In Proceedings of the second international conference on knowledge discovery and data mining, KDD’96 (pp. 202–207).
Lindell, Y., & Pinkas, B. (2002). Privacy preserving data mining. Journal of Cryptology, 15, 177–206.
Vaidya, J., & Clifton, C. (2002). Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02 (pp. 639–644).
Agrawal, R., & Srikant, R. (2000). Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00 (pp. 439–450).
Du, W., & Zhan, Z. (2003). Using randomized response techniques for privacy-preserving data mining. In Proceedings of the Ninth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’03 (pp. 505–510).
Zhan, J. Z., Matwin, S., & Chang, L. (2007). Privacy preserving multi party decision tree induction. International Journal of Business Intelligence and Data Mining, 2(2), 197–212.
Dwork, C., & Nissim, K. (2004). Privacy-preserving data mining on vertically partitioned databases. In M. Franklin (Ed.), Advances in cryptology—CRYPTO 2004 (pp. 528–544).
Kantarcioglu, M., & Clifton, C. (2004). Privately computing a distributed k-NN classifier. In Proceedings of the 8th European conference on principles of data mining and knowledge discovery databases, PKDD ’04 (pp. 279–290).
Zhan, J. Z., Chang, L., & Matwin, S. (2005). Privacy preserving k-nearest neighbor classification. International Journal of Network Security, 1(1), 46–51.
Paillier, P. (1999). Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the 17th international conference on the theory and applications of cryptographic techniques, EUROCRYPT ’99 (pp. 223–238).
Qi, Y., & Atallah, M. J. (2008). Efficient privacy-preserving k-nearest neighbor search. In 28th international conference on distributed computing systems (pp. 311–319).
Shaneck, M., Kim, Y., & Kumar, V. (2006). Privacy preserving nearest neighbor search. In Sixth IEEE international conference data mining—Workshops (ICDMW’06) (pp. 541–545).
Rong, H., Wang, H., Liu, J., & Xian, M. (2016). Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access, 4, 9589–9603.
Singh, M. D., Krishna, P. R., & Saxena, A. (2009). A privacy preserving Jaccard similarity function for mining encrypted data. In TENCON 2009—2009 IEEE Region 10 conference (pp. 1–4).
Blundo, C., Cristofaro, E. D., & Gasti, P. (2014). Espresso: Efficient privacy-preserving evaluation of sample set similarity. Journal of Computer Security, 22(3), 355–381.
Wong, K., & Kim, M. H. (2013). Privacy-preserving similarity coefficients for binary data. Computers & Mathematics with Applications, 65(9), 1280–1290.
Fan, J., & Vercauteren, F. (2012). Somewhat practical fully homomorphic encryption. IACR Cryptology ePrint Archive, 2012, 144.
Chen, H., Laine, K., & Player, R. (2017). Simple Encrypted Arithmetic Library—SEAL (v2.1).
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Appendix A: Performance of k-NN Algorithm Using Jaccard Similarity
This appendix describes experiments that use the Jaccard similarity to measure distances among data samples when classifying with the k-NN algorithm. The experiments use three data sets: MNIST, WDBC, and DERM.
- 1.
MNIST [4] is a public data set of handwritten digits from 0 to 9 (10 classes), available at http://yann.lecun.com/exdb/mnist/. The data set is already split into a training set of 60,000 samples and a test set of 10,000 samples. Figure 6 shows example digits from the MNIST data set. Each sample is a \(28\times 28\) bitmap image of a handwritten digit. We flatten each bitmap matrix into a 784-dimensional vector and use it as input for classification. In short, each MNIST sample is a binary vector of 784 attributes: an attribute is 1 at positions corresponding to the digit shape and 0 elsewhere.
- 2.
WDBC [5] is the Breast Cancer Wisconsin Data Set (available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29). It contains 699 clinical samples labeled benign or malignant (two classes). We split the data set into 466 training samples and 233 test samples. Each sample originally has 9 cell-related attributes taking integer values in [1, 10]. We express each sample as a binary vector by mapping an attribute to 1 if its value is greater than 5, and to 0 otherwise.
- 3.
DERM [6] is the Dermatology Data Set (available at https://archive.ics.uci.edu/ml/datasets/Dermatology). It contains 366 real samples of six dermatological problems (six classes). We split the data set into 244 training samples and 122 test samples. Each sample has 34 numeric attributes: 32 take integer values in [0, 3], one takes integer values in [0, 1], and one takes values in [0, 75]. We again express each sample as a binary vector: an attribute maps to 1 if its value is greater than the mean of its range (i.e., 2, 0, and 33, respectively), and to 0 otherwise.
The characteristics of these three data sets are summarized in Table 3.
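The threshold-based binarization described above can be sketched as follows. This is a minimal illustration assuming NumPy arrays; the function names and the attribute ordering used for the DERM thresholds are our own and not taken from the paper:

```python
import numpy as np

def binarize_mnist(images):
    # images: (n, 784) array of grayscale pixel values;
    # nonzero pixels (the digit shape) map to 1, background to 0.
    return (images > 0).astype(np.uint8)

def binarize_wdbc(samples):
    # samples: (n, 9) integer attributes in [1, 10];
    # an attribute greater than 5 maps to 1, otherwise 0.
    return (samples > 5).astype(np.uint8)

def binarize_derm(samples):
    # samples: (n, 34) attributes; thresholds are the range means stated
    # in the text: 2 for the 32 attributes in [0, 3], 0 for the attribute
    # in [0, 1], and 33 for the attribute in [0, 75]. The column order
    # assumed here is illustrative.
    thresholds = np.array([2] * 32 + [0, 33])
    return (samples > thresholds).astype(np.uint8)
```

Each function compares attribute values against a per-attribute threshold, so the same pattern extends to any data set whose attributes have known ranges.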
The classification process follows Algorithm 1 in Sect. 2.1. We choose the values 5, 10, and 15 for the parameter K. The similarity between any two samples is the Jaccard similarity, computed using formula (3). The detailed performance of k-NN on the three data sets is shown in Table 4.
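The classification step can be sketched as follows: for binary vectors, the Jaccard similarity is the size of the intersection of the 1-positions divided by the size of their union, and k-NN takes a majority vote among the K most similar training samples. This is a hypothetical plaintext sketch for illustration, not the paper's secure protocol:

```python
import numpy as np
from collections import Counter

def jaccard(a, b):
    # Jaccard similarity of two binary vectors:
    # |intersection| / |union| of their 1-positions.
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union else 1.0

def knn_classify(train_X, train_y, query, k):
    # Rank training samples by Jaccard similarity to the query
    # and take a majority vote among the k most similar ones.
    sims = [jaccard(x, query) for x in train_X]
    top_k = np.argsort(sims)[-k:]
    votes = Counter(train_y[i] for i in top_k)
    return votes.most_common(1)[0][0]
```

In the proposed scheme this similarity computation is the part carried out over encrypted data with cloud assistance, so that neither party sees the other's vectors in the clear.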
Overall, k-NN performs well on all data sets across the different parameter settings: accuracies range from 86.9% to 96.3%.
Cite this article
Le, T.T.N., Phuong, T.V.X. Privacy Preserving Jaccard Similarity by Cloud-Assisted for Classification. Wireless Pers Commun 112, 1875–1892 (2020). https://doi.org/10.1007/s11277-020-07131-6