Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1835804.1835869acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Discovering frequent patterns in sensitive data

Published: 25 July 2010 Publication History

Abstract

Discovering frequent patterns from data is a popular exploratory technique in datamining. However, if the data are sensitive (e.g., patient health records, user behavior records) releasing information about significant patterns or trends carries significant risk to privacy. This paper shows how one can accurately discover and release the most significant patterns along with their frequencies in a data set containing sensitive information, while providing rigorous guarantees of privacy for the individuals whose information is stored there.
We present two efficient algorithms for discovering the k most frequent patterns in a data set of sensitive records. Our algorithms satisfy differential privacy, a recently introduced definition that provides meaningful privacy guarantees in the presence of arbitrary external information. Differentially private algorithms require a degree of uncertainty in their output to preserve privacy. Our algorithms handle this by returning 'noisy' lists of patterns that are close to the actual list of k most frequent patterns in the data. We define a new notion of utility that quantifies the output accuracy of private top-k pattern mining algorithms. In typical data sets, our utility criterion implies low false positive and false negative rates in the reported lists. We prove that our methods meet the new utility criterion; we also demonstrate the performance of our algorithms through extensive experiments on the transaction data sets from the FIMI repository. While the paper focuses on frequent pattern mining, the techniques developed here are relevant whenever the data mining output is a list of elements ordered according to an appropriately 'robust' measure of interest.

Supplementary Material

JPG File (kdd2010_thakurta_dfp_01.jpg)
MOV File (kdd2010_thakurta_dfp_01.mov)

References

[1]
Apriori implementation of Ferenc Bodon. http://www.cs.bme.hu/~bodon/en/apriori/.
[2]
Frequent itemset mining implementations repository. http://fimi.helsinki.fi.
[3]
C. Aggarwal, C. C. Aggarwal, and P. S. Yu. A condensation approach to privacy preserving data mining. In Proceedings of the Ninth International Conference on Extending Database Technology (EDBT), pages 183--199, 2004.
[4]
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 207--216, May 1993.
[5]
R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering, Taipei, Taiwan, Mar. 1995. IEEE Computer Society, Washington, DC, USA.
[6]
S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In ICDE, pages 193--204, 2005.
[7]
M. Barbaro and T. Zeller. A face is exposed for aol searcher no. 4417749. The New York Times, Aug. 2006.
[8]
R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta. Discovering Frequent Patterns in Sensitive Data . Technical Report NAS-TR-0129-2010, Network and Security Research Center, Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, Apr. 2010. http://www.cse.psu.edu/~asmith/fim/.
[9]
A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pages 609--618, 2008.
[10]
C. Dwork. Differential privacy. In ICALP, LNCS, pages 1--12, 2006.
[11]
C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265--284, 2006.
[12]
A. V. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, pages 211--222, 2003.
[13]
S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, pages 265--273, 2008.
[14]
N. G.N., B. A., H. J., S. K., and E. I.R. Temporal pattern discovery for trends and transient effects: Its application to patient records. In Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining SIGKDD 2008, pages 963--971, 2008.
[15]
M. Götz, A. Machanavajjhala, G. Wang, X. Xiao, and J. Gehrke. Privacy in search logs. CoRR, abs/0904.0682, 2009.
[16]
J. Han and M. Kamber. Data mining: Concepts and techniques. Morgan Kaufmann Publishers, San Fransisco, CA, USA, 2001.
[17]
D. Hand, H. Mannila, and P. Smyth. Principles of data mining. MIT Press, Cambridge, MA, USA, 2001.
[18]
V. Hristidis, editor. Information Discovery on Electronic Health Records. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Boca Raton, FL, USA, 2009.
[19]
A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In WWW, pages 171--180, 2009.
[20]
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. 1(3):259--289, 1997.
[21]
Y. Matias, J. S. Vitter, and W.-C. Ni. Dynamic generation of discrete random variates. Theory Comput. Syst., 36(4):329--358, 2003.
[22]
F. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In SIGMOD Conference, pages 19--30, 2009.
[23]
F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94--103, 2007.
[24]
T. Mielikäinen. On inverse frequent set mining. In 2nd Workshop on Privacy Preserving Data Mining (PPDM 2003), pages 18--23. IEEE Computer Society, 2003.
[25]
L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557--570, 2002.
[26]
T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explorations, 5:59--68, 2003.

Cited By

View all
  • (2024)Privacy-Enhanced Frequent Sequence Mining and Retrieval for Personalized Behavior PredictionIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.339192819(4957-4969)Online publication date: 2024
  • (2024)Privacy-Preserving Frank-Wolfe on Shuffle ModelActa Mathematicae Applicatae Sinica, English Series10.1007/s10255-024-1095-640:4(887-907)Online publication date: 1-Jun-2024
  • (2023)Scaling up differentially private LASSO regularized logistic regression via faster frank-wolfe iterationsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667700(36349-36363)Online publication date: 10-Dec-2023
  • Show More Cited By

Index Terms

  1. Discovering frequent patterns in sensitive data

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    KDD '10: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
    July 2010
    1240 pages
    ISBN:9781450300551
    DOI:10.1145/1835804
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2010

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. differential privacy
    2. exponential mechanism
    3. frequent itemsets
    4. frequent patterns
    5. privacy

    Qualifiers

    • Research-article

    Conference

    KDD '10
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)63
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 17 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Privacy-Enhanced Frequent Sequence Mining and Retrieval for Personalized Behavior PredictionIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.339192819(4957-4969)Online publication date: 2024
    • (2024)Privacy-Preserving Frank-Wolfe on Shuffle ModelActa Mathematicae Applicatae Sinica, English Series10.1007/s10255-024-1095-640:4(887-907)Online publication date: 1-Jun-2024
    • (2023)Scaling up differentially private LASSO regularized logistic regression via faster frank-wolfe iterationsProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667700(36349-36363)Online publication date: 10-Dec-2023
    • (2023)Tight data access bounds for private top-k selectionProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3619976(37635-37655)Online publication date: 23-Jul-2023
    • (2023)Answering Private Linear Queries Adaptively Using the Common MechanismProceedings of the VLDB Endowment10.14778/3594512.359451916:8(1883-1896)Online publication date: 1-Apr-2023
    • (2023)Differential Privacy Frequent Closed Itemset Mining over Data Stream2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)10.1109/TrustCom60117.2023.00124(865-872)Online publication date: 1-Nov-2023
    • (2023)Non-Interactive Privacy-Preserving Frequent Itemset Mining Over Encrypted Cloud DataIEEE Transactions on Cloud Computing10.1109/TCC.2023.329137811:4(3452-3468)Online publication date: Oct-2023
    • (2023)A Joint Permute-and-Flip and Its Enhancement for Large-Scale Genomic Statistical Analysis2023 IEEE International Conference on Data Mining Workshops (ICDMW)10.1109/ICDMW60847.2023.00034(217-226)Online publication date: 4-Dec-2023
    • (2023)Differentially Private Two-Party Top-$k$ Frequent Item Mining2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS57875.2023.00045(166-177)Online publication date: Jul-2023
    • (2023)Enhancing Query Performance Using Simultaneous Execution and Vertical Query Splitting2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT)10.1109/ICCCNT56998.2023.10307920(1-4)Online publication date: 6-Jul-2023
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media