KR101969126B1

KR101969126B1 - System and method for recommending public data generalization level guaranteeing data anonymity and useful to data analysis

Info

Publication number: KR101969126B1
Application number: KR1020170031290A
Authority: KR
Inventors: 이건명; 강아름
Original assignee: 충북대학교 산학협력단
Priority date: 2017-03-13
Filing date: 2017-03-13
Publication date: 2019-04-16
Also published as: KR20180104473A

Abstract

The present invention relates to a system and method for generating public data that is useful for data analysis while guaranteeing data anonymity. The system includes: a generalization level definition unit that defines the K value, which is the K-anonymity condition for the original data, and the allowable range of generalization levels for each quasi-identifier attribute; a generalization level combination generation unit that generates every possible combination of generalization levels within the defined per-attribute allowable ranges; an anonymity rate and balance rate calculation unit that calculates the anonymity rate and the balance rate for each generated combination; and a generalization level combination recommendation unit that evaluates the value of each combination in view of the calculated balance rate and anonymity rate and provides a predetermined number of combinations in descending order of value. The system thereby prevents the sensitive information in the data from being identified while allowing the published data to retain enough information to be used again for data analysis.

Description

The present invention relates to a system and method for generating public data. More specifically, it relates to a public data generation system and method that recommends, to a party wishing to publish data, public data generalization levels that guarantee data anonymity while remaining useful for data analysis, so that the party can generate public data that is both anonymous and useful for analysis.

With the introduction of new technologies such as the Internet of Things, cloud computing, and big data, the variety and volume of data being processed have grown, and data containing sensitive personal information is increasingly being published and shared. Such data, however, raises the likelihood that an individual can be identified from it, and identification of an individual leads to leakage of that individual's sensitive information and raises the risk that the information will be misused. Accordingly, when data is published, it must be de-identified so that no specific individual can be identified, in order to prevent leakage of sensitive personal information. When data is de-identified, identifiers that can identify an individual must be removed, and quasi-identifiers, which cannot identify an individual as single attributes but can do so in combination, must be anonymized. Anonymizing the quasi-identifiers requires adjusting the generalization level of the quasi-identifier attributes. Raising the generalization level of a quasi-identifier attribute increases the number of records carrying the same information and makes individual records harder to identify.

Methods that satisfy K-anonymity have been studied as a way of anonymizing data. K-anonymity is a privacy protection model that defends against linkage attacks; it reduces the linkability of a database so that individuals cannot be identified in a data set. First, a K value is chosen (K being a coefficient representing the level of anonymity), and the data set is arranged so that at least K records share the same quasi-identifier attribute values, which prevents the records from being easily combined with other information. To this end, part of the data set is modified so that every record has at least K-1 other records that are identical to (indistinguishable from) it.
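The K-anonymity condition described above can be sketched in a few lines of Python. This is only an illustrative check, not the patent's implementation; the function name `is_k_anonymous`, the attribute names, and the sample records are assumptions made for the example.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in `records` appears at least k times."""
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(size >= k for size in groups.values())


# Hypothetical, already generalized records (not taken from the patent).
records = [
    {"area": "130", "age": "20s", "disease": "prostatitis"},
    {"area": "130", "age": "20s", "disease": "hypertension"},
    {"area": "131", "age": "30s", "disease": "hypertension"},
]

print(is_k_anonymous(records, ["area", "age"], 1))  # True
print(is_k_anonymous(records, ["area", "age"], 2))  # False
```

With K set to 2 the check fails because the single record for area "131" has no indistinguishable companion.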

Referring to the table above, suppose a hospital's disease records include the following: Hong Gil-dong, a 28-year-old man living in area 13053 who has prostatitis; Hong Gil-nam, a 21-year-old man living in area 13068 who has prostatitis; Hong Gil-ja, a 29-year-old woman living in area 13068 who has hypertension; and Hong Gap-sun, a 23-year-old woman living in area 13053 who has hypertension. In the original data, before K-anonymity is applied, the 'category', 'area code', 'age', and 'sex' attributes make it possible to identify which individual has which disease. The published data removes the identifier attribute 'category' and generalizes the quasi-identifiers [area code, age, sex] to produce data that satisfies anonymity. To anonymize these four records so that no individual can be singled out, K is set to 4 and the four records are treated as having the same form, namely "a person surnamed Hong in their twenties living in area 130", that is, [130, 2, P]. As a result of applying 4-anonymity, each record in the table has at least three other records with the same information, and all four records carry identical information. K-anonymity, then, is the notion of making at least K records identical so that, however the data is searched, it cannot be re-identified which disease a particular individual has.
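As a rough illustration of this example, the sketch below generalizes the four records so that they collapse into the single form [130, 2, P]. The rules used here (keeping the first three digits of the area code, keeping the tens digit of the age, and replacing sex with "P") are one reading of the text, and the tuple layout is an assumption of the sketch.

```python
def generalize(record):
    """Generalize an (area code, age, sex) tuple as in the four-record example."""
    area_code, age, _sex = record
    return (area_code[:3], age // 10, "P")


# Quasi-identifier values of the four hospital records; the identifier
# (the person's name) has already been removed.
originals = [
    ("13053", 28, "M"),
    ("13068", 21, "M"),
    ("13068", 29, "F"),
    ("13053", 23, "F"),
]

generalized = [generalize(record) for record in originals]
print(generalized[0])         # ('130', 2, 'P')
print(len(set(generalized)))  # 1, so all four records are indistinguishable
```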

As seen above, if K is 1 the data has the same form as the original database and every value can be distinguished, but no anonymization is achieved. Conversely, anonymity increases as K grows, but if K were infinite, nothing in the database could be distinguished. Thus, if satisfying K-anonymity is the only goal, anonymity increases but the amount of information the data carries decreases. Information here means the level of detail of the values in the data. For example, a date-of-birth value of '861215' carries the year, month, and day of birth, whereas the anonymized value '8612**' carries only the year and month, and the day is unknown. In this case the latter carries less information than the former. As the anonymization level rises, fewer distinct values are available to represent each attribute, so more records share the same information and the data inevitably carries less information. Anonymity is preserved, but because the analysis is then performed on data carrying little information, the data may lose value as analysis data. And when the value of the data falls, the accuracy of the analysis is likely to fall as well.
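The date-of-birth example can be reproduced with a small masking helper. This is a minimal sketch; the function name `mask_suffix` and its parameters are assumptions for illustration and do not appear in the patent.

```python
def mask_suffix(value, hidden):
    """Replace the last `hidden` characters of `value` with '*'."""
    if hidden <= 0:
        return value
    hidden = min(hidden, len(value))
    return value[: len(value) - hidden] + "*" * hidden


print(mask_suffix("861215", 0))  # 861215 (year, month, and day)
print(mask_suffix("861215", 2))  # 8612** (year and month only)
print(mask_suffix("861215", 4))  # 86**** (year only)
```

Each additional masked pair of digits removes one level of detail, which is the loss of information the paragraph above describes.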

Existing anonymization methods guarantee the anonymity of data, but they leave it uncertain whether the data retains enough information to be used again for analysis. An anonymization method is therefore needed that preserves the statistical characteristics of the data as far as possible while making individuals unidentifiable: one that satisfies K-anonymity to a reasonable degree while setting and keeping a minimum generalization level for each attribute, thereby preserving the value of the data.

Korean Patent Laid-Open Publication No. 10-2012-0063050 (2012.06.15)
Korean Patent Laid-Open Publication No. 10-2013-0049528 (2013.05.14)

Accordingly, the present invention has been made to solve the above problems of the prior art. An object of the present invention is to provide a public data generation system and method that, when data is published, adjusts the generalization level of the quasi-identifier attributes so that identification of the sensitive information in the data is prevented while the published data retains enough information to be used for data analysis, and that can generate, as public data, data that satisfies data anonymity at the lowest possible generalization level within the range of generalization levels selected by the publisher.

To achieve the above object, the public data generation system of the present invention, which is useful for data analysis while guaranteeing data anonymity, includes: a generalization level definition unit that defines the K value, which is the K-anonymity condition for the original data, and the allowable range of generalization levels for each quasi-identifier attribute; a generalization level combination generation unit that generates every possible combination of generalization levels within the defined per-attribute allowable ranges; an anonymity rate and balance rate calculation unit that calculates the anonymity rate and the balance rate for each generated combination; and a generalization level combination recommendation unit that evaluates the value of each combination in view of the calculated balance rate and anonymity rate and provides a predetermined number of combinations in descending order of value.

Meanwhile, the public data generation method of the present invention, which is useful for data analysis while guaranteeing data anonymity, includes: a generalization level definition step of defining the K value, which is the K-anonymity condition for the original data, and the allowable range of generalization levels for each quasi-identifier attribute; a generalization level combination generation step of generating every possible combination of generalization levels within the defined per-attribute allowable ranges; an anonymity rate and balance rate calculation step of calculating the anonymity rate and the balance rate for each generated combination; and a generalization level combination recommendation step of evaluating the value of each combination in view of the calculated balance rate and anonymity rate and providing a predetermined number of combinations in descending order of value.

As described above, the public data generation system and method according to the present invention, which are useful for data analysis while guaranteeing data anonymity, provide the following effect.

By adjusting the generalization level of the quasi-identifier attributes within the range of generalization levels selected by the publisher, and by providing generalization levels that satisfy data anonymity at the lowest possible level of generalization, the invention prevents identification of the sensitive information in the data while allowing the published data to retain enough information to be used again for data analysis.

Fig. 1 is a block diagram schematically showing the overall configuration of a public data generation system, useful for data analysis while guaranteeing data anonymity, according to one embodiment of the present invention.
Fig. 2a is a table showing exemplary generalization levels for each quasi-identifier attribute according to one embodiment of the present invention.
Fig. 2b is a tree diagram showing the structure of the generalization levels of an exemplary sex attribute according to one embodiment of the present invention.
Fig. 2c is a tree diagram showing the structure of the generalization levels of an exemplary province (sido) attribute according to one embodiment of the present invention.
Fig. 2d is a tree diagram showing the structure of the generalization levels of an exemplary age attribute according to one embodiment of the present invention.
Fig. 2e is a tree diagram showing the structure of the generalization levels of an exemplary income decile attribute according to one embodiment of the present invention.
Fig. 2f is a tree diagram showing the structure of the generalization levels of an exemplary base year attribute according to one embodiment of the present invention.
Fig. 3 is a distinct record table for the original data, generated by the generalization level combination generation unit according to one embodiment of the present invention.
Fig. 4 is a flowchart showing the overall flow of a public data generation method, useful for data analysis while guaranteeing data anonymity, according to one embodiment of the present invention.
Fig. 5 is a flow diagram schematically showing the data at each step of the public data generation method according to one embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms; the embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention pertains is fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

The terms used in the following detailed description are defined in advance as follows.

An identifier is an attribute uniquely assigned so that an individual can be identified directly; examples include a resident registration number, passport number, alien registration number, driver's license number, medical record number, health insurance number, and welfare recipient number. A quasi-identifier is an attribute that cannot identify a record on its own but makes a record identifiable in combination with one or more other attributes; examples include age, sex, postal code, blood type, height, and weight. An attribute value is sensitive information relating to an individual that may allow a specific individual to be recognized when readily combined with other information; examples include age, sex, nationality, address, postal code, military service status, marital status, religion, and hobbies.

As discussed in the background section, anonymization means adjusting data so that published data cannot be identified; in the present invention, anonymization is performed by adjusting the generalization level of the data. The data generalization level refers to the amount of information the data carries: data carrying much information is said to have a low generalization level, and data carrying little information a high generalization level. Data at a low generalization level has lost little information, and its precise values can raise the accuracy of analysis, but if it is published at that level it is vulnerable to leakage because it contains so much information. Conversely, data at a high generalization level has lost much information, so analysis may be less accurate, but because it carries little information it is less exposed to information leakage. In the present invention, generalization levels are defined for the quasi-identifier attributes of the original data after the identifier attributes have been removed.

Hereinafter, the public data generation system of the present invention, which is useful for data analysis while guaranteeing data anonymity, will be described in detail with reference to the accompanying drawings.

Fig. 1 is a block diagram schematically showing the overall configuration of a public data generation system, useful for data analysis while guaranteeing data anonymity, according to one embodiment of the present invention.

As shown in Fig. 1, a public data generation system according to one embodiment of the present invention comprises: a generalization level definition unit (100) that defines the K value, which is the K-anonymity condition for the original data, and the allowable range of generalization levels for each quasi-identifier attribute; a generalization level combination generation unit (200) that, based on the per-attribute allowable ranges defined by the generalization level definition unit (100), generates every possible combination of generalization levels within those ranges; an anonymity rate and balance rate calculation unit (300) that calculates the anonymity rate and the balance rate for each combination generated by the generalization level combination generation unit (200); and a generalization level combination recommendation unit (400) that evaluates the value of each combination in view of the balance rate and anonymity rate calculated by the calculation unit (300) and provides a predetermined number of combinations in descending order of value. A publisher who receives recommended combinations through this system can select, among them, the combination considered most effective for data analysis, generate data conforming to the selected combination, and then produce the records that satisfy the anonymity condition as public data and the records that do not as withheld data.

The generalization level definition unit (100) defines the K value, which is the K-anonymity condition for the original data, and the allowable range of generalization levels for each quasi-identifier attribute. It may receive from the party wishing to publish the data (hereinafter, the publisher) a K value and per-attribute allowable ranges suited to the kind of analysis to be performed on the published data. The per-attribute allowable range is the range of generalization levels the publisher considers acceptable for each attribute. For example, if an attribute has generalization levels 0 through 5 and the publisher wants to allow levels 1 through 3, the publisher enters levels 1 through 3 as the allowable range for that attribute.
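The publisher's input for one attribute could be held in a small validated structure such as the one sketched below. The class name `LevelRange` and its fields are hypothetical; the patent does not specify how the definition unit stores these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelRange:
    """Allowable generalization levels for one quasi-identifier attribute."""

    max_level: int  # highest level in the attribute's hierarchy (level 0 is raw data)
    low: int        # lowest level the publisher allows
    high: int       # highest level the publisher allows

    def __post_init__(self):
        # Reject ranges that fall outside the attribute's hierarchy.
        if not (0 <= self.low <= self.high <= self.max_level):
            raise ValueError("allowable range must lie within 0..max_level")

    def levels(self):
        """Return every level inside the allowable range."""
        return list(range(self.low, self.high + 1))


# The example from the text: levels 0 through 5 exist, 1 through 3 are allowed.
allowed = LevelRange(max_level=5, low=1, high=3)
print(allowed.levels())  # [1, 2, 3]
```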

Specifically, because the present invention defines generalization levels for the quasi-identifier attributes of the original data, as noted above, the generalization levels of the quasi-identifier attributes are as shown in Fig. 2a. Level 0 of each attribute is the original data state in which no generalization has been applied, and the maximum level of each attribute is the most generalized state according to the characteristics of that attribute.

For example, suppose the quasi-identifier attributes are sex, province (sido), age group, income decile, and base year. The generalization level trees for these quasi-identifier attributes are shown in Figs. 2b through 2f.

The generalization levels of the sex attribute form the tree shown in Fig. 2b, with levels 0 and 1. Level 1 consists of a single node in which sex is not distinguished and everyone forms one group, and level 0 consists of two nodes, male and female.

The generalization levels of the province (sido) attribute form the tree shown in Fig. 2c, with levels 0 through 3. Level 3 consists of a single node in which provinces are not distinguished and all of them form one group. Level 2 consists of two nodes, the central region and the southern region. Level 1 consists of seven nodes: the Capital, Gangwon, Chungcheong, Daegyeong, Dongnam, Honam, and Jeju regions. Level 0 consists of 17 nodes: Seoul, Gyeonggi, Incheon, Gangwon, Chungbuk, Sejong, Chungnam, Daejeon, Gyeongbuk, Daegu, Ulsan, Busan, Gyeongnam, Jeonbuk, Jeonnam, Gwangju, and Jeju.

The generalization levels of the age group attribute form the tree shown in Fig. 2d, with levels 0 through 4. The age group attribute groups ages 0 to 99 in 5-year bands and groups ages of 100 and over as '100 or older'. Ages 0 to 4 are represented at the leaf nodes as 0, ages 5 to 9 as 1, and so on, so that ages 0 to 99 are represented by nodes 0 through 19 and node 20 represents ages of 100 and over. Level 4 consists of a single node covering all age groups; level 3 is divided into two nodes, 0~12 and 13~20; and level 2 consists of five nodes, 0~4, 5~8, 9~12, 13~16, and 17~20. Level 1 consists of ten nodes, 0~2, 3~4, 5~6, 7~8, 9~10, 11~12, 13~14, 15~16, 17~18, and 19~20, and level 0 consists of the 21 nodes grouped in 5-year bands described above.
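The age group hierarchy can be written down as a lookup of group boundaries, as in the sketch below. The node ranges are taken from the description above; the names `AGE_LEVEL_BOUNDS`, `age_leaf`, and `generalize_age`, and the choice to return a (first leaf, last leaf) pair, are assumptions of this sketch.

```python
# Inclusive upper bound (as a leaf node index) of each group, per level.
AGE_LEVEL_BOUNDS = {
    1: [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    2: [4, 8, 12, 16, 20],
    3: [12, 20],
    4: [20],
}


def age_leaf(age):
    """Map an age in years to its leaf node: 5-year bands 0..19, 20 for 100+."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return min(age // 5, 20)


def generalize_age(age, level):
    """Return the (first leaf, last leaf) range of the node covering `age`."""
    leaf = age_leaf(age)
    if level == 0:
        return (leaf, leaf)
    lower = 0
    for upper in AGE_LEVEL_BOUNDS[level]:
        if leaf <= upper:
            return (lower, upper)
        lower = upper + 1
    raise ValueError("leaf node outside hierarchy")


for level in range(5):
    print(level, generalize_age(28, level))
# 0 (5, 5)
# 1 (5, 6)
# 2 (5, 8)
# 3 (0, 12)
# 4 (0, 20)
```

A 28-year-old falls in leaf node 5 (ages 25 to 29), and each higher level widens the range of leaf nodes that the record can no longer be told apart from.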

The generalization levels of the income decile attribute form the tree shown in Fig. 2e, with levels 0 through 3. Income is divided into deciles, the commonly used standard, and the nodes are created accordingly. Level 3 consists of a single node in which all income deciles form one group, and level 2 consists of two nodes, 1~6 and 7~10. Level 1 consists of five nodes, 1~2, 3~4, 5~6, 7~8, and 9~10, and level 0 consists of 10 nodes, one per income decile from 1 to 10.

The generalization levels of the base year attribute form the tree shown in Fig. 2f, with levels 0 through 3. The base year ranges from 2006 to 2015. Level 3 consists of a single node grouping all base years, and level 2 consists of two nodes, 2006~2011 and 2012~2015. Level 1 consists of five nodes, 2006~2007, 2008~2009, 2010~2011, 2012~2013, and 2014~2015, and level 0 consists of 10 nodes, one for each year from 2006 to 2015.
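The base year hierarchy can also be stored as explicit intervals per level, which makes it easy to confirm that the levels really nest into a tree. The intervals below come from the description above; the names `YEAR_LEVELS`, `generalize_year`, and `is_nested` are assumptions of this sketch.

```python
# Inclusive (first year, last year) intervals of the nodes at each level.
YEAR_LEVELS = {
    0: [(year, year) for year in range(2006, 2016)],
    1: [(2006, 2007), (2008, 2009), (2010, 2011), (2012, 2013), (2014, 2015)],
    2: [(2006, 2011), (2012, 2015)],
    3: [(2006, 2015)],
}


def generalize_year(year, level):
    """Return the interval of the node that covers `year` at `level`."""
    for low, high in YEAR_LEVELS[level]:
        if low <= year <= high:
            return (low, high)
    raise ValueError("year outside the 2006-2015 range")


def is_nested(levels):
    """Check that each node at level n lies inside one node at level n + 1."""
    for n in range(len(levels) - 1):
        for low, high in levels[n]:
            if not any(pl <= low and high <= ph for pl, ph in levels[n + 1]):
                return False
    return True


print(generalize_year(2010, 1))  # (2010, 2011)
print(generalize_year(2010, 2))  # (2006, 2011)
print(is_nested(YEAR_LEVELS))    # True
```

The same interval layout would fit the income decile attribute, whose levels are described in the same form.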

The generalization level combination generation unit (200) generates, based on the per-attribute allowable ranges defined by the generalization level definition unit (100), every combination of generalization levels that falls within the allowable range of each attribute.

In the example above, box 520 of Fig. 5 shows the generalization level combinations generated for the quasi-identifier attributes sex, province, age group, income decile, and base year according to the per-attribute allowable ranges (box 510 of Fig. 5) defined by the generalization level definition unit (100). Referring to box 520 of Fig. 5, the first column of the combinations holds every level possible within the allowable range defined for the sex attribute, the second column every level possible within the range defined for the province attribute, the third column every level possible within the range defined for the income decile attribute, and so on.
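Generating every combination within the allowable ranges amounts to a Cartesian product of the per-attribute level lists. The sketch below uses `itertools.product` for this; the attribute names and the ranges chosen are hypothetical and are not the values shown in Fig. 5.

```python
from itertools import product

# Hypothetical allowable ranges: attribute -> (lowest level, highest level).
allowed_ranges = {
    "sex": (0, 1),
    "sido": (1, 2),
    "age_group": (1, 3),
}


def level_combinations(ranges):
    """Return every combination of levels, one level per attribute."""
    names = list(ranges)
    level_lists = [range(low, high + 1) for low, high in ranges.values()]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


combos = level_combinations(allowed_ranges)
print(len(combos))  # 12, i.e. 2 * 2 * 3
print(combos[0])    # {'sex': 0, 'sido': 1, 'age_group': 1}
print(combos[-1])   # {'sex': 1, 'sido': 2, 'age_group': 3}
```

The number of combinations is the product of the range sizes, so narrowing any one attribute's allowable range shrinks the search space proportionally.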

The generalization level combination generation unit (200) then generates, from the original data, data conforming to each generated combination, and builds a distinct record table in which records of the generated data that have the same value for every attribute form one set.
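One way to picture this step is to generalize each record according to a combination and count how many records end up with identical values. In the sketch below the two generalizer functions are toy stand-ins rather than the hierarchies of Figs. 2b through 2f, and the names `build_distinct_record_table` and `generalizers` are assumptions.

```python
from collections import Counter

# Toy generalizers: attribute -> function(value, level) -> generalized value.
generalizers = {
    "sex": lambda value, level: value if level == 0 else "*",
    "age": lambda value, level: value if level == 0 else f"{value // 10 * 10}s",
}


def build_distinct_record_table(records, combo):
    """Generalize each record per `combo` and count identical results."""
    table = Counter()
    for record in records:
        key = tuple(generalizers[attr](record[attr], combo[attr]) for attr in combo)
        table[key] += 1
    return table


records = [
    {"sex": "M", "age": 28},
    {"sex": "M", "age": 21},
    {"sex": "F", "age": 29},
    {"sex": "F", "age": 23},
    {"sex": "F", "age": 35},
]

table = build_distinct_record_table(records, {"sex": 1, "age": 1})
print(dict(table))  # {('*', '20s'): 4, ('*', '30s'): 1}
```

Each key is one distinct generalized record and each value is the size of its set, which is the information the later anonymity check needs.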

As an example, box 530 of FIG. 5 shows the data preprocessed according to each generated generalization level combination (box 520 of FIG. 5) for original data (box 500 of FIG. 5) whose quasi-identifier attributes are gender, province, age group, income bracket, and base year.

FIG. 3 is a distinct record table that the generalization level combination generation unit produces for the original data according to an embodiment of the present invention. Records of the original data that share the same value for every attribute are grouped together and stored with the size of the group. Referring to FIG. 3, four records of the original data have identical values, so they are grouped into one set in the distinct record table and the size '4' is stored with it. As described below, when the anonymity rate and balance rate calculation unit 300 performs the anonymity check, it compares the anonymity condition K with the size of each set. If K is 10, the set of size 4 in the distinct record table of FIG. 3 is smaller than 10 and therefore does not satisfy the anonymity condition.
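A distinct record table of this kind can be sketched with a counter keyed by the full tuple of attribute values. The sample records and the helper names below are hypothetical; only the grouping-by-identical-values idea and the comparison against K come from the description.

```python
from collections import Counter


def build_distinct_record_table(records):
    """Group records with identical attribute values and store each group's size."""
    return Counter(tuple(record) for record in records)


def sets_failing_k(table, k):
    """Return the record sets whose size is below the anonymity condition K."""
    return {values: size for values, size in table.items() if size < k}


# Hypothetical generalized records: four identical ones plus a larger group.
records = [("F", "Chungbuk", "30s")] * 4 + [("M", "Seoul", "40s")] * 12
table = build_distinct_record_table(records)
failing = sets_failing_k(table, k=10)
```

Here the four identical records collapse into one set of size 4, which is reported as failing when K is 10, while the set of size 12 passes.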

Based on the data generated by the generalization level combination generation unit 200, the anonymity rate and balance rate calculation unit 300 calculates an anonymity rate and a balance rate for each of the generalization level combinations generated by the generation unit 200.

First, the anonymity rate is the proportion of the entire data that can be released as public data after passing the anonymity check against the K value, the anonymity condition defined by the generalization level definition unit 100. The anonymity rate is calculated for each generalization level combination, and a combination with a higher rate is regarded as more valuable.

The anonymity rate is the ratio of the number of records satisfying the anonymity condition (n') to the total number of records (n), and can be expressed as Equation 1:

$$\text{anonymity rate} = \frac{n'}{n}$$

The anonymity rate takes a value between 0 and 1. For example, when the data is generalized according to a particular combination, the total number of records is 1000, and 800 of them satisfy the anonymity condition K, the anonymity rate of that combination is 800/1000 = 0.8.
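The anonymity rate can be computed directly from the sizes stored in a distinct record table. The sketch below assumes that a record satisfies the condition when its set has at least K members (the usual K-anonymity reading); the table contents are hypothetical and chosen to reproduce the 800/1000 example.

```python
def anonymity_rate(table, k):
    """Ratio of records in sets of size >= k (n') to all records (n)."""
    n = sum(table.values())
    if n == 0:
        return 0.0
    n_satisfying = sum(size for size in table.values() if size >= k)
    return n_satisfying / n


# Hypothetical table: 800 records in two large sets, 200 in forty sets of 5.
table = {"set_a": 500, "set_b": 300}
table.update({f"small_{i}": 5 for i in range(40)})
rate = anonymity_rate(table, k=10)
```

With K = 10, only the two large sets count toward n', giving 800 of 1000 records.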

Second, the balance rate is the degree to which the sizes of the record sets are uniform in the distinct record table (see FIG. 3) that the generalization level combination generation unit 200 builds for each combination. A combination with a high balance rate indicates that the records are evenly distributed rather than concentrated in particular sets.

The balance rate is calculated as the ratio of the number of record sets satisfying the anonymity condition (m') to the total number of record sets (m) of the distinct record table, and can be expressed as Equation 2:

$$\text{balance rate} = \frac{m'}{m}$$

In Equation 2, m is the ideal total number of record sets: given the entire data and the anonymity condition K, it is the total number of record sets that would exist if every record set satisfied K. m' is the number of record sets whose size is greater than K. For example, when the total number of data records is 1000 and the anonymity condition K is 3, m has the value 33. If 20 record sets satisfy anonymity, the balance rate is 20/33 = 0.606.... The balance rate thus indicates how balanced the sizes of the record sets in the distinct record table are, and shows whether the records are concentrated in one set or spread evenly.
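The balance rate can be sketched in the same style. Two points here are assumptions rather than statements from the description: the ideal set count m is read as the integer quotient of the record count by K, and a set is counted in m' when its size is at least K. The hypothetical table below uses 100 records with K = 3, which gives m = 33 and m' = 20 and so reproduces the 20/33 figure.

```python
def balance_rate(table, k):
    """Ratio of record sets of size >= k (m') to the ideal set count m = n // k."""
    n = sum(table.values())
    m_ideal = n // k  # assumption: every ideal set holds exactly k records
    if m_ideal == 0:
        return 0.0
    m_satisfying = sum(1 for size in table.values() if size >= k)
    return m_satisfying / m_ideal


# Hypothetical table: twenty sets of 4 records and twenty singleton sets.
table = {f"big_{i}": 4 for i in range(20)}
table.update({f"single_{i}": 1 for i in range(20)})
rate = balance_rate(table, k=3)
```

A table whose records sit in a few very large sets yields a small m' relative to m and therefore a low balance rate, which matches the intent of detecting concentration.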

As an example, box 540 of FIG. 5 shows the anonymity rate and balance rate calculated for each generalization level combination (box 520 of FIG. 5) on the preprocessed data (box 530 of FIG. 5).

The generalization level combination recommendation unit 400 evaluates the value of each combination in consideration of the balance rate and anonymity rate calculated by the anonymity rate and balance rate calculation unit 300, and provides a predetermined number of combinations in descending order of value. The value of each combination may be provided along with it.

Specifically, from the anonymity rates and balance rates calculated by the anonymity rate and balance rate calculation unit 300, the generalization level combination recommendation unit 400 finds the combinations whose anonymity rate is greater than the anonymity rate of the combination having the maximum balance rate among all combinations, and from among them provides a predetermined number of combinations in order of how close their anonymity rate and balance rate are to the optimal value (1,1).

As an example, if among the per-combination anonymity rates and balance rates (box 540 of FIG. 5) calculated for the generalization level combinations (box 520 of FIG. 5) the anonymity rate at the largest balance rate is 0.8, the combinations with an anonymity rate of 0.8 or more are found (in this example, 02220 (0.525, 0.824), 02112 (0.525, 0.814), 02022 (0.510, 0.863), and 02202 (0.519, 0.803)). From the combinations found, a predetermined number are provided in order of how close their anonymity rate and balance rate are to the optimal value (1,1).
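The recommendation step can be sketched as a filter followed by a sort. Several choices below are assumptions: closeness to (1,1) is measured with Euclidean distance, the filter is strict ("greater than"), and the entry `01110` is a hypothetical maximum-balance combination added so that the threshold comes out at 0.8. The other four entries are the (balance rate, anonymity rate) pairs quoted in the example.

```python
from math import hypot


def recommend(rates, top_n):
    """Return up to top_n combinations ranked by closeness to (1, 1).

    rates maps combination code -> (balance_rate, anonymity_rate).
    """
    # Anonymity rate of the combination with the largest balance rate.
    best_balanced = max(rates, key=lambda code: rates[code][0])
    threshold = rates[best_balanced][1]
    # Keep combinations whose anonymity rate exceeds that threshold.
    candidates = [code for code, (_, anon) in rates.items() if anon > threshold]
    # Rank by Euclidean distance to the optimum (1, 1); smaller is better.
    candidates.sort(key=lambda code: hypot(1 - rates[code][0], 1 - rates[code][1]))
    return candidates[:top_n]


rates = {
    "01110": (0.600, 0.800),  # hypothetical maximum-balance combination
    "02220": (0.525, 0.824),
    "02112": (0.525, 0.814),
    "02022": (0.510, 0.863),
    "02202": (0.519, 0.803),
}
ranking = recommend(rates, top_n=4)
```

Under these assumptions the four quoted combinations survive the filter and are ordered by distance; a different distance measure could change the order.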

In addition, a publisher who receives the recommended combinations can select the one considered most effective for data analysis, generate data according to the selected combination, release the data that satisfies the anonymity condition as public data, and set aside the data that does not satisfy it as held data.
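Splitting the generalized data into public and held portions can be sketched as follows. The function name and the sample records are hypothetical, and the records are assumed to be already generalized according to the selected combination; a record is released when its set has at least K members.

```python
from collections import Counter


def split_public_and_held(generalized_records, k):
    """Separate records into those releasable under K and those held back."""
    set_sizes = Counter(generalized_records)
    public = [r for r in generalized_records if set_sizes[r] >= k]
    held = [r for r in generalized_records if set_sizes[r] < k]
    return public, held


# Hypothetical generalized records (gender, base year node).
generalized_records = [("F", "2006~2007")] * 3 + [("M", "2008~2009")]
public, held = split_public_and_held(generalized_records, k=2)
```

Every input record lands in exactly one of the two lists, so the sizes of the public and held data always add up to the size of the generalized data.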

The public data generation system according to an embodiment of the present invention may further include a database unit 500. The database unit 500 may consist of an original data DB 510 that stores the original data to be published; a generalization level DB 520 that stores the K value and the per-attribute allowable generalization level ranges defined by the generalization level definition unit 100; a generalization level combination DB 530 that stores the combinations generated by the generalization level combination generation unit 200, the data preprocessed according to those combinations, and the distinct record tables built from the preprocessed data; and an anonymity rate and balance rate DB 540 that stores the anonymity rate and balance rate calculated for each combination.

A method of generating public data that guarantees data anonymity and is useful for data analysis, using the system configured as above, is described in detail below.

FIG. 4 is a flowchart showing the overall flow of a public data generation method that guarantees data anonymity and is useful for data analysis according to an embodiment of the present invention.

Referring to FIG. 4, the generalization level definition unit 100 first defines the K value, which is the K-anonymity condition for the original data (box 500 of FIG. 5), and the allowable generalization level range for each quasi-identifier attribute (box 510 of FIG. 5) (S410). Here, the generalization level definition unit 100 may receive from the data publisher a K value and per-attribute allowable ranges suited to the intended analysis of the published data.

Next, based on the per-attribute allowable ranges defined by the generalization level definition unit 100, the generalization level combination generation unit 200 generates all possible generalization level combinations within those ranges (box 520 of FIG. 5) (S420).

The generalization level combination generation unit 200 then generates, from the original data, data satisfying each generated combination (box 530 of FIG. 5), and from that data builds a distinct record table (see FIG. 3) in which records having the same value for every attribute are grouped into one set (S430).

Next, based on the data generated by the generalization level combination generation unit 200, the anonymity rate and balance rate calculation unit 300 calculates the anonymity rate and balance rate (540 of FIG. 5) for each of the generated combinations (S440). The anonymity rate and balance rate are calculated as described above for the anonymity rate and balance rate calculation unit 300.

Finally, the generalization level combination recommendation unit 400 evaluates the value of each combination in consideration of the balance rate and anonymity rate calculated by the anonymity rate and balance rate calculation unit 300, and provides a predetermined number of combinations in descending order of value (box 550 of FIG. 5). Here, the recommendation unit 400 finds the combinations whose anonymity rate is greater than the anonymity rate of the combination having the maximum balance rate among all combinations, and from among them provides a predetermined number of combinations in order of how close their anonymity rate and balance rate are to the optimal value (1,1).

Although the present invention has been described in detail with reference to several embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical spirit.

100: generalization level definition unit
200: generalization level combination generation unit
300: anonymity rate and balance rate calculation unit for each generalization level combination
400: generalization level combination recommendation unit
500: database unit

Claims

A public data generation system that guarantees data anonymity and is useful for data analysis, the system comprising:
a generalization level definition unit that defines a K value, which is the K-anonymity condition for original data, and an allowable generalization level range for each quasi-identifier attribute;
a generalization level combination generation unit that generates all possible generalization level combinations within the defined allowable generalization level range for each attribute;
an anonymity rate and balance rate calculation unit that calculates an anonymity rate and a balance rate for each of the generated generalization level combinations; and
a generalization level combination recommendation unit that evaluates the value of each combination in consideration of the calculated balance rate and anonymity rate, and provides a predetermined number of generalization level combinations in descending order of value,
wherein the generalization level combination generation unit generates, from the original data, data satisfying each of the generated generalization level combinations, and builds a distinct record table in which records having the same value for every attribute in the generated data form one set,
wherein the anonymity rate is the ratio of the data satisfying the K value to the entire original data,
wherein the balance rate represents how uniform the sizes of the record sets in the distinct record table are, and
wherein the generalization level combination recommendation unit provides, from among the combinations whose anonymity rate is greater than the anonymity rate of the combination having the maximum balance rate among all generated combinations, a predetermined number of generalization level combinations in order of how close their anonymity rate and balance rate are to 1.

delete

A public data generation method that guarantees data anonymity and is useful for data analysis, the method comprising:
a generalization level definition step of defining a K value, which is the K-anonymity condition for original data, and an allowable generalization level range for each quasi-identifier attribute;
a generalization level combination generation step of generating all possible generalization level combinations within the defined allowable generalization level range for each attribute;
an anonymity rate and balance rate calculation step of calculating an anonymity rate and a balance rate for each of the generated generalization level combinations; and
a generalization level combination recommendation step of evaluating the value of each combination in consideration of the calculated balance rate and anonymity rate, and providing a predetermined number of generalization level combinations in descending order of value,
wherein the generalization level combination generation step comprises:
generating, from the original data, data satisfying each of the generated generalization level combinations; and
building a distinct record table in which records having the same value for every attribute in the generated data form one set,
wherein the anonymity rate is the ratio of the data satisfying the K value to the entire original data,
wherein the balance rate represents how uniform the sizes of the record sets in the distinct record table are, and
wherein the generalization level combination recommendation step provides, from among the combinations whose anonymity rate is greater than the anonymity rate of the combination having the maximum balance rate among all generated combinations, a predetermined number of generalization level combinations in order of how close their anonymity rate and balance rate are to 1.

delete