Shti 264 Shti190246
net/publication/335381560
14 authors, including:
All content following this page was uploaded by Lianne Ippel on 10 October 2019.
partitioned data that are prepared in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable) [12]. By describing data using the FAIR principles, the infrastructure becomes agnostic to specific syntactic data structures (e.g. OHDSI, CDISC-ODM or HL7 v2/v3/FHIR), as the applications, executed at the data source, should be able to interpret different types of data structures. To test the feasibility of this infrastructure, we combine data from two independent data providers to investigate how Type 2 Diabetes Mellitus (T2DM) status affects healthcare cost. The first dataset comes from the Maastricht Study¹, an observational prospective population-based cohort study focusing on the etiology of T2DM, and the second comes from the official statistics office in the Netherlands: Statistics Netherlands¹ (Centraal Bureau voor de Statistiek; CBS). We present preliminary results involving simulated data and discuss the challenges of making such an infrastructure scalable and secure, as well as its feasibility.

Methods

In this section, we describe the development of our proposed infrastructure from a scientific, technical, and legal perspective to support the workflow. This is followed by a description of the simulation experiment we used to test the usability of our infrastructure.

Development Workflow

The PHT architecture has previously been used to analyze horizontally partitioned datasets [13–16]. Here, we extend this work to include vertically partitioned data. While several studies discuss exchanging and analyzing vertically partitioned data [17,18], these are largely theoretical and overlook practical challenges, e.g. legal and ethical considerations, incompatible data management standards, scalability of the infrastructure, lack of financial support to sustain such efforts, and the technical requirements of learning from vertically partitioned data. To tackle these challenges, our team has established three interlocking work packages that target: i) the scientific questions in the medical domain; ii) the ethical, legal, and societal issues; and iii) the technical aspect. These packages are highly intertwined to ensure the development of practical solutions.

Scientific Perspective

To develop infrastructure that is useful to scientific researchers, we have identified key research questions that the infrastructure should help answer. Answering these research questions should require the combination of sensitive (non-public) data from multiple providers. To combine data from multiple providers, a substantial set of individuals must be shared by the providers, and at least some attributes of these individuals must be present in both datasets to enable linking of the data records (not necessarily by a specific individual identifier).

ELSI Perspective

The Ethical, Legal, and Societal Issues (ELSI) team deals with two types of challenges: i) privacy concerns that arise from the special nature of personal health data³; and ii) the legal challenges that arise from working with multiple data providers, each with a distinct governance framework. Combining data from multiple parties is a relatively new phenomenon, and is often not foreseen in the legal framework established when the data are collected. Therefore, one of the major challenges has been to facilitate this study whilst adhering to the original legal framework and defined purpose. In doing so, the ELSI team has examined the reach of the original legal basis (i.e. informed consent) and the purpose for which each data provider obtained the personal data, and is further analyzing the legal basis and purpose for which secondary processing can occur. Options that are being considered include, but are not limited to, the route of compatible processing and the route of scientific research in the public interest. Additionally, the data providers themselves impose a number of limitations on accessing, sharing, and linking data. To address this challenge, a legal framework has to be formulated in order to establish collaborations between the data providers, among themselves and with the research team. Constructing this legal framework and finding the proper legal basis for the research team is a valuable contribution from the ELSI team.

Technical Perspective

Following the PHT architecture², we use the concepts of (FAIR data) stations³, rails (infrastructure) and (application) trains. The minimal requirement of a FAIR data station is that it enables the execution of applications, where data providers decide whether to execute a given application. These FAIR data stations are based on Semantic Web technologies such as the Resource Description Framework (RDF) [19], which are used to convert the source data⁴ and make the converted data FAIR.

Application (train) developers (i.e., researchers) can create the application trains using Docker containers [20], which can be thought of as lightweight virtual machines. The Docker container carries on board all software packages required to execute the application. These applications can, for instance, query data available in the data station, perform data cleaning/formatting, and execute machine learning or statistical analysis [15]. Only the results of these (analytical) applications are sent back to the application developers.

To implement the proposed infrastructure, we created three stations. Two FAIR data stations are located at the Maastricht Study and at CBS. A third station was configured as a "Trusted Secure Environment" (TSE), which contains no data of its own but acts as a trusted and independent entity. Additionally, we created two application trains. The first application train extracts the data from the two data stations, pseudonymizes the personal identifiers, encrypts the dataset, and sends the data to the TSE station. The second application train decrypts and analyzes the data at the TSE. For every execution, both application trains are configured with the appropriate encryption and security measures.

Experiment Design

Prior to feeding our infrastructure with real data, we conducted a simulation experiment with two scenarios in which researchers combine data from two independent providers using a TSE station. We monitored the time needed to obtain the analytical results for each scenario. Scenario 1 consists of two providers, A and B, each having the same (small) number of individuals; Scenario 2 also consists of providers A and B, but Provider B has a much larger set of individuals, including all of Provider A's individuals. For these scenarios, we use data from a publicly available dataset which contains attributes that could be interpreted as sex, body mass index (BMI), number of children, smoking status, region, and health insurance reimbursement of participants [21]. Additionally, we generated artificial personal identifiers, including date of birth, zip code, house number, and sex, for linking purposes [22]. In practice, combining multiple datasets might be prone to record-linking errors. We will discuss this in more detail in the Discussion section. The synthetic dataset is available on Figshare⁵. This dataset is vertically split over the two providers: both have the artificial personal identifiers (date of birth, zip code, house number, and sex). Only Provider A has BMI, number of children, and smoking status, while only Provider B has living region and health insurance reimbursement. In Scenario 1, both providers have 1,338 patients. In Scenario 2, Provider A still has 1,338 patients while Provider B hosts 64,400 patients. Since Provider A in the second scenario only hosts a small subset of Provider B's individuals, a single record of Provider A might match several records from Provider B. Even though this scenario is often encountered in practice, few solutions are available to address this linking challenge for vertically partitioned data [23].

For our experiment, we developed application trains using Docker 18.03.1. Pseudonymization, encryption, verification, and record linkage were implemented in Python 2.7. The infrastructure was tested on a PC with a 2.5 GHz processor, 16 GB of RAM, and a 500 GB hard disk.

Result

In this section, we detail the contributions of each of the three work packages. Next, we discuss the outcome of the experiment.

Figures 1 and 2 provide an illustration of the infrastructure. Figure 1 presents an overview of the operational framework for two providers, A and B, and a trusted secure environment, TSE. Figure 2 presents the technical and legal requirements of the FAIR data stations. Researchers request permission to access and process data from the data provider. Once permission is granted, application trains that pseudonymize and encrypt the data are sent to and executed in the data stations. Next, the encrypted data are sent to the TSE, followed by the data analysis application (from the researchers).

In the FAIR data stations (Figure 2), personal identifiers are pseudonymized by one-way hashing and salting techniques. One-way hashing turns data of any format into a fixed-length "fingerprint" that cannot be reversed. A salt, which is a random string, is appended to the data before hashing to reduce the risk that the hashes are maliciously reversed. We used SHA-512, a member of the Secure Hash Algorithm 2 (SHA-2) family, as the one-way hashing function, and the random salts are shared by the two data providers so that personal identifiers are pseudonymized identically at both sites. This results in a unique code per record, which allows the same records to be linked across all data providers. Every time data providers grant researchers permission to process/analyze the data, the personal identifiers are pseudonymized using different salts. The salt needs to be created and agreed upon by all data providers. Additionally, to safeguard secure transfer, processed data are encrypted prior to being sent to the TSE. As with the salt, encryption keys are regenerated every time.

The procedure then continues as follows: when the encrypted data are sent to the TSE, a notification is generated by the data stations to confirm the successful execution and departure of the train. After all encrypted data arrive at the TSE station, the researchers trigger the analysis at the TSE with a set of keys and an application that includes the code for the analysis. There is one private key per data station to decrypt the dataset, and one verification key to test the dataset integrity. The data station can only encrypt using the public key and cannot decrypt. The TSE station maintains the private key used to decrypt the data of this specific data provider. After the data from both providers have been verified and decrypted, they can be linked and merged by the pseudonymized personal identifiers. As the salts used at the data stations are unknown to the TSE, it is not able to reverse or decrypt sensitive data such as personal identifiers. Thus, in addition to pseudonymization and encryption, the privacy of information is further protected as no data provider has direct access to the TSE. After executing the analytical algorithms on the merged dataset, the TSE checks whether the results reveal any personally identifiable information. Only validated results, such as figures and/or tables that do not contain any personally identifiable information, are returned to the researchers. Finally, all (received and created) data in the TSE are destroyed.

Figure 1 - Conceptual overview of the proposed infrastructure. Data access is regulated by the data provider hosting the stations. If access is granted, the data providers encrypt the data and send these to the TSE. The TSE executes the researchers' application and allows aggregated results to be returned to the researcher.

Figure 2 - Overview of data stations and application trains. Within each station, data are prepared, i.e., legal conditions are checked, FAIR principles implemented, personal identifiers pseudonymized, and data encrypted. The application train enters the data station with algorithms and leaves with results or processed data.

Simulation Experiment

We used our proposed infrastructure to analyze synthetic data (discussed in the Methods section) that was vertically partitioned to form two datasets, each with a different data provider. Figure 3 shows one such result: a plot of BMI and health insurance reimbursement over one calendar year. While simple, the simulation experiment provides evidence for the feasibility of using the infrastructure to execute an analysis, in this case the retrieval of a relation between two attributes in separate datasets, in a secure and privacy-preserving manner.

¹ Statistics Netherlands is a Dutch governmental institution that gathers statistical information about the Netherlands: https://www.cbs.nl/en-gb
² PHT architecture: https://bitbucket.org/jvsoest/pytaskmanager.git
³ FAIR stations: http://github.com/maastroclinic/DataFAIRifier
⁴ Convert CSV file to RDF file: https://github.com/sunchang0124/FAIRHealth/
⁵ https://doi.org/10.6084/m9.figshare.7379810.v2
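The pseudonymization and linking steps described in the Result section can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `pid` key, the separator, the helper functions, and the example records are all hypothetical, and it is written for Python 3 rather than the Python 2.7 used in the experiment. It only assumes what the text states: a salt shared by the data providers is appended to the personal identifiers, the result is hashed with SHA-512, and the TSE links records on the resulting codes. The last two records of Provider B share identifiers on purpose, to show the one-to-many matches that Scenario 2 can produce.

```python
import hashlib
import secrets

# Hypothetical field names for the artificial identifiers named in the text.
ID_FIELDS = ("date_of_birth", "zip_code", "house_number", "sex")


def pseudonymize(record, salt):
    """Return the SHA-512 hex digest of the identifiers with the salt appended."""
    joined = "|".join(str(record[f]).strip().lower() for f in ID_FIELDS)
    return hashlib.sha512((joined + salt).encode("utf-8")).hexdigest()


def prepare(records, salt):
    """Run at a data station: drop the identifiers and keep only a pseudonym."""
    prepared = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k not in ID_FIELDS}
        row["pid"] = pseudonymize(rec, salt)
        prepared.append(row)
    return prepared


def link(rows_a, rows_b):
    """Run at the TSE: merge rows on the pseudonym and report ambiguous matches."""
    index_b = {}
    for row in rows_b:
        index_b.setdefault(row["pid"], []).append(row)
    merged, ambiguous = [], []
    for row in rows_a:
        matches = index_b.get(row["pid"], [])
        if len(matches) == 1:
            merged.append({**row, **matches[0]})
        elif len(matches) > 1:
            ambiguous.append(row["pid"])  # one record of A matches several of B
    return merged, ambiguous


# A fresh salt per approved request, known to the data providers but not to the TSE.
salt = secrets.token_hex(16)

# Made-up example records, for illustration only.
provider_a = [
    {"date_of_birth": "1980-01-02", "zip_code": "6211AB", "house_number": 7,
     "sex": "F", "bmi": 27.9, "smoker": "no"},
    {"date_of_birth": "1975-05-30", "zip_code": "6229ER", "house_number": 40,
     "sex": "M", "bmi": 33.0, "smoker": "yes"},
]
provider_b = [
    {"date_of_birth": "1980-01-02", "zip_code": "6211AB", "house_number": 7,
     "sex": "f", "region": "southwest", "reimbursement": 1200.0},
    {"date_of_birth": "1975-05-30", "zip_code": "6229ER", "house_number": 40,
     "sex": "M", "region": "northeast", "reimbursement": 850.0},
    {"date_of_birth": "1975-05-30", "zip_code": "6229ER", "house_number": 40,
     "sex": "M", "region": "northeast", "reimbursement": 430.0},
]

merged, ambiguous = link(prepare(provider_a, salt), prepare(provider_b, salt))
print(len(merged), "linked record(s);", len(ambiguous), "ambiguous pseudonym(s)")
```

Normalizing the identifiers (trimming and lowercasing) before hashing is a design choice of this sketch: because a hash changes completely when a single character differs, both stations must format identifiers identically for exact matching to work.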
376 C. Sun et al. / A Privacy-Preserving Infrastructure for Analyzing Personal Health Data in a Vertically Partitioned Scenario
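The Result section mentions a verification key that the TSE uses to test dataset integrity before decryption, but does not specify the mechanism. As a stand-in, the sketch below uses a keyed hash (HMAC with SHA-512) from the Python standard library; a digital signature would fit the description equally well. The function names and the placeholder payload are hypothetical, and the payload merely represents the encrypted dataset produced at a data station.

```python
import hashlib
import hmac
import secrets


def make_tag(payload: bytes, verification_key: bytes) -> str:
    """Run at the data station: compute an integrity tag over the encrypted payload."""
    return hmac.new(verification_key, payload, hashlib.sha512).hexdigest()


def verify(payload: bytes, tag: str, verification_key: bytes) -> bool:
    """Run at the TSE: recompute the tag and compare it in constant time."""
    expected = make_tag(payload, verification_key)
    return hmac.compare_digest(expected, tag)


# Assumption: the verification key is regenerated for every execution,
# like the salts and encryption keys described in the text.
verification_key = secrets.token_bytes(32)

# Placeholder bytes standing in for the encrypted dataset.
encrypted_payload = b"placeholder for the encrypted dataset"
tag = make_tag(encrypted_payload, verification_key)

intact = verify(encrypted_payload, tag, verification_key)
tampered = verify(encrypted_payload + b"x", tag, verification_key)
print("intact payload accepted:", intact)
print("tampered payload accepted:", tampered)
```

In a flow like the one described, the TSE would decrypt and link a dataset only after this check succeeds, and would reject any payload whose tag does not match.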