A Privacy-Preserving Infrastructure for Analyzing Personal Health Data in a Vertically Partitioned Scenario

Article · August 2019

DOI: 10.3233/SHTI190246

MEDINFO 2019: Health and Wellbeing e-Networks for All 373
L. Ohno-Machado and B. Séroussi (Eds.)
© 2019 International Medical Informatics Association (IMIA) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI190246

A Privacy-Preserving Infrastructure for Analyzing Personal Health Data in a Vertically Partitioned Scenario

Chang Sun (a), Lianne Ippel (a), Johan van Soest (b), Birgit Wouters (a), Alexander Malic (a), Onaopepo Adekunle (a), Bob van den Berg (c), Ole Mussmann (c), Annemarie Koster (d), Carla van der Kallen (e), Claudia van Oppen (a), David Townend (f), Andre Dekker (b), Michel Dumontier (a)

(a) Institute of Data Science, Maastricht University, Maastricht, The Netherlands
(b) Department of Radiation Oncology (MAASTRO), GROW School for Oncology and Developmental Biology, Maastricht University Medical Centre+, Maastricht, The Netherlands
(c) Statistics Netherlands (Centraal Bureau voor de Statistiek), Heerlen, The Netherlands
(d) Department of Social Medicine, CAPHRI Care and Public Health Research Institute, Maastricht University, The Netherlands
(e) Department of Internal Medicine, CARIM School for Cardiovascular Diseases, Maastricht University, Maastricht, The Netherlands
(f) Department of Health, Ethics and Society, CAPHRI Research School, Maastricht University, Maastricht, The Netherlands

Abstract

It is widely anticipated that the use and analysis of health-related big data will enable further understanding and improvements in human health and wellbeing. Here, we propose an innovative infrastructure, which supports secure and privacy-preserving analysis of personal health data from multiple providers with different governance policies. Our objective is to use this infrastructure to explore the relation between Type 2 Diabetes Mellitus status and healthcare costs. Our approach involves the use of distributed machine learning to analyze vertically partitioned data from the Maastricht Study, a prospective population-based cohort study, and data from the official statistics agency of the Netherlands, Statistics Netherlands (Centraal Bureau voor de Statistiek; CBS). This project seeks an optimal solution accounting for scientific, technical, and ethical/legal challenges. We describe these challenges, our progress towards addressing them in a practical use case, and a simulation experiment.

Keywords:
Health Information Systems, Data Science, Machine Learning

Introduction

A growing amount of personal health data are being collected by a variety of entities, such as healthcare providers, insurance companies, and wearable device manufacturers. Use of personal health data such as health status, current and prior medications, lifestyle and behavior offers unprecedented opportunities to augment our understanding of human health and disease. This contributes to improved diagnostic accuracy and efficiency [1,2], and facilitates the transition to preventive [3,4] and precision medicine [5–7]. Moreover, the analysis of health data can help governments pursue effective health policies while minimizing healthcare costs. Such innovation arises from the secondary use of health data for research.

However, a major barrier to research lies in the difficulty of accessing and analyzing health data that are dispersed in both their form (e.g. medical records, consumer activity, and social media), representation (structured, semi-structured, and/or unstructured), and stewardship (who is responsible for data collection and governance?). While many methods to represent and exchange healthcare data have been developed [8], there has been a lack of focus on legal-ethical concerns such as data ownership and data stewardship as well as issues relating to privacy, security, and confidentiality [9]. Such considerations are particularly crucial when use and analysis of health data involve multiple legal entities, different data standards, a lack of detailed provenance, and unclear access authorization procedures.

Another significant challenge lies in the analysis of personal health data from multiple sources. The simplest case is where data are horizontally partitioned, such that data about different sets of individuals are located in different sites. Analyzing these distributed data is relatively well understood and reduces to combining a set of models from each site. A more challenging case is where data are vertically partitioned: different attributes about a particular individual are distributed over a set of data sources. While in the case of horizontally partitioned data analytical results are combined afterwards, this is not possible in the vertically partitioned case since none of the data providers can execute the complete analysis independently of the other providers. This is particularly challenging either when there is a legal impediment to link records across data providers with a unique identifier or when this unique identifier is unavailable. Addressing this challenge effectively requires a great level of technical sophistication to simultaneously address legal and/or privacy constraints.

Instead of centralizing the data for the analysis, one could use distributed learning methods, which operate over vertically partitioned data. In such a scenario, data-processing algorithms are sent to each site, and can only return the results of an analysis rather than any of the original data. One such infrastructure is the Personal Health Train (PHT) [10,11], which sends applications (the trains) containing algorithms to the data sources (the stations). The station can inspect whether the train is allowed to execute the application on (a subset of) the available data. The PHT empowers data subjects with more control (who can access the data?) and transparency (what are the trains requesting?). Hence, the PHT facilitates authorized algorithmic processing in a secure manner at multiple data sites without requiring a transfer of (original) data to a centralized location. Moreover, the PHT implements privacy-by-design in the following ways: 1) it can restrict which data elements are available to an application, 2) it can restrict the results of the analysis to only processed data, rather than original data, and 3) no data party can see the data of other parties in the network.

Here, we describe an implementation of the PHT that uses a Trusted Secure Environment (TSE) to analyze vertically
partitioned data that are prepared in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable) [12]. By describing data using the FAIR principles, the infrastructure becomes agnostic to certain syntactic data structures (e.g. OHDSI, CDISC-ODM or HL7 v2/v3/FHIR), as the applications, executed at the data source, should be able to interpret different types of data structures. To test the feasibility of this infrastructure, we combine data from two independent data providers to investigate how Type 2 Diabetes Mellitus (T2DM) status affects healthcare cost. The first dataset comes from the Maastricht Study¹, an observational prospective population-based cohort study focusing on the etiology of T2DM, and the second comes from the official statistics office in the Netherlands: Statistics Netherlands¹ (Centraal Bureau voor de Statistiek; CBS). We present preliminary results involving simulated data and discuss the challenges and feasibility of such an infrastructure to be scalable and secure.

Methods

In this section, we describe the development of our proposed infrastructure from a scientific, technical, and legal perspective to support the workflow. We then describe the simulation experiment used to test the usability of our infrastructure.

Development Workflow

The PHT architecture has been previously used to analyze horizontally partitioned datasets [13–16]. Here, we extend this work to include vertically partitioned data. While several studies discuss exchanging and analyzing vertically partitioned data [17,18], these are largely theoretical and overlook practical challenges, e.g. legal and ethical considerations, incompatible data management standards, scalability of the infrastructure, lack of financial support to sustain such efforts, and the technical requirements of learning from vertically partitioned data. To tackle these challenges, our team has established three interlocking work packages that target: i) the scientific questions in the medical domain; ii) the ethical, legal, and societal issues; and iii) the technical aspect. These packages are highly intertwined to ensure the development of practical solutions.

Scientific Perspective

To develop infrastructure that is useful to scientific researchers, we have identified key research questions that the infrastructure should help answer. Answering these research questions should require the combination of sensitive (non-public) data from multiple providers. To combine data from multiple providers, a substantive set of individuals should be shared by the providers and at least some attributes of these individuals should be present in both datasets to enable linking of the data records (and not necessarily by some specific individual identifier).

ELSI Perspective

The Ethical, Legal, and Societal Issues (ELSI) team deals with two types of challenges: i) privacy concerns that arise from the special nature of personal health data³; and ii) the legal challenges that arise from working with multiple data providers, each with a distinct governance framework. Combining data from multiple parties is a relatively new phenomenon, and often not foreseen when establishing the legal framework when the data are collected. Therefore, one of the major challenges has been to facilitate this study whilst adhering to the original legal framework and defined purpose. In doing so, the ELSI team has examined the reach of the original legal basis (i.e. informed consent) and purpose for which each data provider obtained the personal data, and is further analyzing the legal basis and purpose for which secondary processing can occur. Options that are being considered include but are not limited to the route of compatible processing and the route of scientific research in the public interest. Additionally, there are a number of limitations from the data providers themselves regarding accessing, sharing, and linking data. In addition, for this challenge, a legal framework has to be formulated in order to establish collaborations between the data providers, among themselves and with the research team. Constructing this legal framework and finding the proper legal basis for the researchers’ team is a valuable contribution from the ELSI team.

Technical Perspective

Following the PHT architecture², we use the concepts of (FAIR data) stations³, rails (infrastructure) and (applications) trains. The minimal requirement of a FAIR data station is to enable execution of applications, where data providers decide whether to execute the application. These FAIR data stations are based on Semantic Web technologies such as the Resource Description Framework (RDF) [19], to convert the source data⁴, and make the converted data FAIR.

Application (train) developers (i.e., researchers) can create the application trains using Docker containers [20], which are lightweight virtual machines. The Docker container carries all required software packages to execute the application on board. These applications can for instance query data available in the data station, perform data cleaning/formatting, and execute machine learning or statistical analysis [15]. Only the results of these (analytical) applications are sent back to the application developers.

To implement the proposed infrastructure, we created three stations. Two FAIR data stations are at the Maastricht Study and at CBS. A third station was configured as a “Trusted Secure Environment” (TSE), containing no data by itself, however, acting as a trusted and independent entity. Additionally, we created two application trains. The first application train extracts the data from two data stations, pseudonymizes the personal identifiers, encrypts the dataset, and sends the data to the TSE station. The second application train decrypts the data and analyzes the data at the TSE. For every execution, both application trains are configured for proper encryption and security measures.

Experiment Design

Prior to feeding our infrastructure with real data, we conducted a simulation experiment with two scenarios where researchers combine data from two independent providers using a TSE station. We monitor time to obtain the analytical results for each scenario. Scenario 1 consists of two providers, A and B, each having the same (small) number of individuals; Scenario 2 consists of providers A and B, but provider B has a much larger set of individuals, including all Provider A’s individuals. For these scenarios, we use data from a publicly available dataset which contains attributes that could be interpreted as sex, body mass index (BMI), number of children, smoking status, region, and health insurance reimbursement of participants [21]. Additionally, we generated artificial personal identifiers including date of birth, zip code, house number, and sex for linking purpose [22]. In practice, combining multiple datasets might be prone to record-linking errors. We will discuss this in more detail in the Discussion section. Please find this synthetic dataset in Figshare⁵. This dataset is vertically split over the two providers: both have artificial personal identifiers (date of birth, zip code, house number, and sex). Only Provider A has BMI, number of children, and smoking status, while only Provider B has living region and health insurance reimbursement. In scenario 1, both providers have 1338 patients. In scenario 2, Provider A still has 1338 patients while Provider B hosts 64,400 patients. Since Provider A in the second scenario only hosts a small subset of Provider B, a single record of Provider A might match with several records from Provider B. Even though this scenario is often encountered in practice, few solutions are available to address this linking challenge for vertically partitioned data [23].

For our experiment, we developed application trains using Docker 18.03.1. Pseudonymization, encryption, verification, and record linkage were implemented in Python 2.7. The infrastructure was tested with a 2.5GHz PC with 16GB RAM and 500 GB hard disk.

Result

In this section, we detail the contributions of each of the three work packages. Next, we discuss the outcome of the experiment.

Figures 1 and 2 provide an illustration of the infrastructure. In Figure 1, an overview of the operational framework for two providers, A and B, and a trusted secure environment, TSE, is presented. In Figure 2, we present the technical and legal requirements of the FAIR data stations. Researchers request permission to access and process data from the data provider. Once permission is granted, application trains to pseudonymize and encrypt the data are sent and executed in the data stations. Next, the encrypted data are sent to the TSE, followed by the data analysis application (from the researchers).

Figure 1 - Conceptual overview of the proposed infrastructure. Data access is regulated by the data provider hosting the stations. If access is granted, the data providers encrypt the data and send these to the TSE. The TSE executes the researchers’ application and allows aggregated results to be returned to the researcher.

In the FAIR data stations (Figure 2), personal identifiers are pseudonymized by one-way hashing and salting techniques. One-way hashing turns any format of data into a fixed-length "fingerprint" that cannot be reversed. Salt, as a random string, is appended to data before hashing, to eliminate the risk of malicious decryption. We used Secure Hash Algorithm 2 (SHA-512) as the one-way hashing function and random salts are shared by two data providers to make personal identifiers pseudonymized on both sites. This results in a unique code per record, allowing linking the same records from all data providers. Every time data providers grant researchers permission to process/analyze the data, the personal identifiers get pseudonymized using different salts. The salt needs to be created and agreed upon by all data providers. Additionally, to safeguard secure transfer, processed data are encrypted, prior to sending them to TSE. The same as with the salt, encryption keys are re-generated every time.

Figure 2 - Overview of data stations and application trains. Within each station, data are prepared, i.e., legal conditions are checked, FAIR principles implemented, personal identifiers pseudonymized, and encrypted. The application train enters the data station with algorithms and leaves with results or processed data.

The procedure then continues as follows: when the encrypted data are sent to the TSE, a notification is generated by the data stations to confirm the successful execution and departure of the train. After all encrypted data arrive at the TSE station, the researchers trigger analysis at the TSE with a set of keys and an application that includes code for the analysis. There is one private key per data station to decrypt the dataset, and one verification key to test the dataset integrity. The data station can only encrypt using the public key but cannot decrypt. The TSE station maintains the private key to decrypt for this specific data provider. After getting verified and decrypted data from both providers, the data can be linked and merged by pseudonymized personal identifiers. As the salted hashes performed at the data station are unknown to the TSE, it is not able to reverse or decrypt sensitive data such as personal identifiers. Thus, in addition to pseudonymization and encryption, the privacy of information is further protected as no data provider has direct access to the TSE. After executing the analytical algorithms on the merged dataset, the TSE checks whether the results reveal any personal identifiable information. Only the validated results such as figures and/or tables that do not contain any personal identifiable information are returned to the researchers. Finally, all (received and created) data in the TSE are destroyed.

Simulation Experiment

We used our proposed infrastructure to analyze synthetic data (discussed in the Methods section) that was vertically partitioned to form two datasets, each with a different data provider. Figure 3 shows one such result: a plot of BMI and health insurance reimbursement over one calendar year. While simple, the simulation experiment provides evidence for the feasibility of the infrastructure to execute an analysis, in this case, retrieval of a relation between two attributes in separate datasets in a secure and privacy-preserving manner.

1 Statistics Netherlands is a Dutch governmental institution that gathers statistical information about the Netherlands: https://www.cbs.nl/en-gb
2 PHT architecture: https://bitbucket.org/jvsoest/pytaskmanager.git
3 FAIR stations: http://github.com/maastroclinic/DataFAIRifier
4 Convert CSV file to RDF file: https://github.com/sunchang0124/FAIRHealth/
5 Find our synthetic datasets: https://doi.org/10.6084/m9.figshare.7379810.v2.
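As an illustration of the pseudonymization and linkage steps described in the Result section, the following is a minimal sketch, assuming Python 3 and only the standard library (the experiment itself reports Python 2.7). It is not the authors' implementation. The field names, the toy records, the `|` separator, and the helper names `pseudonymize`, `prepare`, and `link` are hypothetical. Encryption, integrity verification, and the check of results for identifiable information are omitted.

```python
import hashlib
import secrets

# Identifier attributes named in the text; the dictionary keys are assumptions.
ID_FIELDS = ("date_of_birth", "zip_code", "house_number", "sex")


def pseudonymize(record, salt):
    """Return the SHA-512 hex digest of the identifiers with the salt appended."""
    # Joining with "|" is an assumption; both providers must use the same encoding.
    identifier = "|".join(str(record[field]) for field in ID_FIELDS)
    return hashlib.sha512((identifier + salt).encode("utf-8")).hexdigest()


def prepare(records, salt):
    """Run at a data station: replace the identifiers of each record by a pseudonym."""
    prepared = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in ID_FIELDS}
        row["pseudonym"] = pseudonymize(record, salt)
        prepared.append(row)
    return prepared


def link(prepared_a, prepared_b):
    """Run at the TSE: join the two pseudonymized datasets on the pseudonym."""
    index_b = {}
    for row in prepared_b:
        index_b.setdefault(row["pseudonym"], []).append(row)
    merged = []
    for row_a in prepared_a:
        # One record of A may match several records of B (the scenario 2 problem).
        for row_b in index_b.get(row_a["pseudonym"], []):
            merged.append({**row_b, **row_a})
    return merged


# Hypothetical toy records: Provider A holds BMI, Provider B holds reimbursement.
provider_a = [
    {"date_of_birth": "1960-05-17", "zip_code": "1234AB", "house_number": 12,
     "sex": "F", "bmi": 27.9},
    {"date_of_birth": "1985-01-30", "zip_code": "5678CD", "house_number": 3,
     "sex": "M", "bmi": 22.7},
]
provider_b = [
    {"date_of_birth": "1960-05-17", "zip_code": "1234AB", "house_number": 12,
     "sex": "F", "reimbursement": 16884.92},
    # Two individuals in B share all four identifiers (e.g. twins at one address).
    {"date_of_birth": "1985-01-30", "zip_code": "5678CD", "house_number": 3,
     "sex": "M", "reimbursement": 1725.55},
    {"date_of_birth": "1985-01-30", "zip_code": "5678CD", "house_number": 3,
     "sex": "M", "reimbursement": 4449.46},
    {"date_of_birth": "1990-09-09", "zip_code": "9012EF", "house_number": 7,
     "sex": "F", "reimbursement": 3866.86},
]

# A fresh salt per approved request, agreed upon by both providers, unknown to the TSE.
salt = secrets.token_hex(16)
merged = link(prepare(provider_a, salt), prepare(provider_b, salt))
```

In this toy run the second record of Provider A links to two records of Provider B, which is the ambiguity noted for scenario 2: identifiers such as date of birth, zip code, house number, and sex need not be unique in a large population, so the linkage step has to decide how to treat multiple matches.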
Figure 3. Plot of body mass index (BMI) versus health insurance reimbursement in the past year (dollars) from the analysis of a synthetic and vertically partitioned dataset using the proposed infrastructure.

We conducted an experiment with two scenarios. In the first scenario, where both providers host 1338 individuals, pseudonymization took 0.4 - 0.5 seconds and encryption took 0.1 - 0.2 seconds for each data station. At the TSE, verification and decryption took only 0.1 - 0.2 seconds, while record linkage took around 7.2 seconds. For the second scenario, where Provider A hosts 1338 individuals and Provider B 64,400 individuals, pseudonymization for Provider B took 7.3 seconds and encryption 2.5 seconds. The total time cost at the TSE increased to about 15 seconds. From our experiment, we found that pseudonymization (at data stations) and record linkage (at the TSE) consumed approximately 80% of the running time. Future work will focus on operational performance measures, and among others, the size of provider datasets and number of attributes considered in linking.

Discussion

We have described and demonstrated a distributed learning infrastructure using artificial and vertically partitioned data involving two providers and a trusted secure environment. This is a preliminary, but promising result.

Our long-term goal is to deploy the infrastructure to analyze actual data from two independent organizations - Statistics Netherlands and the Maastricht Study. Thus far, we have requested data for 3451 consenting participants from the Maastricht Study, which is characterized by extensive phenotyping and provides information on the etiology, pathophysiology, complications, and comorbidities of T2DM. All participants are aged between 40 and 75 years and live in the southern part of The Netherlands. We requested those attributes which were complete and consented. Attributes include socio-demographic factors, lifestyle factors, the status of T2DM, physical function, mental functions, BMI, and cardiovascular disease history. From CBS, we requested regional population data of health insurance reimbursement available at Statistics Netherlands. As of November 2018, all application trains have been developed. We are in the stage of approving and building data stations for the Maastricht Study and CBS. A joint controllership agreement between the two organizations is established to enable the TSE. We are preparing analytic algorithms that will 1) answer scientific questions regarding the associations between T2DM status and healthcare costs, and 2) evaluate the performance and security of our infrastructure.

Applying the infrastructure to real-world situations will present several challenges. Although we have only explored a two-data-provider scenario, we anticipate that it can be extended to more than two providers. While the performance of this system will depend on the size of the data and the algorithms used for encryption, merging, and analysis, we believe that the biggest bottleneck is in creating consortium agreements and deploying the infrastructure in individual facilities. As such, there is a need to further develop a “PHT” deployment kit that enables stakeholders to consider all the issues and options and make informed decisions in the most efficient manner. A second challenge in our implementation lies in the possibility of errors caused by linking vertically partitioned datasets. The accuracy of matching across these will decrease owing to missing data, typographical errors, differences in pseudonymization procedures, and different formats of identifying information. In addition, to match a fraction of records from multiple large datasets, the data providers could limit the size of their data by sending only a selection to TSE. This selection can be discovered and defined by sending exploratory or individual selection algorithms first. For instance, in our case, instead of sending the information of the entire Dutch population to the TSE, only a subset of the Dutch population which meets the criteria of the Maastricht Study sample is sent to the TSE. However, note that this selection might also leak information about the individuals in the data of (one or more) data providers. We intend to explore the impact of such aspects in future studies. A third challenge is how to manage and transport the keys securely among different parties. The TSE requires decryption and verification keys to decrypt the data and run the analysis algorithms. This approach must be agreed on by all parties from both technical and ethical-legal perspectives.

Conclusions

To analyze vertically partitioned data, we extended a Personal Health Train (PHT) infrastructure to send data analysis algorithms to multiple data stations and return only the results instead of the original data to the researchers.

This infrastructure was developed in a coordinated manner across multiple scientific, technical, ethical, legal, and societal aspects involving several units and organizations. This coordination across interests is essential to explore viable solutions for data sharing and reuse, as envisioned by the proponents of the FAIR principles. In particular, the idea of bringing the algorithm to the data, rather than obtaining consent to receive a copy of the data, offers an entirely new paradigm that has not been considered by most organizations. Having a new paradigm will require stakeholders to take the time and effort to thoroughly evaluate this in terms of their legal and technical requirements. However, as our experiment suggests, it offers a scalable way to analyze vertically partitioned data in a secure and privacy-preserving manner. Additional operational and security enhancements are still needed before the infrastructure is suited to deal with real (sensitive) data. Future work will explore the quality of scientific discovery (accuracy of outcome), the security, scalability, sustainability, and performance of computation. While no solution will be perfect for all situations, we believe that this adaptation of the PHT model will find utility in situations involving sensitive data with a multitude of stakeholders.

Acknowledgements

This project is funded by the Dutch National Research Agenda (NWA; project number: 400.17.605) and the provincial funding for the Limburg Meet (LIME) project. We thank our partners Statistics Netherlands (CBS) and the Maastricht Study (DMS) for contributing knowledge and support.
References

[1] G. Dougherty, Digital Image Processing for Medical Applications, (n.d.) 485.
[2] H. Elshazly, A.T. Azar, A. El-korany, and A.E. Hassanien, Hybrid system for lymphatic diseases diagnosis, in 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, Mysore, 2013: pp. 343–347. doi:10.1109/ICACCI.2013.6637195.
[3] E.A. Clarke, What is Preventive Medicine?, Can Fam Physician. 20 (1974) 65–68.
[4] I. Barbier-Feraud, J.B. Malafosse, P. Bouexel, C. Commaille-Chapus, A. Gimalac, G. Jeannerod, M. Léo, B. Leroy, B. Nordlinger, M. Paoli, I. Assistance, P. Prados, and J.-Y. Robin, Big data and prevention from prediction to demonstration, (n.d.) 80.
[5] R.B. Stricker, and L. Johnson, Lyme disease: the promise of Big Data, companion diagnostics and precision medicine, Infect Drug Resist. 9 (2016) 215–219. doi:10.2147/IDR.S114770.
[6] J.S. Beckmann, and D. Lew, Reconciling evidence-based medicine and precision medicine in the era of big data: challenges and opportunities, Genome Medicine. 8 (2016). doi:10.1186/s13073-016-0388-7.
[7] S. Dolley, Big Data’s Role in Precision Public Health, Front Public Health. 6 (2018). doi:10.3389/fpubh.2018.00068.
[8] E.H. Shortliffe, and J.J. Cimino, eds., Biomedical Informatics: Computer Applications in Health Care and Biomedicine, 4th ed., Springer-Verlag, London, 2014. //www.springer.com/us/book/9781447144731 [accessed November 8, 2018].
[9] M.J. Bietz, C.S. Bloss, S. Calvert, J.G. Godino, J. Gregory, M.P. Claffey, J. Sheehan, and K. Patrick, Opportunities and challenges in the use of personal health data for health research, Journal of the American Medical Informatics Association. 23 (2016) e42–e48. doi:10.1093/jamia/ocv118.
[10] van S. Johan, S. Chang, M. Ole, P. Marco, van den B. Bob, M. Alexander, van O. Claudia, T. David, D. Andre, and D. Michel, Using the Personal Health Train for Automated and Privacy-Preserving Analytics on Vertically Partitioned Data, Studies in Health Technology and Informatics. (2018) 581–585. doi:10.3233/978-1-61499-852-5-581.
[11] Personal Health Train, Dutch Techcentre for Life Sciences. (n.d.). https://www.dtls.nl/fair-data/personal-health-train/ [accessed October 22, 2018].
[12] M.D. Wilkinson, M. Dumontier, Ij.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L.B. da Silva Santos, P.E. Bourne, J. Bouwman, A.J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C.T. Evelo, R. Finkers, A. Gonzalez-Beltran, A.J.G. Gray, P. Groth, C. Goble, J.S. Grethe, J. Heringa, P.A. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S.J. Lusher, M.E. Martone, A. Mons, A.L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M.A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data. 3 (2016) 160018.
[13] J.P.A. van Soest, A.L.A.J. Dekker, E. Roelofs, and G. Nalbantov, Application of Machine Learning for Multicenter Learning, in: I. El Naqa, R. Li, and M.J. Murphy (Eds.), Machine Learning in Radiation Oncology, Springer International Publishing, Cham, 2015: pp. 71–97. doi:10.1007/978-3-319-18305-3_6.
[14] A. Damiani, M. Vallati, R. Gatta, N. Dinapoli, A. Jochems, T. Deist, J. van Soest, A. Dekker, and V. Valentini, Distributed Learning to Protect Privacy in Multi-centric Clinical Studies, in: J.H. Holmes, R. Bellazzi, L. Sacchi, and N. Peek (Eds.), Artificial Intelligence in Medicine, Springer International Publishing, 2015: pp. 65–75. http://link.springer.com/chapter/10.1007/978-3-319-19551-3_8 [accessed June 25, 2015].
[15] T.M. Deist, A. Jochems, J. van Soest, G. Nalbantov, C. Oberije, S. Walsh, M. Eble, P. Bulens, P. Coucke, W. Dries, A. Dekker, and P. Lambin, Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care: euroCAT, Clinical and Translational Radiation Oncology. 4 (2017) 24–31. doi:10.1016/j.ctro.2016.12.004.
[16] A. Jochems, T.M. Deist, J. van Soest, M. Eble, P. Bulens, P. Coucke, W. Dries, P. Lambin, and A. Dekker, Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the hospital – A real life proof of concept, Radiotherapy and Oncology. 121 (2016) 459–467. doi:10.1016/j.radonc.2016.10.002.
[17] J. Vaidya, A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data, in: C.C. Aggarwal, and P.S. Yu (Eds.), Privacy-Preserving Data Mining, Springer US, Boston, MA, 2008: pp. 337–358. doi:10.1007/978-0-387-70992-5_14.
[18] T.Z. Gál, G. Kovács, and Z.T. Kardkovács, Survey on privacy preserving data mining techniques in health care databases, Acta Universitatis Sapientiae, Informatica. 6 (2014) 33–55. doi:10.2478/ausi-2014-0017.
[19] RDF 1.1 Concepts and Abstract Syntax, (n.d.). https://www.w3.org/TR/rdf11-concepts/ [accessed March 22, 2019].
[20] What is a Container, Docker. (2018). https://www.docker.com/resources/what-container [accessed October 22, 2018].
[21] B. Lantz, Machine learning with R: learn how to use R to apply powerful machine learning methods and gain an insight into real-world applications, Packt Publ, Birmingham, 2013.
[22] Welcome to Faker’s documentation! — Faker 1.0.2 documentation, https://faker.readthedocs.io/en/master/ [accessed March 22, 2019].
[23] S. Hardy, W. Henecka, H. Ivey-Law, R. Nock, G. Patrini, G. Smith, and B. Thorne, Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption, ArXiv:1711.10677 [Cs]. (2017). http://arxiv.org/abs/1711.10677 [accessed September 18, 2018].

Address for Correspondence

Chang Sun, Universiteitssingel 60, 6229ER Maastricht, chang.sun@maastrichtuniversity.nl, +31 6 36373387.
