Systematic Review of Privacy-Preserving Distributed Machine Learning From Federated Databases in Health Care
Big data for health care is one of the potential solutions to deal with the numerous challenges of health care, such as rising cost, aging population, precision medicine, universal health coverage, and the increase of noncommunicable diseases. However, data centralization for big data raises privacy and regulatory concerns.
Covered topics include (1) an introduction to privacy of patient data and distributed learning as a poten-
tial solution to preserving these data, a description of the legal context for patient data research, and
a definition of machine/deep learning concepts; (2) a presentation of the adopted review protocol; (3)
a presentation of the search results; and (4) a discussion of the findings, limitations of the review, and future
perspectives.
Distributed learning from federated databases makes data centralization unnecessary. Distributed algorithms
iteratively analyze separate databases, essentially sharing research questions and answers between databases
instead of sharing the data. In other words, one can learn from separate and isolated datasets without patient
data ever leaving the individual clinical institutes.
Distributed learning holds great potential to facilitate big data for medical applications, in particular for international consortia. Our purpose is to review the major implementations of distributed learning in health care.
JCO Clin Cancer Inform 4:184-200. © 2020 by American Society of Clinical Oncology
Licensed under the Creative Commons Attribution 4.0 License
Distributed Learning in Health Care
CONTEXT
Key Objective
Review the contribution of distributed learning to preserve data privacy in health care.
Knowledge Generated
Data in health care are greatly protected; therefore, accessing medical data is restricted by law and ethics. This restriction has
led to a change in research practice to adapt to new regulations. Distributed learning makes it possible to learn from medical
data without these data ever leaving the medical institutions.
Relevance
Distributed learning allows learning from medical data while guaranteeing preservation of patient privacy.
various decision trees. The results are then averaged. Each decision tree in the forest has access to a random set of the training data and chooses a class; the most selected class is then the predicted class.10

KNN: KNN can be used for regression problems; however, it is widely used for classification problems. In KNN, the assumption is that similar data elements are close to each other. Given K (a positive integer) and a test observation, KNN first groups the K elements closest to the test observation. Then, in the case of regression, it returns the mean of the K labels, or, in the case of classification, the mode of the K labels.10 Distributed version: yes.88,89

Abbreviations: KNN, K-nearest neighbors; MDP, Markov decision process; SVM, support vector machine.
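The KNN procedure summarized above can be sketched in a few lines. This is a minimal illustration with stdlib Python only; the function and variable names are ours, not taken from any of the reviewed studies:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k, task="classification"):
    """Predict for `query` from the K training points closest to it (Euclidean distance)."""
    # Rank all training points by distance to the test observation.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], query))
    k_labels = [label for _, label in neighbors[:k]]
    if task == "regression":
        return sum(k_labels) / k                       # mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]      # mode of the K labels

# Toy example: two well-separated clusters.
X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (5.1, 5.1), k=3))  # "B"
```

The distributed variants cited above (refs 88, 89) differ mainly in where the neighbor search runs, not in this core prediction rule.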
222 healthy controls originating from four datasets were collected: two datasets from University Hospital Brno (Czech Republic), one from University Medical Center Utrecht (The Netherlands), and the last from the Prague Psychiatric Center and Psychiatric Hospital Bohnice. The data were segmented and standardized; the resulting performance was better than the performance of local models.
TABLE 2. Summary of Methods and Results of Distributed Machine Learning Studies Grouping More Than One Health Care Center (Continued)

Reference: Jochems63
Data and target: Clinical data from 698 patients with lung cancer, treated with curative intent with CRT or RT alone, were collected and stored in two medical institutes: MAASTRO (Netherlands) and Michigan University (United States). Target: prediction of NSCLC 2-year survival after radiation therapy.
Methods and distributed learning approach: Distributed learning for a Bayesian network using data from three hospitals. The model used the T category and N category, age, total tumor dose, and WHO performance for predictions.
Tools: Varian learning portal.
Accomplishments and results: AUC, 0.662. The discriminative performance of centralized and distributed models on the validation set was similar.

Reference: Brisimi65
Data and target: Electronic health records from Boston Medical Center of patients with at least one heart-related diagnosis between 2005 and 2010. The data are distributed between 10 hospitals.
Methods and distributed learning approach: Soft-margin l1-regularized sparse SVM classifier. Developed an iterative cPDS algorithm for solving the large-scale SVM problem in a decentralized manner.
Tools: Not provided.
Accomplishments and results: AUC, 0.56.

Abbreviations: AUC, area under the curve; BOA, Beyond Ontology Awareness; COBRA, Consortium for Brachytherapy Data Analysis; cPDS, cluster Primal Dual Splitting; CRT, chemoradiation; NSCLC, non–small-cell lung cancer; RT, radiotherapy; SVM, support vector machine.
FIG 2. Schematic representation of the processes in a transparent distributed learning network. (A) Data preparation steps. (B) Distributed learning network,
which is composed of three hospitals, each of which is equipped with a learning machine that can communicate with a master machine responsible for
sending model parameters and checking convergence criteria. (C) Flowchart of the distributed learning network described in B. (D) Example of an action that can be tracked by blockchain (designed and implemented according to needs agreed among network members), keeping all network participants aware of any new activity in the network. DB, database; FAIR, findable, accessible, interoperable, reusable.
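The master/hospital exchange described in Figure 2 can be illustrated with a minimal round-based averaging loop. This is a sketch under our own simplifying assumptions (a one-feature linear model, plain stochastic gradient descent, equal weighting of hospitals); all names are illustrative and do not come from the reviewed implementations:

```python
import random

def local_update(weights, data, lr=0.1, epochs=1):
    """One hospital: refine the master's parameters using local data only."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y      # linear-model residual on one record
            w -= lr * err * x
            b -= lr * err
    return (w, b)

def federated_round(master, hospital_datasets):
    """Master sends parameters out, then averages the returned local models."""
    locals_ = [local_update(master, d) for d in hospital_datasets]
    n = len(locals_)
    return (sum(w for w, _ in locals_) / n, sum(b for _, b in locals_) / n)

# Three hospitals hold disjoint samples of the same relation y = 2x + 1;
# the raw (x, y) records never leave the hospital that owns them.
random.seed(0)
hospitals = [[(x, 2 * x + 1) for x in (random.random() for _ in range(20))]
             for _ in range(3)]
master = (0.0, 0.0)
for _ in range(200):                   # iterate until a convergence criterion is met
    master = federated_round(master, hospitals)
print(master)                          # approaches (2.0, 1.0)
```

Only model parameters cross institutional boundaries in this loop, which is the essential privacy property the figure depicts; real networks add convergence checks, secure channels, and auditing on top.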
thickness estimation, and so on) of human brain magnetic resonance images.35 The results demonstrated performance improvement on the test datasets. Similar to the previous study, a brain tumor segmentation was successfully performed using distributed deep learning across 10 institutions (BraTS distribution).36

In the matter of distributed deep learning, the training weights are combined to train a final model, and the raw data are never exposed.35,37 In the case of sharing the local gradients,25 it might be possible to retrieve estimations of the original data from these gradients. Training the local models on batches may prevent retrieving all the data from the gradients, as these gradients correspond to single batches rather than all the local data.38 However, setting an optimal batch size needs to be considered25 to assure data safety and the model's ability to generalize.28,39,40

PRIVACY AND INTEGRATION OF DISTRIBUTED LEARNING NETWORKS

Privacy in a distributed learning network addresses three main areas: data privacy, the implemented model's privacy, and the model's output privacy. Data privacy is achieved by means of data anonymization and data never leaving the medical institutions. The distributed learning model can be secured by applying differential privacy techniques,41 preventing leakage of weights during the training, and cryptographic techniques.42 These cryptographic techniques provide a set of multiparty protocols that ensure security of the computations and communication. Once the model is ready, not only can the network participants use it to learn from their data, but this learning should be able to be performed locally and under highly private and secure conditions to protect the model's output.23

The users of a machine/deep learning model are not necessarily the model's developers. Hence, documentation and the integration of automated data eligibility tests have two important assets:
• The documentation ensures providing a clear view of what the model is designed for, a technical description of the model, and its use.
• The eligibility tests are important to ensure that correct input data are extracted and provided before executing the model. In euroCAT,23 a distributed learning expert installed quality control via data extraction pipelines at every participant point in the network. The pipeline automatically allowed data records fulfilling the model training eligibility criteria to be used in the training. The experts also tested the extraction pipeline thoroughly in addition to the machine learning testing. However, there were post-processing compensation methods to correct for the variations caused by using different local protocols.19

DISCUSSION

If one examines oncology, for instance, cancer is clearly one of the greatest challenges facing health care. More than 16 million new cancer cases were reported in 2017 alone.43 This number climbed to 18.1 million cases in 2018.44 This
increasing number of cancer incidences45 means that there are undoubtedly sufficient data worldwide to put machine/deep learning to meaningful work. However, as highlighted earlier, this requires access to the data and, as also highlighted earlier, distributed learning enables this in a manner that resolves legal and ethical concerns. Nonetheless, integration of distributed learning into health care is much slower compared with other fields, which raises the question of why this should be. Here, we summarize a set of methodologies to facilitate the adoption of distributed learning and provide future directions.

CURRENT STATE OF MEDICAL DATA STORAGE AND PREPROCESSING

Information Communication Technology

Every hospital has its own storage devices and architecture.38,39 In this case, the information communication technology preparation for distributed learning requires significant energy, time, and manpower, which can be costly. This same process (data acquisition and preprocessing) needs to be repeated for each participating hospital,46-48 and medical data standardization protocols subsequently need to be developed and adopted for this implementation process.

Make the Data Readable: Findable, Accessible, Interoperable, Reusable Data Principles

One way to enable a virtuous circle network effect is to embrace another community engaged in synergistic activities (joining a distributed learning network is worthwhile if it links to another large network). The Findable, Accessible, Interoperable, Reusable (FAIR) Guiding Principles for data management and stewardship have gained substantial interest, but delivering scientific protocols and workflows that are aligned with these principles remains a significant challenge.49 A description of FAIR principles is represented in Figure 3. Technological solutions are urgently needed that will enable researchers to explore, consume, and produce FAIR data in a reliable and efficient manner, to publish and reuse computational workflows, and to define and share scientific protocols as workflow templates.50 Such solutions will address emerging concerns about the nonreproducibility of scientific research, particularly in data science (eg, poorly published data, incomplete workflow descriptions, limited ability to perform meta-analyses, and an overall lack of reproducibility).51,52 Because workflows are fundamental to research activities, FAIR has broad applicability, which is vital in the context of distributed learning with medical data.

FIG 3. Description of findable, accessible, interoperable, reusable (FAIR) principles. Findable: descriptive metadata; persistent identifiers. Accessible: specify what to share; risk management; participant consent management; access status. Interoperable: XML standards, including data documentation. Reusable: rights and license management; usage standards definition (what can and cannot be used).

WHY NOT PUBLICLY SHARE MEDICAL DATA?

Some studies have tried to facilitate and secure data-sharing procedures to encourage researchers and organizations to publicly share their data and embrace transparency,53 by proposing data-sharing procedures and protocols aiming to harmonize regulatory frameworks and research governance.54,55 Despite the efforts made toward data-sharing globalization, the sociocultural issues surrounding data sharing remain pertinent.56 Large clinical trials also face limitations in data collection capabilities because of limited data storage capacities and manpower. To retrospectively perform additional analyses, all the participating centers need to be contacted again, which is time consuming and delays research.57

Furthermore, medical institutions prefer not to share patient data to ensure privacy protection.58 This is, of course, in no small part about ensuring the trust and confidence of patients, who display a wide range of sensitivities toward the use of their personal data.

ORGANIZATIONAL CHANGE MANAGEMENT

The adoption of distributed learning will require a change in organizational management (such as making use of the newest data standardization techniques and adapting the roles of employees to more technically oriented tasks, such as data retrieval). Provided knowledge and understanding of proper change management concepts, health care providers can implement the latter successfully.59 Change management principles, such as defining a global vision, networking, and continuous communication, could facilitate the integration of new technologies and build up clinical capabilities. However, this process of change management can be complicated, because it requires the involvement of multiple health care centers from different countries and continents. This diversity can trigger a fear of loss (one of the major factors of financial decision making), which stems from differences of opinion and regulation,60 and the absence of data standardization, making the processes of data acquisition and preprocessing harder. In addition, the lack of knowledge about the new technology leads to resistance to accepting change and innovation.60,61 Therefore, it is important to help health care organizations understand the need for distributed learning by explaining the context of the change in terms of traditional ways of learning to distributed learning and
monetizing clinical research and data (giving patients the choice to share), processing claims, detecting fraud, and managing prescriptions (replacing incorrect and outdated data). In addition to the above-mentioned uses of blockchain, it has also been used to maintain security and scalability of clinical data sharing,73 secure medical record sharing,74 prevent drug counterfeiting,75 and secure a patient's location.76

It is essential that the use of distributed machine/deep learning and blockchain be harmonized with the available security-preserving technologies (ie, continuous development and cybersecurity), which begins at the user level (using strong passwords, connecting only through trusted networks, and so on) and ends with more complex information technology infrastructures (such as data anonymization and user ID encryption).77 Cybersecurity is a key aspect in preserving privacy and ensuring safety and trust among patients and health care systems.78 The continuous development, or postmarketing surveillance, can be seen as the set of checks and integrations that should occur when a distributed learning network is launched. This practice should make it possible to identify any weak security measures in the network or non-up-to-date features that may require re-implementation.79,80

The distributed learning and blockchain technologies presented here show that there are emerging data science solutions that begin to meet the concerns and shortcomings of the law. The problems of re-identification are greatly reduced and managed through these technologies. Clearly, there are conceptual issues of understanding the impact of these technologies on privacy, and the relationship between privacy and confidentiality, but there are significant technical developments for the regulators to consider that could answer a number of their concerns.

SUMMARY

Currently, a combination of regulations and ethics makes it difficult to share data even for scientific research purposes. The issues relate to the legal basis for processing and anonymization. Specifically, there has been reluctance to move away from informed consent as the legal basis for processing toward processing in the public interest, and there are concerns about the re-identification of individuals where data are de-identified and then shared in aggregated environments. A solution could be to allow researchers to train their machine learning programs without the data ever having to leave the clinics, which in this paper we have established as distributed learning. This safe practice makes it possible to learn from medical data and can be applied across various medical disciplines. A limitation to its application, however, is that medical centers need to be convinced to participate in such practice, and regulators also need to know suitable safeguards have been established. Moreover, as can be seen in Table 2, even with the use of distributed learning, the size of the data pool learned from remains rather small. In the future, the integration of blockchain technology into distributed learning networks could be considered, as it ensures transparency and traceability while following FAIR data principles and can facilitate the implementation of distributed learning.
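The traceability that blockchain would add to a distributed learning network rests on chaining cryptographically hashed records of network events. A minimal, self-contained hash-chain sketch follows; it is illustrative only (event strings and function names are ours) and is not the design of any system cited in this review:

```python
import hashlib
import json

def add_block(chain, event):
    """Append an event; each block hashes the previous one, so tampering is evident."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"event": event, "prev": prev_hash}
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append(body)
    return chain

def verify(chain):
    """Recompute every hash; any edited block breaks the links that follow it."""
    prev = "0" * 64
    for block in chain:
        body = {"event": block["event"], "prev": block["prev"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True

chain = []
add_block(chain, "Hospital H1 requests to join the network")
add_block(chain, "Network approves H1; training restarted with the extra data")
print(verify(chain))            # True
chain[0]["event"] = "tampered"  # rewriting history ...
print(verify(chain))            # ... is detected: False
```

Real deployments add distributed consensus and access control on top of this chaining, but the transparency and traceability properties discussed in the summary derive from exactly this linkage of hashes.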
REFERENCES
1. Mitchell TM: Machine Learning International ed., [Reprint.]. New York, NY, McGraw-Hill, 1997
2. Boyd S, Parikh N, Chu E, et al: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in
Machine Learning 3:1-122, 2010
3. Cardoso I, Almeida E, Allende-Cid H, et al: Analysis of machine learning algorithms for diagnosis of diffuse lung diseases. Methods Inf Med 57:272-279, 2018
4. Wang X, Peng Y, Lu L, et al: ChestX-Ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of
common thorax diseases. Presented at 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI, July 21-26, 2017
5. Ding Y, Sohn JH, Kawczynski MG, et al: A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology 290:456-464, 2019
6. Emmert-Streib F, Dehmer M: A machine learning perspective on personalized medicine: An automized, comprehensive knowledge base with ontology for
pattern recognition. Mach Learn Knowl Extr 1:149-156, 2018
7. Deist TM, Dankers FJWM, Valdes G, et al: Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers.
Med Phys 45:3449-3459, 2018
8. Lambin P, van Stiphout RG, Starmans MH, et al: Predicting outcomes in radiation oncology multifactorial decision support systems. Nat Rev Clin Oncol
10:27-40, 2013
9. Wang S, Summers RM: Machine learning and radiology. Med Image Anal 16:933-951, 2012
10. James G, Witten D, Hastie T, et al: An introduction to statistical learning: With applications in R. New York, NY, Springer, 2017
11. Sutton RS, Barto AG: Reinforcement Learning: An Introduction. https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
12. Deng L: Deep learning: Methods and applications. Foundations and Trends in Signal Processing 7:197-387, 2014
13. LeCun Y, Bengio Y, Hinton G: Deep learning. Nature 521:436-444, 2015
14. Garling C: Andrew Ng: Why ‘deep learning’ is a mandate for humans, not just machines. Wired 2015. https://www.wired.com/brandlab/2015/05/andrew-ng-
deep-learning-mandate-humans-not-just-machines/
15. Pesapane F, Codari M, Sardanelli F: Artificial intelligence in medical imaging: Threat or opportunity? Radiologists again at the forefront of innovation in
medicine. Eur Radiol Exp 2:35, 2018
16. Liberati A, Altman DG, Tetzlaff J, et al: The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care
interventions: Explanation and elaboration. PLoS Med 6:e1000100, 2009
16a. Intersoft Consulting: General Data Protection Regulation: Recitals. https://gdpr-info.eu/recitals/no-26/
17. MAASTRO Clinic: euroCAT: Distributed learning. https://youtu.be/nQpqMIuHyOk
18. Rennock MJW, Cohn A, Butcher JR: Blockchain technology and regulatory investigations. https://www.steptoe.com/images/content/1/7/v2/171967/LIT-
FebMar18-Feature-Blockchain.pdf
19. Orlhac F, Frouin F, Nioche C, et al: Validation of a method to compensate multicenter effects affecting CT radiomics. Radiology 291:53-59, 2019
20. Goodfellow I, Bengio Y, Courville A: Deep Learning. https://www.deeplearningbook.org/
21. Lambin P, Roelofs E, Reymen B, et al: Rapid Learning health care in oncology - an approach towards decision support systems enabling customised
radiotherapy. Radiother Oncol 109:159-164, 2013
22. Lustberg T, van Soest J, Jochems A, et al: Big Data in radiation therapy: Challenges and opportunities. Br J Radiol 90:20160689, 2017
23. Deist TM, Jochems A, van Soest J, et al: Infrastructure and distributed learning methodology for privacy-preserving multi-centric rapid learning health care:
euroCAT. Clin Transl Radiat Oncol 4:24-31, 2017
24. Price G, van Herk M, Faivre-Finn C: Data mining in oncology: The ukCAT project and the practicalities of working with routine patient data. Clin Oncol (R Coll
Radiol) 29:814-817, 2017
25. Dean J, Corrado G, Monga R, et al: Large Scale Distributed Deep Networks. Advances in Neural Information Processing Systems 25, 2012, 1223-1231. https://
papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012
26. Cireşan D, Meier U, Schmidhuber J: Multi-column deep neural networks for image classification. http://arxiv.org/abs/1202.2745
27. Radiuk PM: Impact of training set batch size on the performance of convolutional neural networks for diverse datasets. Information Technology and
Management Science 20:20-24, 2017
28. Keskar NS, Mudigere D, Nocedal J, et al: On large-batch training for deep learning: generalization gap and sharp minima. http://arxiv.org/abs/1609.04836
29. Papernot N, Abadi M, Erlingsson Ú, et al: Semi-supervised knowledge transfer for deep learning from private training data. http://arxiv.org/abs/1610.05755
30. Shokri R, Shmatikov V: Privacy-preserving deep learning, in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security -
CCS ’15. Denver, Colorado, ACM Press, 2015, pp 1310-1321.
31. Predd JB, Kulkarni SB, Poor HV: Distributed learning in wireless sensor networks. IEEE Signal Process Mag 23:56-69, 2006
32. Ji X, Hou C, Hou Y, et al: A distributed learning method for l1-regularized kernel machine over wireless sensor networks. Sensors (Basel) 16:1021, 2016
33. Chang K, Balachandar N, Lam C, et al: Distributed deep learning networks among institutions for medical imaging. J Am Med Inform Assoc 25:945-954, 2018
34. McClure P, Zheng CY, Kaczmarzyk J, et al: Distributed Weight Consolidation: A Brain Segmentation Case Study. https://arxiv.org/abs/1805.10863
35. FreeSurferWiki: FreeSurfer. http://freesurfer.net/fswiki/FreeSurferWiki
36. Sheller MJ, Reina GA, Edwards B, et al: Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation.
http://arxiv.org/abs/1810.04304
37. Li W, Milletarı̀ F, Xu D, et al: Privacy-preserving federated brain tumour segmentation. http://arxiv.org/abs/1910.00962
38. Abadi M, Chu A, Goodfellow I, et al: Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Com-
munications Security – CCS’16. 308-318, 2016
39. Mishkin D, Sergievskiy N, Matas J: Systematic evaluation of convolution neural network advances on the Imagenet. Comput Vis Image Underst 161:11-19,
2017
40. Lin T, Stich SU, Patel KK, et al: Don’t use large mini-batches, use local SGD. http://arxiv.org/abs/1808.07217
41. Biryukov A, De Cannière C, Winkler WE, et al: Discretionary access control policies (DAC), in van Tilborg HCA, Jajodia S (eds): Encyclopedia of Cryptography
and Security. Boston, MA, Springer, 2011, pp 356-358
42. Pinkas B: Cryptographic techniques for privacy-preserving data mining. SIGKDD Explor 4:12-19, 2002
43. Siegel RL, Miller KD, Jemal A: Cancer statistics, 2017. CA Cancer J Clin 67:7-30, 2017
44. Bray F, Ferlay J, Soerjomataram I, et al: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185
countries. CA Cancer J Clin 68:394-424, 2018
45. Siegel R, DeSantis C, Virgo K, et al: Cancer treatment and survivorship statistics, 2012. CA Cancer J Clin 62:220-241, 2012
46. Shortliffe EH, Barnett GO: Medical data: Their acquisition, storage, and use, in Shortliffe EH, Perreault LE (eds): Medical Informatics. New York, NY, Springer,
2001, pp 41-75
47. Shabani M, Vears D, Borry P: Raw genomic data: Storage, access, and sharing. Trends Genet 34:8-10, 2018
48. Langer SG: Challenges for data storage in medical imaging research. J Digit Imaging 24:203-207, 2011
49. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018, 2016
50. Wilkinson MD, Sansone S-A, Schultes E, et al: A design framework and exemplar metrics for FAIRness. Sci Data 5:180118, 2018
51. Dumontier M, Gray AJG, Marshall MS, et al: The health care and life sciences community profile for dataset descriptions. PeerJ 4:e2331, 2016
52. Jagodnik KM, Koplev S, Jenkins SL, et al: Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the
Commons Framework Pilots workshop. J Biomed Inform 71:49-57, 2017
53. Polanin JR, Terzian M: A data-sharing agreement helps to increase researchers’ willingness to share primary data: Results from a randomized controlled trial. J
Clin Epidemiol 106:60-69, 2018
54. Azzariti DR, Riggs ER, Niehaus A, et al: Points to consider for sharing variant-level information from clinical genetic testing with ClinVar. Cold Spring Harb Mol
Case Stud 4:a002345, 2018
55. Boué S, Byrne M, Hayes AW, et al: Embracing transparency through data sharing. Int J Toxicol 10.1177/1091581818803880
56. Poline J-B, Breeze JL, Ghosh S, et al: Data sharing in neuroimaging research. Front Neuroinform 6:9, 2012
57. Cutts FT, Enwere G, Zaman SMA, et al: Operational challenges in large clinical trials: Examples and lessons learned from the Gambia pneumococcal vaccine trial. PLoS Clin Trials 1:e16, 2006
58. Xia W, Wan Z, Yin Z, et al: It’s all in the timing: Calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc 25:25-31, 2018
59. Fleishon H, Muroff LR, Patel SS: Change management for radiologists. J Am Coll Radiol 14:1229-1233, 2017
60. Delaney R, D’Agostino R: The challenges of integrating new technology into an organization. https://digitalcommons.lasalle.edu/cgi/viewcontent.cgi?
article=1024&context=mathcompcapstones
61. Agboola A, Salawu R: Managing deviant behavior and resistance to change. Int J Bus Manage 6:235, 2010
62. Jochems A, Deist TM, van Soest J, et al: Distributed learning: Developing a predictive model based on data from multiple hospitals without data leaving the
hospital - A real life proof of concept. Radiother Oncol 121:459-467, 2016
63. Jochems A, Deist TM, El Naqa I, et al: Developing and validating a survival prediction model for NSCLC patients through distributed learning across 3 countries.
Int J Radiat Oncol Biol Phys 99:344-352, 2017
63a. Deist TM, Dankers FJWM, Ojha P, et al: Distributed learning on 20 000+ lung cancer patients - The Personal Health Train. Radiother Oncol 144:189-200,
2020
64. Tagliaferri L, Gobitti C, Colloca GF, et al: A new standardized data collection system for interdisciplinary thyroid cancer management: Thyroid COBRA. Eur
J Intern Med 53:73-78, 2018
65. Brisimi TS, Chen R, Mela T, et al: Federated learning of predictive models from federated Electronic Health Records. Int J Med Inform 112:59-67, 2018
66. Dluhoš P, Schwarz D, Cahn W, et al: Multi-center machine learning in imaging psychiatry: A meta-model approach. Neuroimage 155:10-24, 2017
67. Dhillon V, Metcalf D, Hooper M: Blockchain in health care, in Dhillon V, Metcalf D, Hooper M (eds): Blockchain Enabled Applications: Understand the
Blockchain Ecosystem and How to Make it Work for You. Berkeley, CA, Apress, 2017, pp 125-138
68. Lugan S, Desbordes P, Tormo LXR, et al: Secure architectures implementing trusted coalitions for blockchained distributed learning (TCLearn). http://arxiv.
org/abs/1906.07690
69. Nakamoto S: Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf
70. Gordon WJ, Catalini C: Blockchain technology for healthcare: Facilitating the transition to patient-driven interoperability. Comput Struct Biotechnol J
16:224-230, 2018
71. Kamel Boulos MN, Wilson JT, Clauson KA: Geospatial blockchain: Promises, challenges, and scenarios in health and healthcare. Int J Health Geogr 17:25, 2018
72. Pirtle C, Ehrenfeld J: Blockchain for healthcare: The next generation of medical records? J Med Syst 42:172, 2018
73. Zhang P, White J, Schmidt DC, et al: FHIRChain: Applying blockchain to securely and scalably share clinical data. Comput Struct Biotechnol J 16:267-278,
2018
74. Dubovitskaya A, Xu Z, Ryu S, et al: Secure and trustable electronic medical records sharing using blockchain. AMIA Annu Symp Proc 2017:650-659, 2018
75. Vruddhula S: Application of on-dose identification and blockchain to prevent drug counterfeiting. Pathog Glob Health 112:161, 2018
76. Ji Y, Zhang J, Ma J, et al: BMPLS: Blockchain-based multi-level privacy-preserving location sharing scheme for telecare medical information systems. J Med
Syst 42:147, 2018
77. Coventry L, Branley D: Cybersecurity in healthcare: A narrative review of trends, threats and ways forward. Maturitas 113:48-52, 2018
78. Jalali MS, Kaiser JP: Cybersecurity in hospitals: A systematic, organizational perspective. J Med Internet Res 20:e10059, 2018
79. Vlahović-Palčevski V, Mentzer D: Postmarketing surveillance, in Seyberth HW, Rane A, Schwab M (eds): Pediatric Clinical Pharmacology. Berlin, Springer,
2011, pp 339-351
80. Parkash R, Thibault B, Philippon F, et al: Canadian Registry of Implantable Electronic Device outcomes: Surveillance of high-voltage leads. Can J Cardiol
34:808-811, 2018
81. Ing EB, Ing R: The use of a nomogram to visually interpret a logistic regression prediction model for giant cell arteritis. Neuroophthalmology 42:284-286, 2018
82. Tirzīte M, Bukovskis M, Strazda G, et al: Detection of lung cancer with electronic nose and logistic regression analysis. J Breath Res 13: 016006, 2018
83. Ji Z, Jiang X, Wang S, et al: Differentially private distributed logistic regression using private and public data. BMC Med Genomics 7:S14, 2014 (suppl 1)
84. Jiang W, Li P, Wang S, et al: WebGLORE: A web service for Grid LOgistic REgression. Bioinformatics 29:3238-3240, 2013
85. Wang S, Jiang X, Wu Y, et al: EXpectation Propagation LOgistic REgRession (EXPLORER): Distributed privacy-preserving online model learning. J Biomed
Inform 46:480-496, 2013
86. Desai A, Chaudhary S: Distributed decision tree. Proceedings of the Ninth Annual ACM India Conference, Gandhinagar, India, ACM Press, 2016, pp 43-50
87. Caragea D, Silvescu A, Honavar V: Decision tree induction from distributed heterogeneous autonomous data sources, in Abraham A, Franke K, Köppen M
(eds): Intelligent Systems Design and Applications. Berlin, Springer, 2003, pp 341-350
88. Plaku E, Kavraki LE: Distributed computation of the knn graph for large high-dimensional point sets. J Parallel Distrib Comput 67:346-359, 2007
89. Xiong L, Chitti S, Liu L: Mining multiple private databases using a kNN classifier, in Proceedings of the 2007 ACM symposium on Applied computing – SAC ’07.
Seoul, Korea, ACM Press, 2007, p 435
90. Huang Z: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2:283-304, 1998
91. Jagannathan G, Wright RN: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, in Proceeding of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining – KDD ’05. Chicago, Illinois, USA, ACM Press, 2005, p 593
92. Jin R, Goswami A, Agrawal G: Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10:17-40, 2006
93. Jagannathan G, Pillaipakkamnatt K, Wright RN: A new privacy-preserving distributed k -clustering algorithm, in Proceedings of the 2006 SIAM International
Conference on Data Mining. Society for Industrial and Applied Mathematics, 2006, pp 494-498
94. Ye Y, Chiang C-C: A parallel apriori algorithm for frequent itemsets mining, in Fourth International Conference on Software Engineering Research, Management
and Applications (SERA’06). Seattle, WA, IEEE, 2006, pp 87-94
95. Cheung DW, Ng VT, Fu AW, et al: Efficient mining of association rules in distributed databases. IEEE Trans Knowl Data Eng 8:911-922, 1996
96. Bellman R: A Markovian decision process. Indiana Univ Math J 6:679-684, 1957
97. Puterman ML: Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, John Wiley & Sons, 2014
98. Watkins CJCH, Dayan P: Q-learning. Mach Learn 8:279-292, 1992
99. Lauer M, Riedmiller M: An algorithm for distributed reinforcement learning in cooperative multi-agent systems, in Proceedings of the Seventeenth International
Conference on Machine Learning. Burlington, MA, Morgan Kaufmann, 2000, pp 535-542. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.772
APPENDIX
Records identified through database searching (n = 127); additional records identified through other sources (n = 0). Studies included in qualitative synthesis (n = 6); studies included in quantitative synthesis (meta-analysis) (n = 6).
FIG A1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2009 flow diagram.
Abbreviations: N/A, not applicable; PICOS, participants, interventions, comparisons, outcomes, and study design; PRISMA, Preferred
Reporting Items for Systematic Reviews and Meta-Analyses.