Computer Science > Machine Learning

arXiv:2208.08655 (cs)

[Submitted on 18 Aug 2022 (v1), last revised 20 Jan 2023 (this version, v2)]

Title: Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV

Authors: Nicholas I-Hsien Kuo, Federico Garcia, Anders Sönnerborg, Maurizio Zazzi, Michael Böhm, Rolf Kaiser, Mark Polizzotto, Louisa Jorm, Sebastiano Barbieri

Abstract: Clinical data usually cannot be freely distributed because they are highly confidential, which hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is to generate realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse, which produces outputs of low diversity. This lowers the quality of synthetic healthcare data and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory that replays latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes the severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.
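The concern raised in the abstract, that a mode-collapsed generator may drop rare categories, can be illustrated with a simple diagnostic. The sketch below is not the authors' evaluation procedure. It is a minimal, assumed check that compares the category shares of one categorical variable in a real and a synthetic column, reporting which categories are missing from the synthetic data and the total variation distance between the two empirical distributions. The function names and the toy "regimen" values are hypothetical.

```python
from collections import Counter


def class_proportions(values):
    """Return {category: share of samples} for one categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def coverage_report(real, synthetic):
    """Compare category shares between a real and a synthetic column.

    Returns the categories absent from the synthetic data and the total
    variation distance between the two empirical distributions.
    """
    p = class_proportions(real)
    q = class_proportions(synthetic)
    missing = sorted(set(p) - set(q))
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
    return missing, tvd


# Hypothetical toy data: one variable with a severely imbalanced distribution.
real = ["A"] * 90 + ["B"] * 9 + ["C"] * 1
collapsed = ["A"] * 100                        # generator that only emits the majority class
faithful = ["A"] * 89 + ["B"] * 10 + ["C"] * 1  # generator that keeps the rare classes

missing_collapsed, tvd_collapsed = coverage_report(real, collapsed)
missing_faithful, tvd_faithful = coverage_report(real, faithful)

print("collapsed:", missing_collapsed, round(tvd_collapsed, 3))
print("faithful: ", missing_faithful, round(tvd_faithful, 3))
```

In this toy setting the collapsed output loses both minority categories, while the faithful output keeps every category and stays close to the real shares. A real evaluation would need to repeat such a check per variable and would likely add joint-distribution and privacy metrics.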

Comments:	In the near future, we will make our code and synthetic datasets publicly available to facilitate future research. Follow us on this https URL
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2208.08655 [cs.LG]
	(or arXiv:2208.08655v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2208.08655

Submission history

From: Nicholas Kuo
[v1] Thu, 18 Aug 2022 06:19:46 UTC (3,291 KB)
[v2] Fri, 20 Jan 2023 05:20:24 UTC (3,429 KB)
