
DOI: 10.1145/3604237.3626876
research-article
Open access

FinDiff: Diffusion Models for Financial Tabular Data Generation

Published: 25 November 2023

Abstract

The sharing of microdata, such as fund holdings and derivative instruments, by regulatory institutions presents a unique challenge due to strict data confidentiality and privacy regulations. These constraints often hinder the ability of both academics and practitioners to conduct collaborative research effectively. The emergence of generative models, particularly diffusion models, capable of synthesizing data that mimics the underlying distributions of real-world data offers a compelling solution. This work introduces Financial Tabular Diffusion (FinDiff), a diffusion model designed to generate real-world mixed-type financial tabular data for a variety of downstream tasks, such as economic scenario modeling, stress testing, and fraud detection. The model uses embedding encodings to model mixed-modality financial data comprising both categorical and numeric attributes. The performance of FinDiff in generating synthetic tabular financial data is evaluated against state-of-the-art baseline models on three real-world financial datasets (two publicly available and one proprietary). Empirical results demonstrate that FinDiff generates synthetic tabular financial data with high fidelity, privacy, and utility.
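The abstract's core idea — mapping categorical attributes to embedding vectors, standardizing numeric attributes, and running a diffusion process on the concatenated continuous representation — can be sketched roughly as below. All dimensions, schedules, and variable names here are illustrative assumptions, not the paper's actual architecture or configuration; the forward noising step follows the standard DDPM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixed-type table: one categorical and two numeric attributes.
categories = np.array([0, 2, 1, 0])          # e.g. encoded fund type per row
numerics = np.array([[1.2, 0.4],
                     [0.7, 1.9],
                     [2.3, 0.1],
                     [0.5, 0.8]])

# Embedding lookup for the categorical attribute (3 categories, dim 4).
# In a FinDiff-style model these embeddings would be learned; here they
# are random placeholders.
emb_table = rng.normal(size=(3, 4))
cat_emb = emb_table[categories]              # shape (4, 4)

# Concatenate embeddings with standardized numeric columns to obtain a
# fully continuous representation the diffusion model can noise and denoise.
num_std = (numerics - numerics.mean(0)) / numerics.std(0)
x0 = np.concatenate([cat_emb, num_std], axis=1)   # shape (4, 6)

# Forward diffusion at step t (DDPM):
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
# with a linear beta schedule.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
t = 50
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(x_t.shape)  # (4, 6)
```

Training would fit a denoising network to predict `eps` from `x_t` and `t`; generation then reverses the process from pure noise and decodes the categorical dimensions back to discrete values, e.g. by nearest-embedding lookup.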


Cited By

  • (2024) Imb-FinDiff: Conditional Diffusion Models for Class Imbalance Synthesis of Financial Tabular Data. Proceedings of the 5th ACM International Conference on AI in Finance, 617–625 (14 Nov 2024). https://doi.org/10.1145/3677052.3698659
  • (2024) FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection. Proceedings of the 5th ACM International Conference on AI in Finance, 90–98 (14 Nov 2024). https://doi.org/10.1145/3677052.3698658
  • (2024) Entity-based Financial Tabular Data Synthesis with Diffusion Models. Proceedings of the 5th ACM International Conference on AI in Finance, 547–554 (14 Nov 2024). https://doi.org/10.1145/3677052.3698625
  • (2024) A tabular data generation framework guided by downstream tasks optimization. Scientific Reports 14:1 (3 Jul 2024). https://doi.org/10.1038/s41598-024-65777-9
  • (2024) Challenges and applications in generative AI for clinical tabular data in physiology. Pflügers Archiv - European Journal of Physiology (17 Oct 2024). https://doi.org/10.1007/s00424-024-03024-w


      Published In

      ICAIF '23: Proceedings of the Fourth ACM International Conference on AI in Finance
      November 2023
      697 pages
      ISBN:9798400702402
      DOI:10.1145/3604237
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. diffusion models
      2. financial tabular data
      3. neural networks
      4. synthetic data generation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICAIF '23

