
DOI: 10.1145/3604237.3626876
research-article
Open access

FinDiff: Diffusion Models for Financial Tabular Data Generation

Published: 25 November 2023

Abstract

The sharing of microdata, such as fund holdings and derivative instruments, by regulatory institutions presents a unique challenge due to strict data confidentiality and privacy regulations. These constraints often hinder the ability of both academics and practitioners to conduct collaborative research effectively. The emergence of generative models, particularly diffusion models, capable of synthesizing data that mimics the underlying distributions of real-world data offers a compelling solution. This work introduces Financial Tabular Diffusion (FinDiff), a diffusion model designed to generate real-world mixed-type financial tabular data for a variety of downstream tasks, such as economic scenario modeling, stress testing, and fraud detection. The model uses embedding encodings to model mixed-modality financial data comprising both categorical and numeric attributes. The performance of FinDiff in generating synthetic tabular financial data is evaluated against state-of-the-art baseline models on three real-world financial datasets (two publicly available and one proprietary). Empirical results demonstrate that FinDiff generates synthetic tabular financial data with high fidelity, privacy, and utility.
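The abstract's core idea — mapping categorical attributes to embedding vectors, standardizing numeric attributes, and running a diffusion process on the concatenated continuous representation — can be sketched roughly as below. All dimensions, schedules, and variable names here are illustrative assumptions, not the paper's actual architecture or configuration; the forward noising step follows the standard DDPM formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixed-type table: one categorical and two numeric attributes.
categories = np.array([0, 2, 1, 0])          # e.g. encoded fund type per row
numerics = np.array([[1.2, 0.4],
                     [0.7, 1.9],
                     [2.3, 0.1],
                     [0.5, 0.8]])

# Embedding lookup for the categorical attribute (3 categories, dim 4).
# In a FinDiff-style model these embeddings would be learned; here they
# are random placeholders.
emb_table = rng.normal(size=(3, 4))
cat_emb = emb_table[categories]              # shape (4, 4)

# Concatenate embeddings with standardized numeric columns to obtain a
# fully continuous representation the diffusion model can noise and denoise.
num_std = (numerics - numerics.mean(0)) / numerics.std(0)
x0 = np.concatenate([cat_emb, num_std], axis=1)   # shape (4, 6)

# Forward diffusion at step t (DDPM):
#   x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
# with a linear beta schedule.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
t = 50
eps = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(x_t.shape)  # (4, 6)
```

Training would fit a denoising network to predict `eps` from `x_t` and `t`; generation then reverses the process from pure noise and decodes the categorical dimensions back to discrete values, e.g. by nearest-embedding lookup.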


Cited By

  • (2024) Imb-FinDiff: Conditional Diffusion Models for Class Imbalance Synthesis of Financial Tabular Data. Proceedings of the 5th ACM International Conference on AI in Finance, 617–625 (14 Nov 2024). https://doi.org/10.1145/3677052.3698659
  • (2024) FraudDiffuse: Diffusion-aided Synthetic Fraud Augmentation for Improved Fraud Detection. Proceedings of the 5th ACM International Conference on AI in Finance, 90–98 (14 Nov 2024). https://doi.org/10.1145/3677052.3698658
  • (2024) Entity-based Financial Tabular Data Synthesis with Diffusion Models. Proceedings of the 5th ACM International Conference on AI in Finance, 547–554 (14 Nov 2024). https://doi.org/10.1145/3677052.3698625
  • (2024) A tabular data generation framework guided by downstream tasks optimization. Scientific Reports 14:1 (3 Jul 2024). https://doi.org/10.1038/s41598-024-65777-9
  • (2024) Challenges and applications in generative AI for clinical tabular data in physiology. Pflügers Archiv - European Journal of Physiology (17 Oct 2024). https://doi.org/10.1007/s00424-024-03024-w


      Published In

      ICAIF '23: Proceedings of the Fourth ACM International Conference on AI in Finance
      November 2023
      697 pages
      ISBN:9798400702402
      DOI:10.1145/3604237
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. diffusion models
      2. financial tabular data
      3. neural networks
      4. synthetic data generation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICAIF '23

