DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator

Published: 27 May 2024

Abstract

A wonderful piece of music is the essence and soul of dance, which motivates the study of automatic music generation for dance. To create appropriate music from dance, cross-modal correlations between dance and music, such as rhythm and style, should be considered. However, existing dance-to-music methods struggle to achieve rhythmic alignment and stylistic matching simultaneously. Additionally, the diversity of generated samples is limited by the scarcity of paired data. To address these issues, we propose DanceComposer, a novel dance-to-music framework that generates rhythmically and stylistically consistent multi-track music from dance videos. DanceComposer features a Progressive Conditional Music Generator (PCMG) that gradually incorporates rhythm and style constraints, enabling both rhythmic alignment and stylistic matching. To enhance style control, we introduce a Shared Style Module (SSM) that learns cross-modal features to serve as stylistic constraints. This allows the PCMG to be trained on extensive music-only data and diversifies the generated pieces. Quantitative and qualitative results show that our method surpasses the state of the art in overall music quality, rhythmic consistency, and stylistic consistency.
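
The abstract outlines a staged conditioning scheme: rhythm constraints derived from dance motion are applied first, then stylistic constraints from the cross-modal SSM embedding. As a rough illustration of that idea only, and not the authors' implementation, the following minimal PyTorch sketch conditions an autoregressive token decoder on a per-step beat signal and a global style vector. All module names, dimensions, and the choice of additive conditioning are assumptions made for illustration.

    # Hypothetical sketch of progressive conditioning for dance-to-music
    # generation. Names, shapes, and conditioning strategy are illustrative
    # assumptions, not the DanceComposer implementation.
    import torch
    import torch.nn as nn

    class ProgressiveConditionalSketch(nn.Module):
        def __init__(self, vocab_size=512, d_model=256, n_heads=4,
                     n_layers=4, style_dim=128, max_len=1024):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)
            self.pos_emb = nn.Embedding(max_len, d_model)
            # Stage 1 (assumed): per-step rhythm signal, e.g. a
            # motion-derived beat-strength curve aligned to the token grid.
            self.rhythm_proj = nn.Linear(1, d_model)
            # Stage 2 (assumed): one global style vector, e.g. drawn from a
            # shared dance/music style embedding space.
            self.style_proj = nn.Linear(style_dim, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model,
                batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens, beat_curve, style_vec):
            # tokens:     (B, T)          previously generated music tokens
            # beat_curve: (B, T, 1)       rhythm conditioning per step
            # style_vec:  (B, style_dim)  global style conditioning
            B, T = tokens.shape
            pos = torch.arange(T, device=tokens.device)
            x = self.token_emb(tokens) + self.pos_emb(pos)
            x = x + self.rhythm_proj(beat_curve)             # rhythm first
            x = x + self.style_proj(style_vec).unsqueeze(1)  # then style
            causal = torch.triu(
                torch.ones(T, T, dtype=torch.bool, device=tokens.device), 1)
            h = self.backbone(x, mask=causal)  # causal self-attention
            return self.head(h)                # next-token logits

    # Smoke test with random placeholder inputs.
    model = ProgressiveConditionalSketch()
    logits = model(torch.randint(0, 512, (2, 16)),
                   torch.rand(2, 16, 1),
                   torch.randn(2, 128))
    assert logits.shape == (2, 16, 512)

In this hypothetical sketch, rhythm enters as a time-aligned additive bias and style as a single vector broadcast over all steps, loosely mirroring the idea of progressively layering constraints; the actual PCMG architecture, its training on music-only data, and the SSM itself are not reproduced here.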

Published In

IEEE Transactions on Multimedia, Volume 26, 2024, 10405 pages

Publisher

IEEE Press

Qualifiers

• Research-article
