

Learning Music-Dance Representations Through Explicit-Implicit Rhythm Synchronization

Published: 09 August 2023

Abstract

Although audio-visual representations have proven useful in many downstream tasks, representing dancing videos, which are more specific and typically accompanied by music with complex auditory content, remains challenging and largely uninvestigated. Considering the intrinsic alignment between a dancer's cadenced movement and the music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework that synchronizes music and dance rhythms both explicitly and implicitly. Specifically, we derive dance rhythms from visual appearance and motion cues, inspired by music rhythm analysis, and temporally align these visual rhythms with their music counterparts, which are extracted from the amplitude of the sound intensity. Meanwhile, we exploit the implicit coherence of rhythms in the audio and visual streams through contrastive learning: the model learns a joint embedding by predicting the temporal consistency of audio-visual pairs. The learned music-dance representation, together with the ability to detect audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin.
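The two mechanisms summarized above (explicit rhythm extraction and implicit contrastive consistency) can be illustrated with off-the-shelf tooling. The sketch below is not the authors' implementation: it assumes librosa for the onset-strength (sound-intensity) envelope, a crude frame-difference proxy in place of the paper's appearance and optical-flow motion cues, and a PyTorch InfoNCE-style objective for the audio-visual consistency term; the function names (audio_rhythm_envelope, visual_rhythm_envelope, av_contrastive_loss) are hypothetical.

```python
# Minimal sketch of the two rhythm-synchronization ideas in the abstract.
# Assumptions: librosa and PyTorch are available; the visual branch uses a
# simple frame-difference motion envelope rather than the paper's
# appearance/optical-flow cues. Names and defaults are illustrative only.
import librosa
import numpy as np
import torch
import torch.nn.functional as F


def audio_rhythm_envelope(audio_path, hop_length=512):
    """Music rhythm proxy: onset-strength envelope and detected onset times."""
    y, sr = librosa.load(audio_path, sr=None)
    env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)
    onsets = librosa.onset.onset_detect(
        onset_envelope=env, sr=sr, hop_length=hop_length, units="time"
    )
    return env, onsets


def visual_rhythm_envelope(frames):
    """Dance rhythm proxy: normalized motion energy from grayscale frames (T, H, W)."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    return diffs / (diffs.max() + 1e-8)  # peaks approximate visual "beats"


def av_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """InfoNCE-style loss: temporally aligned audio/video clips in a batch are
    positives, every other pairing acts as a negative."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In the paper itself, the explicit branch aligns detected audio and visual rhythms in time and the implicit branch trains the joint embedding with a contrastive temporal-consistency objective; the code above only mirrors that structure with standard library calls.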


Cited By

  • (2024) DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator. IEEE Transactions on Multimedia, vol. 26, pp. 10237–10250. DOI: 10.1109/TMM.2024.3405734. Online publication date: 1-Jan-2024.


Information

Published In

IEEE Transactions on Multimedia, Volume 26, 2024 (10405 pages)

Publisher

IEEE Press

Publication History

Published: 09 August 2023

Qualifiers

  • Research-article
