
Video-to-Music Recommendation Using Temporal Alignment of Segments

Published: 01 January 2023

Abstract

We study the cross-modal recommendation of music tracks to be used as soundtracks for videos, a problem known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. Beyond content adequacy, structural adequacy is crucial in music supervision to obtain relevant recommendations. We propose a novel structure-aware recommendation approach that significantly improves the system's performance. The core idea is to train and run inference on shorter segments rather than on the full audio-video clips. We find that using semantic segments and ranking the tracks according to sequence alignment costs significantly improves the results, and we investigate the impact of different ranking metrics and segmentation methods.
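The abstract compresses the key mechanism: segment both the video and each candidate track, embed the segments in a shared space learned by the self-supervised system, and rank tracks by the cost of aligning the two segment sequences. The sketch below illustrates that ranking step under stated assumptions: segment embeddings are precomputed, substitution cost is cosine distance, and gaps carry a linear penalty. The Needleman-Wunsch-style recursion and all names here are illustrative, not the paper's exact formulation.

import numpy as np

def alignment_cost(video_segs, music_segs, gap=1.0):
    # Illustrative sketch, not the authors' implementation: global
    # (Needleman-Wunsch-style) alignment cost between two sequences of
    # segment embeddings, with cosine distance as substitution cost.
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    n, m = len(video_segs), len(music_segs)
    D = np.zeros((n + 1, m + 1))
    D[1:, 0] = np.arange(1, n + 1) * gap  # leading video segments left unmatched
    D[0, 1:] = np.arange(1, m + 1) * gap  # leading music segments left unmatched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(
                D[i - 1, j - 1] + cos_dist(video_segs[i - 1], music_segs[j - 1]),
                D[i - 1, j] + gap,  # video segment i left unmatched
                D[i, j - 1] + gap,  # music segment j left unmatched
            )
    return D[n, m]

def rank_tracks(video_segs, catalog):
    # `catalog` is a hypothetical dict mapping track ids to their
    # segment-embedding sequences; lower cost = better structural match.
    costs = {tid: alignment_cost(video_segs, segs) for tid, segs in catalog.items()}
    return sorted(costs, key=costs.get)

In this form the full clip is never compared as a single vector: a track rises in the ranking only if its segment structure can be aligned to the video's segment sequence at low cost, which is what makes the recommendation structure-aware.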

Information & Contributors

Information

Published In

cover image IEEE Transactions on Multimedia
IEEE Transactions on Multimedia  Volume 25, Issue
2023
8932 pages

Publisher

IEEE Press

Qualifiers

  • Research-article

Cited By

  • Deconfounded Cross-modal Matching for Content-based Micro-video Background Music Recommendation. ACM Transactions on Intelligent Systems and Technology, 15(3):1-25, 2024. https://doi.org/10.1145/3650042
  • Category-based and Popularity-guided Video Game Recommendation: A Balance-oriented Framework. Proceedings of the ACM Web Conference 2024, pp. 3734-3744, 2024. https://doi.org/10.1145/3589334.3645573
  • DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator. IEEE Transactions on Multimedia, 26:10237-10250, 2024. https://doi.org/10.1109/TMM.2024.3405734
  • Drawlody: Sketch-Based Melody Creation With Enhanced Usability and Interpretability. IEEE Transactions on Multimedia, 26:7074-7088, 2024. https://doi.org/10.1109/TMM.2024.3360695
  • YuYin: a multi-task learning model of multi-modal e-commerce background music recommendation. EURASIP Journal on Audio, Speech, and Music Processing, 2023(1), 2023. https://doi.org/10.1186/s13636-023-00306-6
