Computer Science > Computer Vision and Pattern Recognition

arXiv:2310.07517 (cs)

[Submitted on 11 Oct 2023]

Title: CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing

Authors: Yaru Chen, Ruohao Guo, Xubo Liu, Peipei Wu, Guangyao Li, Zhenbo Li, Wenwu Wang

Abstract: Audio-visual video parsing is the task of categorizing a video at the segment level using only weak labels, and predicting whether the events in each segment are audible or visible. Recent methods for this task use attention mechanisms to capture semantic correlations across the whole video and between the audio and visual modalities. However, these approaches overlook the importance of individual segments within a video and the relationships among them, and they tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which learns fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representations of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.
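
As an illustration of the general idea the abstract describes, attention over the segments of one modality combined with attention to the segments of the other, the following is a minimal sketch in plain Python. It is not the CM-PIE architecture: the abstract does not specify the layers, so the function names (`attend`, `cross_modal_fuse`), the use of scaled dot-product attention, and the residual sum used for fusion are all assumptions made for this example. Each modality is assumed to be given as a list of per-segment feature vectors of equal dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention of each query segment over all key segments.

    queries:     list of per-segment feature vectors
    keys_values: list of per-segment feature vectors, used as keys and values
    Returns (outputs, weights); each row of weights sums to 1.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        w = softmax(scores)
        # Weighted average of the value vectors, one feature at a time.
        out = [
            sum(w[j] * keys_values[j][i] for j in range(len(keys_values)))
            for i in range(dim)
        ]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def cross_modal_fuse(audio, visual):
    """Hypothetical fusion step (an assumption, not the paper's block).

    Each segment keeps its own features and adds (1) attention over segments
    of the same modality and (2) attention over segments of the other one.
    """
    a_self, _ = attend(audio, audio)
    a_cross, a_weights = attend(audio, visual)
    v_self, _ = attend(visual, visual)
    v_cross, v_weights = attend(visual, audio)
    fused_audio = [
        [x + s + c for x, s, c in zip(xa, sa, ca)]
        for xa, sa, ca in zip(audio, a_self, a_cross)
    ]
    fused_visual = [
        [x + s + c for x, s, c in zip(xv, sv, cv)]
        for xv, sv, cv in zip(visual, v_self, v_cross)
    ]
    return fused_audio, fused_visual, a_weights, v_weights


if __name__ == "__main__":
    # Three segments with made-up 2-dimensional features per modality.
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    fused_audio, fused_visual, a_w, v_w = cross_modal_fuse(audio, visual)
    for t, row in enumerate(a_w):
        print("audio segment", t, "attends to visual segments with weights", row)
```

The attention weights returned alongside the fused features show, for every segment, how strongly it draws on each segment of the other modality, which is the kind of segment-level, inter-modal interaction the abstract refers to. A real model would use learned projections and be trained under the weak video-level labels.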

Comments:	5 pages, 3 figures, 15 references
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
MSC classes:	I.2.10, I.4.8
Cite as:	arXiv:2310.07517 [cs.CV]
	(or arXiv:2310.07517v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.07517

Submission history

From: Yaru Chen
[v1] Wed, 11 Oct 2023 14:15:25 UTC (2,651 KB)
