Computer Science > Computer Vision and Pattern Recognition

arXiv:2308.08288 (cs)

[Submitted on 16 Aug 2023 (v1), last revised 19 Dec 2023 (this version, v2)]

Title:Improving Audio-Visual Segmentation with Bidirectional Generation

Authors:Dawei Hao, Yuxin Mao, Bowen He, Xiaodong Han, Yuchao Dai, Yiran Zhong

Abstract:The aim of audio-visual segmentation (AVS) is to precisely differentiate audible objects within videos down to the pixel level. Traditional approaches often tackle this challenge by combining information from various modalities, where the contribution of each modality is implicitly or explicitly modeled. Nevertheless, the interconnections between different modalities tend to be overlooked in audio-visual modeling. In this paper, inspired by the human ability to mentally simulate the sound of an object and its visual appearance, we introduce a bidirectional generation framework. This framework establishes robust correlations between an object's visual characteristics and its associated sound, thereby enhancing the performance of AVS. To achieve this, we employ a visual-to-audio projection component that reconstructs audio features from object segmentation masks and minimizes reconstruction errors. Moreover, recognizing that many sounds are linked to object movements, we introduce an implicit volumetric motion estimation module to handle temporal dynamics that may be challenging to capture using conventional optical flow methods. To showcase the effectiveness of our approach, we conduct comprehensive experiments and analyses on the widely recognized AVSBench benchmark. As a result, we establish a new state-of-the-art performance level in the AVS benchmark, particularly excelling in the challenging MS3 subset which involves segmenting multiple sound sources. To facilitate reproducibility, we plan to release both the source code and the pre-trained model.

Comments:	AAAI Camera Ready. Dawei Hao and Yuxin Mao contribute equality to this paper. Yiran Zhong is the corresponding author. The code will be released at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2308.08288 [cs.CV]
	(or arXiv:2308.08288v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.08288

Submission history

From: Yiran Zhong [view email]
[v1] Wed, 16 Aug 2023 11:20:23 UTC (891 KB)
[v2] Tue, 19 Dec 2023 07:50:23 UTC (966 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:Improving Audio-Visual Segmentation with Bidirectional Generation

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:Improving Audio-Visual Segmentation with Bidirectional Generation

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators