TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification

Authors:

Xinwang Liu

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 3365 - 3374

https://doi.org/10.1145/3581783.3611853

Published: 27 October 2023

Abstract

Audiovisual data is everywhere in this digital age, which places higher demands on the deep learning models built on it. Handling the information in multi-modal data well is the key to a better audiovisual model. We observe that audiovisual data naturally has temporal attributes, such as the time information of each frame in a video. More concretely, such data is inherently multi-modal, comprising audio and visual cues that proceed in strict chronological order. This indicates that temporal information is important in multi-modal acoustic event modeling, both within and across modalities. However, existing methods process each modality's features independently and simply fuse them together, which neglects temporal relations and thus leads to sub-optimal performance. Motivated by this, we propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac, which models such temporal information via graph learning techniques. In particular, we construct a temporal graph for each acoustic event by dividing its audio data and video data into multiple segments. Each segment is treated as a node, and the temporal relationships between nodes are treated as timestamps on their edges. In this way, we can smoothly capture both intra-modal and inter-modal dynamic information. Several experiments demonstrate that TMac outperforms other SOTA models. Our code is available at https://github.com/MGitHubL/TMac.
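
The graph construction described in the abstract (segments as nodes, temporal relationships as timestamps on edges) can be illustrated with a minimal sketch. This is not the authors' implementation; see their repository for that. The names `Node`, `split_segments`, and `build_temporal_graph` are hypothetical, and the edge rules are assumptions made for illustration: intra-modal edges link consecutive segments of one modality, and inter-modal edges link audio and video segments whose time windows overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    modality: str  # "audio" or "video"
    index: int     # position of the segment within its modality
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds


def split_segments(modality, duration, n):
    """Divide a clip of `duration` seconds into n equal-length segments."""
    step = duration / n
    return [Node(modality, i, i * step, (i + 1) * step) for i in range(n)]


def build_temporal_graph(duration, n_audio, n_video):
    """Return (nodes, edges); each edge is (source, target, timestamp)."""
    audio = split_segments("audio", duration, n_audio)
    video = split_segments("video", duration, n_video)
    edges = []

    # Intra-modal edges (assumed rule): consecutive segments of the same
    # modality, stamped with the start time of the later segment.
    for segments in (audio, video):
        for prev, nxt in zip(segments, segments[1:]):
            edges.append((prev, nxt, nxt.start))

    # Inter-modal edges (assumed rule): audio and video segments whose
    # time windows overlap, stamped with the start of the overlap.
    for a in audio:
        for v in video:
            if a.start < v.end and v.start < a.end:
                edges.append((a, v, max(a.start, v.start)))

    # Order edges chronologically, as a temporal graph model would consume them.
    edges.sort(key=lambda edge: edge[2])
    return audio + video, edges


if __name__ == "__main__":
    nodes, edges = build_temporal_graph(duration=10.0, n_audio=4, n_video=2)
    for src, dst, t in edges:
        print(f"{src.modality}[{src.index}] -> {dst.modality}[{dst.index}] at t={t}")
```

In a real pipeline each node would also carry a feature vector extracted from its audio or video segment; the sketch keeps only the timing information to show how timestamps end up on the edges.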

Index Terms

TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
  2. Information systems applications
    1. Multimedia information systems

Information & Contributors

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Key R&D Program of China
National Natural Science Foundation of China

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
