DOI: 10.1145/2964284.2967239

Multimodal Learning via Exploring Deep Semantic Similarity

Published: 01 October 2016

Abstract

Deep learning is skilled at learning representations from raw data that are embedded in a semantic space. Traditional multimodal networks take advantage of this and maximize the joint distribution over the representations of different modalities. However, the similarity among these representations, an important property of multimodal data, is not emphasized. In this paper, we introduce a novel learning method for multimodal networks, named Semantic Similarity Learning (SSL), which trains the model by enhancing the similarity between the high-level features of different modalities. Experiments are conducted to evaluate the method on different multimodal networks and multiple tasks. The results demonstrate the effectiveness of SSL in preserving shared information and improving discrimination. In particular, SSL encourages each modality to learn transferred knowledge from the other when faced with missing data.
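To make the idea concrete, below is a minimal sketch (not the paper's implementation) of how a similarity term between the high-level features of two modality branches can be combined with a task loss. It assumes PyTorch, a simple two-branch MLP architecture, and cosine similarity as the similarity measure; the encoder class, dimensions, and the weighting factor alpha are all hypothetical choices for illustration.

# Hypothetical sketch of a semantic-similarity objective for two modality branches.
# The encoder architecture and the exact loss combination are assumptions, not the
# paper's reported method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Simple MLP mapping one modality's raw input to a high-level feature vector."""
    def __init__(self, in_dim, hid_dim, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def ssl_style_loss(feat_a, feat_b, logits, labels, alpha=0.5):
    """Task loss plus a term that pulls the two modalities' high-level features
    toward each other (here: cosine similarity; the paper's similarity measure
    may differ)."""
    task = F.cross_entropy(logits, labels)
    sim = F.cosine_similarity(feat_a, feat_b, dim=1).mean()
    return task + alpha * (1.0 - sim)

# Usage sketch: an audio branch and a visual branch sharing a classifier head.
audio_enc = ModalityEncoder(in_dim=100, hid_dim=256, feat_dim=64)
visual_enc = ModalityEncoder(in_dim=500, hid_dim=256, feat_dim=64)
classifier = nn.Linear(2 * 64, 10)

a = torch.randn(8, 100)            # audio batch (dummy data)
v = torch.randn(8, 500)            # visual batch (dummy data)
y = torch.randint(0, 10, (8,))     # class labels

fa, fv = audio_enc(a), visual_enc(v)
logits = classifier(torch.cat([fa, fv], dim=1))
loss = ssl_style_loss(fa, fv, logits, y)
loss.backward()

The similarity term rewards the two branches for producing aligned features for the same sample, which is one way to realize the "enhancing the similarity between high-level features" objective the abstract describes; the networks and objective used in the paper may differ.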




    Published In

    MM '16: Proceedings of the 24th ACM international conference on Multimedia
    October 2016
    1542 pages
    ISBN:9781450336031
    DOI:10.1145/2964284
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. deep learning
    2. multimodal learning
    3. semantic similarity

    Qualifiers

    • Short-paper

    Funding Sources

    • National Basic Research Program of China (973 Program)
    • National Natural Science Foundation of China
    • Fundamental Research Funds for the Central Universities
    • Open Research Fund of Key Laboratory of Spectral Imaging Technology, Chinese Academy of Sciences
    • State Key Program of National Natural Science of China

    Conference

MM '16: ACM Multimedia Conference
    October 15 - 19, 2016
    Amsterdam, The Netherlands

    Acceptance Rates

    MM '16 Paper Acceptance Rate 52 of 237 submissions, 22%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

