Abstract
Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables handling larger image sizes while mitigating the quadratic runtime of self-attention layers, and it allows global dependencies and relationships between patches to be captured, improving image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on these observations, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network; the outputs of the modality pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors' newly collected multi-dimensional multimodal dataset, Mudestreda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
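The abstract describes the pipeline only at a high level. The sketch below illustrates how such a two-pathway design with linear patch embedding, isotropic convolutional blocks, and a temporal classifier could look in PyTorch. All module names, dimensions, the depthwise-convolution block, and the GRU temporal head are illustrative assumptions rather than the authors' implementation; see https://github.com/hubtru/Minape for the original code.

```python
# Minimal sketch of the architecture described in the abstract (assumed PyTorch design).
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Linear patch embedding of a 2-D representation (image or time-series map)."""
    def __init__(self, in_channels, dim, patch_size):
        super().__init__()
        # A strided convolution is equivalent to splitting the input into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.proj(x)                    # (B, dim, H/p, W/p)


class IsotropicBlock(nn.Module):
    """Isotropic block: channel width and spatial resolution stay constant."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.norm(self.dw(x))))


class MinapeSketch(nn.Module):
    """Two modality pathways (image, time series) fused into a temporal classifier."""
    def __init__(self, img_channels=3, ts_channels=1, dim=128, depth=8, num_classes=3):
        super().__init__()
        # One pathway per modality: patch embedding + stack of isotropic blocks.
        self.img_path = nn.Sequential(PatchEmbed(img_channels, dim, 16),
                                      *[IsotropicBlock(dim) for _ in range(depth)])
        self.ts_path = nn.Sequential(PatchEmbed(ts_channels, dim, 16),
                                     *[IsotropicBlock(dim) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # "Temporal classifier" over the fused per-step features; a GRU is an
        # assumed stand-in for whatever sequence model the paper actually uses.
        self.temporal = nn.GRU(input_size=2 * dim, hidden_size=dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images, ts_maps):        # each: (B, T, C, H, W)
        B, T = images.shape[:2]
        feats = []
        for t in range(T):
            f_img = self.pool(self.img_path(images[:, t])).flatten(1)
            f_ts = self.pool(self.ts_path(ts_maps[:, t])).flatten(1)
            feats.append(torch.cat([f_img, f_ts], dim=-1))   # merge pathway outputs
        h, _ = self.temporal(torch.stack(feats, dim=1))      # (B, T, dim)
        return self.head(h[:, -1])                            # class logits
```

With the default settings this toy model stays well under 1M parameters per pathway block stack, which is consistent in spirit with the compactness claim in the abstract, though the exact configuration reported in the paper may differ.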
Notes
- 1. All experiments were performed on a single Nvidia GTX1080Ti 12 GB GPU.
- 8. Further ablation studies on the impact of hyperparameters can be found at https://github.com/hubtru/Minape.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Truchan, H., Naumov, E., Abedin, R., Palmer, G., Ahmadi, Z. (2024). Multimodal Isotropic Neural Architecture with Patch Embedding. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14447. Springer, Singapore. https://doi.org/10.1007/978-981-99-8079-6_14
DOI: https://doi.org/10.1007/978-981-99-8079-6_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8078-9
Online ISBN: 978-981-99-8079-6
eBook Packages: Computer Science, Computer Science (R0)