Multimodal Isotropic Neural Architecture with Patch Embedding

Conference paper

Neural Information Processing (ICONIP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14447)

Abstract

Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables these models to handle larger image sizes and mitigates the quadratic runtime of self-attention layers. It also captures global dependencies and relationships between patches, supporting effective image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on these observations, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs of these pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors’ newly collected multi-dimensional multimodal dataset, Mudestreda, obtained from real industrial processing devices (code and dataset: https://github.com/hubtru/Minape).
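The pipeline the abstract describes (a per-modality linear patch embedding, an isotropic convolutional stack that keeps resolution and width fixed, and fusion of the pathway outputs into a shared classifier) can be sketched as below. This is a minimal, hypothetical PyTorch illustration of the general idea, not the authors’ Minape implementation (see the linked repository); all module names, widths, depths, and patch sizes here are assumptions, and a plain linear head stands in for the paper’s temporal classifier.

    # Hypothetical sketch only; the actual Minape code is at https://github.com/hubtru/Minape.
    import torch
    import torch.nn as nn

    class IsotropicPathway(nn.Module):
        # One modality pathway: a linear patch embedding (strided convolution
        # over non-overlapping patches) followed by isotropic blocks that keep
        # the same spatial resolution and channel width throughout.
        def __init__(self, in_channels, dim=128, patch_size=8, depth=6):
            super().__init__()
            self.embed = nn.Conv2d(in_channels, dim,
                                   kernel_size=patch_size, stride=patch_size)
            self.blocks = nn.Sequential(*[
                nn.Sequential(
                    nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise spatial mixing
                    nn.GELU(),
                    nn.BatchNorm2d(dim),
                    nn.Conv2d(dim, dim, 1),                         # pointwise channel mixing
                    nn.GELU(),
                    nn.BatchNorm2d(dim),
                )
                for _ in range(depth)
            ])
            self.pool = nn.AdaptiveAvgPool2d(1)

        def forward(self, x):
            x = self.embed(x)               # (B, dim, H/ps, W/ps)
            x = self.blocks(x)              # isotropic: shape unchanged
            return self.pool(x).flatten(1)  # (B, dim)

    class TwoPathwayClassifier(nn.Module):
        # An image pathway plus a pathway for time series rendered as 2-D maps;
        # the pooled features are concatenated and classified by a linear head.
        def __init__(self, num_classes, dim=128):
            super().__init__()
            self.image_path = IsotropicPathway(in_channels=3, dim=dim)
            self.series_path = IsotropicPathway(in_channels=1, dim=dim)
            self.head = nn.Linear(2 * dim, num_classes)

        def forward(self, image, series_2d):
            fused = torch.cat([self.image_path(image),
                               self.series_path(series_2d)], dim=1)
            return self.head(fused)

    model = TwoPathwayClassifier(num_classes=10)
    logits = model(torch.randn(2, 3, 224, 224),  # batch of RGB images
                   torch.randn(2, 1, 64, 64))    # time series as 1-channel 2-D maps

With these illustrative settings the sketch stays well under 1M parameters, which is consistent with the scale the abstract reports, though the real architecture will differ in its blocks and fusion details.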


Notes

  1. All experiments were performed on a single Nvidia GTX 1080 Ti 12 GB GPU.

  2. https://github.com/martinsbruveris/tensorflow-image-models.

  3. https://github.com/locuslab/TCN.

  4. https://github.com/skywaLKer518/MultiplicativeMultimodal.

  5. https://github.com/idearibosome/embracenet.

  6. https://github.com/keras-team/keras-io/blob/master/examples/vision/vit_small_ds.py.

  7. https://github.com/keras-team/keras-io/blob/master/examples/vision/vit_small_ds.py.

  8. Further ablation studies on the impact of hyperparameters can be found at https://github.com/hubtru/Minape.


Author information


Corresponding author

Correspondence to Zahra Ahmadi.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Truchan, H., Naumov, E., Abedin, R., Palmer, G., Ahmadi, Z. (2024). Multimodal Isotropic Neural Architecture with Patch Embedding. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14447. Springer, Singapore. https://doi.org/10.1007/978-981-99-8079-6_14


  • DOI: https://doi.org/10.1007/978-981-99-8079-6_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8078-9

  • Online ISBN: 978-981-99-8079-6

  • eBook Packages: Computer Science (R0)
