Abstract
Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables handling larger image sizes while mitigating the quadratic runtime of self-attention layers, and it allows global dependencies and relationships between patches to be captured, improving image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on these observations, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network; the outputs of the modality pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors' newly collected multi-dimensional multimodal dataset, Mudestreda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
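The abstract describes the pipeline only at a high level. The sketch below illustrates how such a two-pathway design with linear patch embedding, isotropic convolutional blocks, and a temporal classifier could look in PyTorch. All module names, dimensions, the depthwise-convolution block, and the GRU temporal head are illustrative assumptions rather than the authors' implementation; see https://github.com/hubtru/Minape for the original code.

```python
# Minimal sketch of the architecture described in the abstract (assumed PyTorch design).
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Linear patch embedding of a 2-D representation (image or time-series map)."""
    def __init__(self, in_channels, dim, patch_size):
        super().__init__()
        # A strided convolution is equivalent to splitting the input into
        # non-overlapping patches and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.proj(x)                    # (B, dim, H/p, W/p)


class IsotropicBlock(nn.Module):
    """Isotropic block: channel width and spatial resolution stay constant."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.norm(self.dw(x))))


class MinapeSketch(nn.Module):
    """Two modality pathways (image, time series) fused into a temporal classifier."""
    def __init__(self, img_channels=3, ts_channels=1, dim=128, depth=8, num_classes=3):
        super().__init__()
        # One pathway per modality: patch embedding + stack of isotropic blocks.
        self.img_path = nn.Sequential(PatchEmbed(img_channels, dim, 16),
                                      *[IsotropicBlock(dim) for _ in range(depth)])
        self.ts_path = nn.Sequential(PatchEmbed(ts_channels, dim, 16),
                                     *[IsotropicBlock(dim) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # "Temporal classifier" over the fused per-step features; a GRU is an
        # assumed stand-in for whatever sequence model the paper actually uses.
        self.temporal = nn.GRU(input_size=2 * dim, hidden_size=dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images, ts_maps):        # each: (B, T, C, H, W)
        B, T = images.shape[:2]
        feats = []
        for t in range(T):
            f_img = self.pool(self.img_path(images[:, t])).flatten(1)
            f_ts = self.pool(self.ts_path(ts_maps[:, t])).flatten(1)
            feats.append(torch.cat([f_img, f_ts], dim=-1))   # merge pathway outputs
        h, _ = self.temporal(torch.stack(feats, dim=1))      # (B, T, dim)
        return self.head(h[:, -1])                            # class logits
```

With the default settings this toy model stays well under 1M parameters per pathway block stack, which is consistent in spirit with the compactness claim in the abstract, though the exact configuration reported in the paper may differ.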
Notes
- 1. All experiments were performed on a single Nvidia GTX1080Ti 12 GB GPU.
- 8. Further ablation studies on the impact of hyperparameters can be found at https://github.com/hubtru/Minape.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Truchan, H., Naumov, E., Abedin, R., Palmer, G., Ahmadi, Z. (2024). Multimodal Isotropic Neural Architecture with Patch Embedding. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14447. Springer, Singapore. https://doi.org/10.1007/978-981-99-8079-6_14
DOI: https://doi.org/10.1007/978-981-99-8079-6_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8078-9
Online ISBN: 978-981-99-8079-6
eBook Packages: Computer Science, Computer Science (R0)