Abstract
Real-time English speech translation is useful in many situations, including business and travel. This research aims to improve the quality of real-time English speech translation. First, filter bank (FBank) features were extracted from English speech. Then, an improved Transformer model was introduced that places a causal convolution module at the front end of the encoder to capture English speech features together with positional information. The improved model was evaluated on the MuST-C dataset by translating English speech into several target languages. Translation quality varied across target languages: the highest bilingual evaluation understudy (BLEU) score was 20.84 for Spanish text, and the lowest was 10.56 for Russian text. The average BLEU score was 18.51, with an average delay of 1202.33 ms. Compared with the conventional Transformer model, the improved model achieved higher BLEU scores and lower delay, and it performed best with a convolutional kernel size of 3 × 3. These results indicate that the improved Transformer model is reliable for real-time English speech translation and is useful in practice.
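The causal convolution idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the input is a time × frequency matrix of FBank features and that "causal" means each output frame depends only on the current and earlier frames. The function name `causal_conv2d`, the single-channel layout, the stride of 1, and the zero-padding scheme are illustrative assumptions, not details taken from the paper; only the 3 × 3 kernel size comes from the abstract.

```python
from typing import List

Matrix = List[List[float]]


def causal_conv2d(features: Matrix, kernel: Matrix) -> Matrix:
    """Apply a 2D kernel to (time x frequency) features, causal in time.

    Assumption: the time axis is zero-padded only on the past side, so the
    output at frame t depends on frames t-(kt-1) .. t and never on future
    frames. The frequency axis is padded symmetrically to keep its size.
    As is common in deep learning, this is a cross-correlation (the kernel
    is not flipped).
    """
    kt, kf = len(kernel), len(kernel[0])
    n_frames, n_bins = len(features), len(features[0])
    pad_t = kt - 1   # all time padding goes before the first frame
    pad_f = kf // 2  # symmetric padding over frequency bins

    # Build the zero-padded input.
    padded = [[0.0] * (n_bins + 2 * pad_f) for _ in range(n_frames + pad_t)]
    for t in range(n_frames):
        for f in range(n_bins):
            padded[t + pad_t][f + pad_f] = features[t][f]

    # Slide the kernel; the last kernel row lines up with the current frame.
    out = [[0.0] * n_bins for _ in range(n_frames)]
    for t in range(n_frames):
        for f in range(n_bins):
            acc = 0.0
            for i in range(kt):
                for j in range(kf):
                    acc += kernel[i][j] * padded[t + i][f + j]
            out[t][f] = acc
    return out


# Toy input: 5 frames with 4 filter-bank values each (made-up numbers).
feats = [[1.0, 2.0, 3.0, 4.0] for _ in range(5)]
kernel_3x3 = [[1.0 / 9.0] * 3 for _ in range(3)]  # simple averaging kernel
out_a = causal_conv2d(feats, kernel_3x3)

# Changing the last frame must not affect any earlier output frame.
feats_b = [row[:] for row in feats]
feats_b[4] = [100.0] * 4
out_b = causal_conv2d(feats_b, kernel_3x3)
```

Because padding is applied only on the past side, the first four rows of `out_a` and `out_b` are identical, which is the property that lets a streaming encoder emit features without waiting for future audio. A trained model would learn the kernel weights and would typically use several channels; the averaging kernel here is only a placeholder.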
Data availability
The data used and analyzed in this paper are available from the corresponding author upon reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Lei, X. Real-time translation of English speech through speech feature extraction. Artif Life Robotics 29, 410–415 (2024). https://doi.org/10.1007/s10015-024-00951-w
DOI: https://doi.org/10.1007/s10015-024-00951-w