ASRU 2021: Cartagena, Colombia
- IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021. IEEE 2021, ISBN 978-1-6654-3739-4
- Christian Huber, Juan Hussain, Sebastian Stüker, Alexander Waibel:
Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition. 1-7 - Maxime Burchi, Valentin Vielzeuf:
Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition. 8-15 - Florian Boyer, Yusuke Shinohara, Takaaki Ishii, Hirofumi Inaguma, Shinji Watanabe:
A Study of Transducer Based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies. 16-23 - Norbert Braunschweiler, Rama Doddipatla, Simon Keizer, Svetlana Stoyanchev:
A Study on Cross-Corpus Speech Emotion Recognition and Data Augmentation. 24-30 - Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi:
Detecting Emotion Carriers by Combining Acoustic and Lexical Representations. 31-38 - Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak:
Beyond Isolated Utterances: Conversational Emotion Recognition. 39-46 - Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe:
A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation. 47-54 - Fu-An Chao, Shao-Wei Fan-Jiang, Bi-Cheng Yan, Jeih-weih Hung, Berlin Chen:
TENET: A Time-Reversal Enhancement Network for Noise-Robust ASR. 55-61 - Liqiang He, Shulin Feng, Dan Su, Dong Yu:
Latency-Controlled Neural Architecture Search for Streaming Speech Recognition. 62-67 - Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:
Data Augmentation for ASR Using TTS Via a Discrete Representation. 68-75 - Keqi Deng, Songjun Cao, Yike Zhang, Long Ma:
Improving Hybrid CTC/Attention End-to-End Speech Recognition with Pretrained Acoustic and Language Models. 76-82 - Linchen Zhu, Wenjie Liu, Linquan Liu, Edward Lin:
Improving ASR Error Correction Using N-Best Hypotheses. 83-89 - Prachi Singh, Sriram Ganapathy:
Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization. 90-97 - Shota Horiguchi, Shinji Watanabe, Paola García, Yawen Xue, Yuki Takashima, Yohei Kawaguchi:
Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. 98-105 - Yi Ma, Kong Aik Lee, Ville Hautamäki, Haizhou Li:
PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation Extraction. 106-113 - Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, Hosana Kamiyama, Yusuke Ijima:
Robust Speech-Age Estimation Using Local Maximum Mean Discrepancy Under Mismatched Recording Conditions. 114-121 - Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Chang Zeng, Jianwu Dang:
DeepLip: A Benchmark for Deep Learning-Based Audio-Visual Lip Biometrics. 122-129 - Jeong-Hwan Choi, Joon-Young Yang, Joon-Hyuk Chang:
Short-Utterance Embedding Enhancement Method Based on Time Series Forecasting Technique for Text-Independent Speaker Verification. 130-137 - Yan Gao, Titouan Parcollet, Nicholas D. Lane:
Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. 138-145 - Biel Tura, Santiago Escuder, Ferran Diego, Carlos Segura, Jordi Luque:
Efficient Keyword Spotting by Capturing Long-Range Interactions with Temporal Lambda Networks. 146-153 - Mohan Li, Rama Doddipatla:
Improving HS-DACS Based Streaming Transformer ASR with Deep Reinforcement Learning. 154-161 - Xianrui Zheng, Chao Zhang, Philip C. Woodland:
Adapting GPT, GPT-2 and BERT Language Models for Speech Recognition. 162-168 - Jakob Poncelet, Hugo Van hamme:
Comparison of Self-Supervised Speech Pre-Training Methods on Flemish Dutch. 169-176 - Timo Lohrenz, Patrick Schwarz, Zhengyang Li, Tim Fingscheidt:
Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition. 177-184 - Xuechen Liu, Md. Sahidullah, Tomi Kinnunen:
Optimized Power Normalized Cepstral Coefficients Towards Robust Deep Speaker Verification. 185-190 - Pierre Champion, Thomas Thebaud, Gaël Le Lan, Anthony Larcher, Denis Jouvet:
On the Invertibility of a Voice Privacy System Using Embedding Alignment. 191-197 - Jingyu Li, Si Ioi Ng, Tan Lee:
Improving Text-Independent Speaker Verification with Auxiliary Speakers Using Graph. 198-205 - Li Zhang, Qing Wang, Lei Xie:
Duality Temporal-Channel-Frequency Attention Enhanced Speaker Representation Learning. 206-213 - Fangyuan Wang, Zhigang Song, Hongchen Jiang, Bo Xu:
MACCIF-TDNN: Multi Aspect Aggregation of Channel and Context Interdependence Features in TDNN-Based Speaker Verification. 214-219 - Zhuo Li, Ce Fang, Runqiu Xiao, Wenchao Wang, Yonghong Yan:
SI-Net: Multi-Scale Context-Aware Convolutional Block for Speaker Verification. 220-227 - Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-Wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe:
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition. 228-235 - Dhanush Bekal, Ashish Shenoy, Monica Sunkara, Sravan Bodapati, Katrin Kirchhoff:
Remember the Context! ASR Slot Error Correction Through Memorization. 236-243 - Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu:
w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training. 244-250 - Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro J. Moreno:
Injecting Text in Self-Supervised Speech Pretraining. 251-258 - Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:
TS-RIR: Translated Synthetic Room Impulse Responses for Speech Augmentation. 259-266 - Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, Hermann Ney:
On Architectures and Training for Raw Waveform Feature Extraction in ASR. 267-274 - Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw:
Multi-User Voicefilter-Lite via Attentive Speaker Embedding. 275-282 - Midia Yousefi, John H. L. Hansen:
Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition. 283-288 - Szu-Jui Chen, Wei Xia, John H. L. Hansen:
Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora. 289-295 - Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:
A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio. 296-303 - Tom O'Malley, Arun Narayanan, Quan Wang, Alex Park, James Walker, Nathan Howard:
A Conformer-Based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation. 304-311 - Arun Narayanan, Chung-Cheng Chiu, Tom O'Malley, Quan Wang, Yanzhang He:
Cross-Attention Conformer for Context Modeling in Speech Enhancement for ASR. 312-319 - Li Fu, Xiaoxiao Li, Libo Zi, Zhengchen Zhang, Youzheng Wu, Xiaodong He, Bowen Zhou:
Incremental Learning for End-to-End Automatic Speech Recognition. 320-327 - Fan Yu, Haoneng Luo, Pengcheng Guo, Yuhao Liang, Zhuoyuan Yao, Lei Xie, Yingying Gao, Leijing Hou, Shilei Zhang:
Boundary and Context Aware Training for CIF-Based Non-Autoregressive End-to-End ASR. 328-334 - Jing Zhao, Gui-Xin Shi, Guan-Bo Wang, Wei-Qiang Zhang:
Automatic Speech Recognition for Low-Resource Languages: The Thuee Systems for the IARPA Openasr20 Evaluation. 335-341 - Chandran Savithri Anoop, Prathosh A. P., A. G. Ramakrishnan:
Unsupervised Domain Adaptation Schemes for Building ASR in Low-Resource Languages. 342-349 - Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda:
Multimodal Emotion Recognition with High-Level Speech and Text Features. 350-357 - Zhi Zhu, Yoshinao Sato:
Speech Emotion Recognition Using Semi-Supervised Learning with Efficient Labeling Strategies. 358-365 - Jin Li, Nan Yan, Lan Wang:
Unsupervised Cross-Lingual Speech Emotion Recognition Using Pseudo Multilabel. 366-373 - Shi-wook Lee:
Ensemble of Domain Adversarial Neural Networks for Speech Emotion Recognition. 374-379 - Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara:
ASR Rescoring and Confidence Estimation with Electra. 380-387 - Sachin Singh, Ashutosh Gupta, Aman Maghan, Dhananjaya Gowda, Shatrughan Singh, Chanwoo Kim:
Comparative Study of Different Tokenization Strategies for Streaming End-to-End ASR. 388-394 - Dhananjaya Gowda, Abhinav Garg, Jiyeon Kim, Mehul Kumar, Sachin Singh, Ashutosh Gupta, Ankur Kumar, Nauman Dawalatabad, Aman Maghan, Shatrughan Singh, Chanwoo Kim:
HiTNet: Byte-to-BPE Hierarchical Transcription Network for End-to-End Speech Recognition. 395-402 - Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim:
Two-Pass End-to-End ASR Model Compression. 403-410 - Qinglin Zhang, Qian Chen, Yali Li, Jiaqing Liu, Wen Wang:
Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation. 411-418 - Bidisha Sharma, Maulik C. Madhavi, Xuehao Zhou, Haizhou Li:
Exploring Teacher-Student Learning Approach for Multi-Lingual Speech-to-Intent Classification. 419-426 - Tan Liu, Wu Guo:
Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features. 427-432 - Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura:
Hierarchical Knowledge Distillation for Dialogue Sequence Labeling. 433-440 - Jaeyun Song, Hajin Shim, Eunho Yang:
Learning How Long to Wait: Adaptively-Constrained Monotonic Multihead Attention for Streaming ASR. 441-448 - Wei Liu, Tan Lee:
Utterance-Level Neural Confidence Measure for End-to-End Children Speech Recognition. 449-456 - Kiran Praveen, Hardik B. Sailor, Abhishek Pandey:
Warped Ensembles: A Novel Technique for Improving CTC Based End-to-End Speech Recognition. 457-464 - Shun-Po Chuang, Heng-Jui Chang, Sung-Feng Huang, Hung-yi Lee:
Non-Autoregressive Mandarin-English Code-Switching Speech Recognition. 465-472 - Ashutosh Gupta, Aditya Jayasimha, Aman Maghan, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim:
Voice to Action: Spoken Language Understanding for Memory-Constrained Systems. 473-479 - Jen-Tzung Chien, Chih-Jung Tsai:
Variational Sequential Modeling, Learning and Understanding. 480-486 - Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe:
Attention-Based Multi-Hypothesis Fusion for Speech Summarization. 487-494 - Koichiro Ito, Masaki Murata, Tomohiro Ohno, Shigeki Matsubara:
Estimating the Generation Timing of Responsive Utterances by Active Listeners of Spoken Narratives. 495-502 - Feng-Ju Chang, Jing Liu, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo, Ariya Rastrow, Siegfried Kunzmann:
Context-Aware Transformer Transducer for Speech Recognition. 503-510 - Suwa Xu, Jinwon Lee, Jim Steele:
PSVD: Post-Training Compression of LSTM-Based RNN-T Models. 511-517 - Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed:
Kaizen: Continuously Improving Teacher Using Exponential Moving Average for Semi-Supervised Speech Recognition. 518-525 - Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong:
On Addressing Practical Challenges for RNN-Transducer. 526-533 - Felix Weninger, Marco Gaudesi, Ralf Leibold, Roberto Gemello, Puming Zhan:
Dual-Encoder Architecture with Encoder Selection for Joint Close-Talk and Far-Talk Speech Recognition. 534-540 - Mohammad Omar Khursheed, Christin Jose, Rajath Kumar, Gengshen Fu, Brian Kulis, Santosh Kumar Cheekatmalla:
Tiny-CRNN: Streaming Wakeword Detection in a Low Footprint Setting. 541-547 - Shaojin Ding, Ye Jia, Ke Hu, Quan Wang:
Textual Echo Cancellation. 548-555 - Daniel Escobar-Grisales, Cristian D. Ríos-Urrego, Diego Alexander Lopez-Santander, Jeferson David Gallo-Aristizábal, Juan Camilo Vásquez-Correa, Elmar Nöth, Juan Rafael Orozco-Arroyave:
Colombian Dialect Recognition Based on Information Extracted from Speech and Text Signals. 556-563 - Yangyang Xia, Buye Xu, Anurag Kumar:
Incorporating Real-World Noisy Speech in Neural-Network-Based Speech Enhancement Systems. 564-570 - Takuya Higuchi, Anmol Gupta, Chandra Dhir:
Multi-Task Learning with Cross Attention for Keyword Spotting. 571-578 - Xinhao Wang, Christopher Hamill:
Automatic Generation of Diagnostic Content Feedback in Spoken Language Learning and Assessment. 579-586 - Thomas Schaaf, Longxiang Zhang, Alireza Bayestehtashk, Mark C. Fuhs, Shahid Durrani, Susanne Burger, Monika Woszczyna, Thomas Polzin:
Are You Dictating to Me? Detecting Embedded Dictations in Doctor-Patient Conversations. 587-593 - Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li:
Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer. 594-601 - Mengxin Chai, Shaotong Guo, Cheng Gong, Longbiao Wang, Jianwu Dang, Ju Zhang:
Learning Language and Speaker Information for Code-Switch Speech Synthesis with Limited Data. 602-609 - Takuma Okamoto, Tomoki Toda, Hisashi Kawai:
Multi-Stream HiFi-GAN with Data-Driven Waveform Decomposition. 610-617 - Sergey Nikonorov, Berrak Sisman, Mingyang Zhang, Haizhou Li:
DEEPA: A Deep Neural Analyzer for Speech and Singing Vocoding. 618-625 - Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, Tan Lee:
EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion. 626-633 - Raymond Chung, Brian Mak:
On-The-Fly Data Augmentation for Text-to-Speech Style Transfer. 634-641 - Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda:
On Prosody Modeling for ASR+TTS Based Voice Conversion. 642-649 - Ming-Chi Yen, Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Shu-Wei Tsai, Yu Tsao, Tomoki Toda, Jyh-Shing Roger Jang, Hsin-Min Wang:
Mandarin Electrolaryngeal Speech Voice Conversion with Sequence-to-Sequence Modeling. 650-657 - Jiangyu Han, Wei Rao, Yanhua Long, Jiaen Liang:
Attention-Based Scaling Adaptation for Target Speech Extraction. 658-662 - Huiyu Shi, Xi Chen, Tianlong Kong, Shouyi Yin, Peng Ouyang:
GLMSnet: Single Channel Speech Separation Framework in Noisy and Reverberant Environments. 663-670 - Lu Zhang, Chenxing Li, Feng Deng, Xiaorui Wang:
Multi-Task Audio Source Separation. 671-678 - Wei Rao, Yihui Fu, Yanxin Hu, Xin Xu, Yvkai Jv, Jiangyu Han, Zhongjie Jiang, Lei Xie, Yannan Wang, Shinji Watanabe, Zheng-Hua Tan, Hui Bu, Tao Yu, Shidong Shang:
ConferencingSpeech Challenge: Towards Far-Field Multi-Channel Speech Enhancement for Video Conferencing. 679-686 - Khaled Hechmi, Trung Ngo Trong, Ville Hautamäki, Tomi Kinnunen:
Voxceleb Enrichment for Age and Gender Recognition. 687-693 - Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa, Carlos Segura:
Enabling Zero-Shot Multilingual Spoken Language Translation with Language-Specific Encoders and Decoders. 694-701 - Neil Zeghidour, Olivier Teboul, David Grangier:
Dive: End-to-End Speech Diarization Via Iterative Speaker Embedding. 702-709 - Damien Ronssin, Milos Cernak:
AC-VC: Non-Parallel Low Latency Phonetic Posteriorgrams Based Voice Conversion. 710-716 - Marvin Borsdorf, Haizhou Li, Tanja Schultz:
Target Language Extraction at Multilingual Cocktail Parties. 717-724 - Jose Antonio Lopez Saenz, Md Asif Jalal, Rosanna Milner, Thomas Hain:
Attention Based Model for Segmental Pronunciation Error Detection. 725-732 - Elizabeth Salesky, Julian Mäder, Severin Klinger:
Assessing Evaluation Metrics for Speech-to-Speech Translation. 733-740 - Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng:
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion. 741-748 - Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari:
Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network. 749-756 - Björn Plüster, Cornelius Weber, Leyuan Qu, Stefan Wermter:
Hearing Faces: Target Speaker Text-to-Speech Synthesis from a Face. 757-764 - Bhagyashree Mukherjee, Anusha Prakash, Hema A. Murthy:
Analysis of Conversational Speech with Application to Voice Adaptation. 765-772 - Ruolan Liu, Xue Wen, Chunhui Lu, Liming Song, June Sig Sung:
Vibrato Learning in Multi-Singer Singing Voice Synthesis. 773-779 - Guangzhi Sun, Chao Zhang, Philip C. Woodland:
Tree-Constrained Pointer Generator for End-to-End Contextual Speech Recognition. 780-787 - Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney:
Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures. 788-795 - Dmitriy Serdyuk, Otavio Braga, Olivier Siohan:
Audio-Visual Speech Recognition is Worth 32×32×8 Voxels. 796-802 - Andrea Carmantini, Steve Renals, Peter Bell:
Leveraging Linguistic Knowledge for Accent Robustness of End-to-End Models. 803-810 - Abbas Khosravani, Philip N. Garner, Alexandros Lararidis:
An Evaluation Benchmark for Automatic Speech Recognition of German-English Code-Switching. 811-816 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:
Learning to Translate Low-Resourced Swiss German Dialectal Speech into Standard German Text. 817-823 - Marco Gaudesi, Felix Weninger, Dushyant Sharma, Puming Zhan:
ChannelAugment: Improving Generalization of Multi-Channel ASR by Training with Input Channel Randomization. 824-829 - Chia-Yu Li, Ngoc Thang Vu:
Improving Speech Recognition on Noisy Speech via Speech Enhancement with Multi-Discriminators CycleGAN. 830-836 - Kai Wei, Thanh Tran, Feng-Ju Chang, Kanthashree Mysore Sathyendra, Thejaswi Muniyappa, Jing Liu, Anirudh Raju, Ross McGowan, Nathan Susanj, Ariya Rastrow, Grant P. Strimel:
Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding. 837-844 - Zheng Gao, Mohamed Abdelhady, Radhika Arava, Xibin Gao, Qian Hu, Wei Xiao, Thahir Mohamed:
X-SHOT: Learning to Rank Voice Applications Via Cross-Locale Shard-Based Co-Training. 845-852 - Akshat Gupta, Olivia Deng, Akruti Kushwaha, Saloni Mittal, William Zeng, Sai Krishna Rallabandi, Alan W. Black:
Intent Recognition and Unsupervised Slot Identification for Low-Resourced Spoken Dialog Systems. 853-860 - Kishan Sachdeva, Joshua Maynez, Olivier Siohan:
Action Item Detection in Meetings Using Pretrained Transformers. 861-868 - Joo-Kyung Kim, Guoyin Wang, Sungjin Lee, Young-Bum Kim:
Deciding Whether to Ask Clarifying Questions in Large-Scale Spoken Language Understanding. 869-876 - Guan-Lin Chao, Ian R. Lane:
Human-Agent Collaboration Strategies for Vision-Grounded Instruction Following. 877-884 - Binghuai Lin, Liyuan Wang:
Uncertainty-Aware Pseudo-Labeling for Spoken Language Assessment. 885-891 - Xuan Ji, Lu Lu, Fuming Fang, Jianbo Ma, Lei Zhu, Jinke Li, Dongdi Zhao, Ming Liu, Feijun Jiang:
An End-to-End Far-Field Keyword Spotting System with Neural Beamforming. 892-899 - Rohith Aralikatti, Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:
Improving Reverberant Speech Separation with Synthetic Room Impulse Responses. 900-906 - Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao:
HASA-Net: A Non-Intrusive Hearing-Aid Speech Assessment Network. 907-913 - Ankita Pasad, Ju-Chieh Chou, Karen Livescu:
Layer-Wise Analysis of a Self-Supervised Speech Representation Model. 914-921 - Hirofumi Inaguma, Siddharth Dalmia, Brian Yan, Shinji Watanabe:
Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates. 922-929 - Xulong Zhang, Jianzong Wang, Ning Cheng, Edward Xiao, Jing Xiao:
Cyclegean: Cycle Generative Enhanced Adversarial Network for Voice Conversion. 930-937 - Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Zhen Zeng, Edward Xiao, Jing Xiao:
TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training. 938-945 - Aolan Sun, Jianzong Wang, Ning Cheng, Methawee Tantrawenith, Zhiyong Wu, Helen Meng, Edward Xiao, Jing Xiao:
Reconstructing Dual Learning for Neural Voice Conversion Using Relatively Few Samples. 946-953 - Chuan-En Hsu, Mahdin Rohmatillah, Jen-Tzung Chien:
Multitask Generative Adversarial Imitation Learning for Multi-Domain Dialogue System. 954-961 - Asier López-Zorrilla, M. Inés Torres, Heriberto Cuayáhuitl:
Audio Embeddings Help to Learn Better Dialogue Policies. 962-968 - Christian Geishauser, Songbo Hu, Hsien-Chin Lin, Nurul Lubis, Michael Heck, Shutong Feng, Carel van Niekerk, Milica Gasic:
What does the User Want? Information Gain for Hierarchical Dialogue Policy Optimisation. 969-976 - Simon Keizer, Norbert Braunschweiler, Svetlana Stoyanchev, Rama Doddipatla:
Dialogue Strategy Adaptation to New Action Sets Using Multi-Dimensional Modelling. 977-983 - Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim:
Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages. 984-988 - Jiyeon Kim, Mehul Kumar, Dhananjaya Gowda, Abhinav Garg, Chanwoo Kim:
A Comparison of Streaming Models and Data Augmentation Methods for Robust Speech Recognition. 989-995 - Rongzhi Gu, Shi-Xiong Zhang, Meng Yu, Dong Yu:
3D Spatial Features for Multi-Channel Target Speech Separation. 996-1002 - Yifan Guo, Yifan Chen, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan:
Far-Field Speech Recognition Based on Complex-Valued Neural Networks and Inter-Frame Similarity Difference Method. 1003-1010 - Bo Li, Ruoming Pang, Tara N. Sainath, Anmol Gulati, Yu Zhang, James Qin, Parisa Haghani, W. Ronny Huang, Min Ma, Junwen Bai:
Scaling End-to-End Models for Large-Scale Multilingual ASR. 1011-1018 - Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church:
Decoupling Recognition and Transcription in Mandarin ASR. 1019-1025 - Xiaohui Zhang, Vimal Manohar, David Zhang, Frank Zhang, Yangyang Shi, Nayan Singhal, Julian Chan, Fuchun Peng, Yatharth Saraf, Mike Seltzer:
On Lattice-Free Boosted MMI Training of HMM and CTC-Based Full-Context ASR Models. 1026-1033 - Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou:
Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings. 1034-1041 - Markus Müller, Samridhi Choudhary, Clement Chung, Athanasios Mouchtaris, Siegfried Kunzmann:
In Pursuit of Babel - Multilingual End-to-End Spoken Language Understanding. 1042-1049 - Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W. Black:
Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity. 1050-1057 - Yasufumi Moriya, Gareth J. F. Jones:
An ASR N-Best Transcript Neural Ranking Model for Spoken Content Retrieval. 1058-1064 - Shao-Wei Fan-Jiang, Bi-Cheng Yan, Tien-Hong Lo, Fu-An Chao, Berlin Chen:
Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods. 1065-1070 - Chuanbo Zhu, Ryo Hakoda, Daisuke Saito, Nobuaki Minematsu, Noriko Nakanishi, Tazuko Nishimura:
Multi-Granularity Annotation of Instantaneous Intelligibility of Learners' Utterances Based on Shadowing Techniques. 1071-1078 - Ralph Scheuerer, Tino Haderlein, Elmar Nöth, Tobias Bocklet:
Applying X-Vectors on Pathological Speech After Larynx Removal. 1079-1086 - Chao-Han Huck Yang, Linda Liu, Ankur Gandhe, Yile Gu, Anirudh Raju, Denis Filimonov, Ivan Bulyko:
Multi-Task Language Modeling for Improving Speech Recognition of Rare Words. 1087-1093 - Nay San, Martijn Bartelds, Mitchell Browne, Lily Clifford, Fiona Gibson, John Mansfield, David Nash, Jane Simpson, Myfany Turpin, Maria Vollmer, Sasha Wilmoth, Dan Jurafsky:
Leveraging Pre-Trained Representations to Improve Access to Untranscribed Speech from Endangered Languages. 1094-1101 - Wentao Zhu, Tianlong Kong, Shun Lu, Jixiang Li, Dawei Zhang, Feng Deng, Xiaorui Wang, Sen Yang, Ji Liu:
SpeechNAS: Towards Better Trade-Off Between Latency and Accuracy for Large-Scale Speaker Verification. 1102-1109 - Mickael Rouvier, Pierre-Michel Bousquet:
Studying Squeeze-and-Excitation Used in CNN for Speaker Verification. 1110-1115 - Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan:
Hybrid Network with Multi-Level Global-Local Statistics Pooling for Robust Text-Independent Speaker Recognition. 1116-1123 - Zhenning Tan, Yuguang Yang, Eunjung Han, Andreas Stolcke:
Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets. 1124-1131 - Xuechen Liu, Md. Sahidullah, Tomi Kinnunen:
Parameterized Channel Normalization for Far-Field Deep Speaker Verification. 1132-1138 - Juan Manuel Coria, Hervé Bredin, Sahar Ghannay, Sophie Rosset:
Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation. 1139-1146 - Seokhwan Kim, Yang Liu, Di Jin, Alexandros Papangelis, Karthik Gopalakrishnan, Behnam Hedayatnia, Dilek Hakkani-Tür:
"How Robust R U?": Evaluating Task-Oriented Dialogue Systems on Spoken Conversations. 1147-1154 - Sivanand Achanta, Albert Antony, Ladan Golipour, Jiangchuan Li, Tuomo Raitio, Ramya Rasipuram, Francesco Rossi, Jennifer Shi, Jaimin Upadhyay, David Winarsky, Hepeng Zhang:
On-Device Neural Speech Synthesis. 1155-1161 - Amrith Setlur, Aman Madaan, Tanmay Parekh, Yiming Yang, Alan W. Black:
Towards Using Heterogeneous Relation Graphs for End-to-End TTS. 1162-1169 - Mingqiu Wang, Hagen Soltau, Laurent El Shafey, Izhak Shafran:
Word-Level Confidence Estimation for RNN Transducers. 1170-1177 - Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza:
Using Self Attention DNNs to Discover Phonemic Features for Audio Deep Fake Detection. 1178-1184 - Raghavendra Pappagari, Piotr Zelasko, Agnieszka Mikolajczyk, Piotr Pezik, Najim Dehak:
Joint Prediction of Truecasing and Punctuation for Conversational Speech in Low-Resource Scenarios. 1185-1191