Hybrid CTC-Attention Network-Based End-to-End Speech Recognition System for Korean Language
Keywords: End-to-end speech recognition, hybrid CTC-attention network, Korean speech recognition
In this study, an automatic end-to-end speech recognition system based on a hybrid CTC-attention network is proposed for the Korean language. Deep neural network/hidden Markov model (DNN/HMM)-based speech recognition systems have driven dramatic improvements in this area; however, they are difficult for non-experts to develop for new applications. End-to-end approaches simplify the speech recognizer into a single-network architecture, allowing recognition systems to be built without expert knowledge. In this paper, we propose a hybrid CTC-attention network as an end-to-end speech recognition model for Korean. The model effectively utilizes a CTC objective function during attention model training, which improves both recognition accuracy and training speed. In most languages, end-to-end speech recognition uses characters as output labels. For Korean, however, character-based end-to-end recognition is inefficient because the language has 11,172 possible characters, a number far larger than in other languages; for example, English has 26 characters and Japanese has 50. To address this problem, we instead use the 49 Korean graphemes as output labels. Experimental results show a 10.02% character error rate (CER) when 740 hours of Korean training data are used.
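The 11,172-character figure follows directly from Unicode's Hangul composition arithmetic: every precomposed syllable is a deterministic combination of an initial consonant, a medial vowel, and an optional final consonant, so syllables can be split back into a small jamo (grapheme) inventory. The sketch below illustrates this decomposition in Python; the `decompose` helper and its index convention are illustrative assumptions, not the paper's actual label-mapping code, and the exact composition of the paper's 49-grapheme set is not specified here.

```python
# Hangul syllable decomposition into jamo graphemes (a minimal sketch).
CHOSEONG = 19   # initial consonants
JUNGSEONG = 21  # medial vowels
JONGSEONG = 28  # final consonants (index 0 = no final)

# Every precomposed Hangul syllable is one (initial, medial, final) triple,
# which yields exactly the character count cited in the abstract:
assert CHOSEONG * JUNGSEONG * JONGSEONG == 11172

SBASE = 0xAC00  # code point of the first Hangul syllable, '가'

def decompose(syllable: str) -> tuple[int, int, int]:
    """Split one precomposed syllable into (initial, medial, final) jamo indices."""
    code = ord(syllable) - SBASE
    if not 0 <= code < 11172:
        raise ValueError("not a precomposed Hangul syllable")
    cho, rest = divmod(code, JUNGSEONG * JONGSEONG)
    jung, jong = divmod(rest, JONGSEONG)
    return cho, jung, jong

print(decompose("한"))  # '한' = ㅎ + ㅏ + ㄴ → (18, 0, 4)
```

Emitting jamo indices like these instead of whole syllables is what shrinks the output label space from 11,172 classes to a few dozen graphemes, which is the motivation for the grapheme-level labels described above.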