Hybrid CTC-Attention Network-Based End-to-End Speech Recognition System for Korean Language

Authors

  • Hosung Park, Sogang University, Seoul, South Korea
  • Changmin Kim, LG Electronics, Seoul, South Korea
  • Hyunsoo Son, Sogang University, Seoul, South Korea
  • Soonshin Seo, Naver Corporation, Gyeonggi Province, South Korea https://orcid.org/0000-0002-1897-8256
  • Ji-Hwan Kim, Sogang University, Seoul, South Korea https://orcid.org/0000-0001-9054-2994

DOI:

https://doi.org/10.13052/jwe1540-9589.2126

Keywords:

End-to-end speech recognition, Hybrid CTC-attention network, Korean speech recognition

Abstract

In this study, an end-to-end automatic speech recognition system based on a hybrid CTC-attention network is proposed for the Korean language. Deep neural network/hidden Markov model (DNN/HMM)-based speech recognition systems have driven dramatic improvements in this area. However, it is difficult for non-experts to develop such systems for new applications. End-to-end approaches simplify the speech recognition system into a single-network architecture, so that a system can be developed without expert knowledge. In this paper, we propose a hybrid CTC-attention network as an end-to-end speech recognition model for the Korean language. This model effectively utilizes a CTC objective function during attention-model training, which improves both speech recognition accuracy and training speed. In most languages, end-to-end speech recognition uses characters as output labels. For Korean, however, character-based end-to-end speech recognition is not efficient because the Korean language has 11,172 possible characters, a large number relative to other languages: English has 26 characters, and Japanese has about 50. To address this problem, we use 49 Korean graphemes as output labels. Experimental results show a 10.02% character error rate (CER) when 740 hours of Korean training data are used.
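The hybrid training objective described in the abstract interpolates a CTC loss with the attention decoder's loss. The abstract does not give the interpolation weight or model internals, so the sketch below is purely illustrative: a textbook pure-Python CTC forward algorithm (Graves et al., 2006) over toy per-frame distributions, and a `hybrid_loss` helper whose weight `lam=0.3` is an assumed, hypothetical value, not a number from the paper.

```python
def ctc_forward(probs, labels, blank=0):
    """P(labels | x) under CTC: sum over all frame-level alignments.
    probs: per-frame probability distributions over the vocabulary (index 0 = blank)."""
    ext = [blank]
    for y in labels:          # interleave labels with blanks: [b, y1, b, y2, b, ...]
        ext += [y, blank]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]          # start with a blank ...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]     # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]            # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1][s - 1]   # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]   # skip the blank between distinct labels
            alpha[t][s] = a * probs[t][ext[s]]
    # valid endings: final blank or final label
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def hybrid_loss(ctc_nll, att_nll, lam=0.3):
    """Multi-task objective L = lam * L_CTC + (1 - lam) * L_att (lam is illustrative)."""
    return lam * ctc_nll + (1 - lam) * att_nll
```

With two frames of uniform probabilities over {blank, a, b}, the three alignments collapsing to "a" (aa, a-, -a) give P = 3/9 = 1/3, which the recursion reproduces.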
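The 11,172 figure follows from Hangul's compositional encoding: every precomposed syllable combines one of 19 initial consonants, 21 vowels, and 28 finals (including "no final"), so 19 x 21 x 28 = 11,172. A minimal sketch of the Unicode arithmetic that splits a syllable character into its grapheme indices, illustrating why a grapheme label set is so much smaller than a character one (the paper's exact 49-label inventory is not reproduced here):

```python
# Precomposed Hangul syllables occupy U+AC00..U+D7A3, ordered by
# (initial, medial, final) index: 19 * 21 * 28 = 11,172 characters.
SBASE = 0xAC00
INITIALS, MEDIALS, FINALS = 19, 21, 28   # final index 0 means "no final consonant"

def decompose(syllable):
    """Return the (initial, medial, final) grapheme indices of one syllable."""
    code = ord(syllable) - SBASE
    if not 0 <= code < INITIALS * MEDIALS * FINALS:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rem = divmod(code, MEDIALS * FINALS)
    medial, final = divmod(rem, FINALS)
    return initial, medial, final

def compose(initial, medial, final):
    """Inverse mapping: grapheme indices back to the syllable character."""
    return chr(SBASE + (initial * MEDIALS + medial) * FINALS + final)
```

For example, the syllable '한' decomposes into initial ㅎ (index 18), vowel ㅏ (index 0), and final ㄴ (final index 4), and composing those indices restores the character.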

Author Biographies

Hosung Park, Sogang University, Seoul, South Korea

Hosung Park received the B.E. degree in computer science and engineering from Handong Global University in 2016. He also received the M.E. degree in computer science and engineering from Sogang University in 2018. He is currently pursuing the Ph.D. degree in computer science and engineering with Sogang University. His research interests include speech recognition and spoken multimedia content.

Hyunsoo Son, Sogang University, Seoul, South Korea

Hyunsoo Son received the B.E. degree in computer science and engineering from Sogang University in 2019. He is currently pursuing the M.E. degree in computer science and engineering with Sogang University. His research interests include speech recognition and spoken multimedia content search.

Soonshin Seo, Naver Corporation, Gyeonggi Province, South Korea

Soonshin Seo received the B.A. degree in linguistics and the B.E. degree in computer science and engineering from Hankuk University of Foreign Studies in 2018. He is currently pursuing the Ph.D. degree in computer science and engineering with Sogang University. Since 2021, he has also been a Research Engineer with Naver Corporation. His research interests include speech recognition and spoken multimedia content search.

Ji-Hwan Kim, Sogang University, Seoul, South Korea

Ji-Hwan Kim received the B.E. and M.E. degrees in computer science from the Korea Advanced Institute of Science and Technology (KAIST) in 1996 and 1998, respectively, and the Ph.D. degree in engineering from the University of Cambridge in 2001. From 2001 to 2007, he was a Chief Research Engineer and a Senior Research Engineer with the LG Electronics Institute of Technology, where he was engaged in the development of speech recognizers for mobile devices. In 2004, he was a Visiting Scientist with the MIT Media Lab. Since 2007, he has been a Faculty Member with the Department of Computer Science and Engineering, Sogang University, where he is currently a full Professor. His research interests include spoken multimedia content search, speech recognition for embedded systems, and dialogue understanding.

Published

2022-01-04

Issue

Section

Communication, Multimedia and Learning Technology through Future Web Engineering