Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging

Authors

  • Hosung Park, Sogang University, Seoul, South Korea, https://orcid.org/0000-0003-3537-4048
  • Yoonseo Chung, Sogang University, Seoul, South Korea
  • Ji-Hwan Kim, Sogang University, Seoul, South Korea

DOI:

https://doi.org/10.13052/jwe1540-9589.2211

Keywords:

Content retrieval, speech recognition, music detection, audio event classification, audio scene classification

Abstract

Videos contain visual and auditory information. Visual information in a video can include images of people, objects, and the landscape, whereas auditory information includes voices, sound effects, background music, and the soundscape. The audio content can provide detailed information on the story through analysis of the voices, the atmosphere of the sound effects, and the soundscape. Metadata tags represent the results of a media analysis as text and can be used to classify video content on social networking services such as YouTube. This paper presents methodologies for speech, audio, and music processing, and proposes integrating these audio tagging methods into an audio metadata generation system for video storytelling. The proposed system automatically creates metadata tags from the speech, sound effects, and background music in the audio input. It comprises five subsystems: (1) automatic speech recognition, which generates text from the linguistic sounds in the audio; (2) audio event classification, which identifies the type of sound effect; (3) audio scene classification, which identifies the type of place from the soundscape; (4) music detection, which detects background music; and (5) keyword extraction from the automatic speech recognition results. The audio signal is first converted into the form each subsystem requires, and the subsystem outputs are then combined to create metadata for the audio content. We evaluated the proposed system using video logs (vlogs) on YouTube. The proposed system exhibits accuracy similar to that of handcrafted metadata for the audio content, achieving 65.83% accuracy for a total of 104 YouTube vlogs.
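As a rough illustration of the integration step described in the abstract (a minimal sketch, not the authors' implementation), the Python example below shows one way the outputs of the five subsystems could be merged into a flat list of text tags; the data structure, field names, and tag format are assumptions made only for this example.

from dataclasses import dataclass, field
from typing import List


@dataclass
class SubsystemOutputs:
    """Hypothetical container for the results of the five subsystems."""
    transcript: str = ""                                      # (1) automatic speech recognition
    audio_events: List[str] = field(default_factory=list)     # (2) audio event classification
    acoustic_scene: str = ""                                   # (3) audio scene classification
    has_background_music: bool = False                         # (4) music detection
    keywords: List[str] = field(default_factory=list)          # (5) keyword extraction from ASR text


def merge_into_tags(outputs: SubsystemOutputs) -> List[str]:
    """Combine the per-subsystem results into one de-duplicated list of text tags."""
    tags: List[str] = []
    tags.extend(outputs.keywords)
    tags.extend(outputs.audio_events)
    if outputs.acoustic_scene:
        tags.append(outputs.acoustic_scene)
    if outputs.has_background_music:
        tags.append("background_music")
    # Drop duplicate tags while preserving the order in which they were produced.
    seen = set()
    return [tag for tag in tags if not (tag in seen or seen.add(tag))]


if __name__ == "__main__":
    example = SubsystemOutputs(
        transcript="we finally arrived at the beach and set up the tent",
        audio_events=["waves", "wind"],
        acoustic_scene="beach",
        has_background_music=True,
        keywords=["beach", "tent"],
    )
    print(merge_into_tags(example))  # ['beach', 'tent', 'waves', 'wind', 'background_music']

In this sketch, keyword, event, and scene labels are treated as plain strings so that the merged list can be attached directly to a video as metadata tags; de-duplication preserves the order in which the subsystems produced them.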

Author Biographies

Hosung Park, Sogang University, Seoul, South Korea

Hosung Park received his B.E. degree in Computer Science and Engineering from Handong Global University in 2016. He also received his M.E. degree in Computer Science and Engineering from Sogang University in 2018. He is currently pursuing a Ph.D. degree in Computer Science and Engineering at Sogang University. His research interests include speech recognition and spoken multimedia content.

Yoonseo Chung, Sogang University, Seoul, South Korea

Yoonseo Chung received his B.E. degree in Computer Science and Engineering from Sogang University in 2022. He is currently pursuing an M.E. degree in Computer Science and Engineering at Sogang University. His research interests include speech recognition and audio event classification.

Ji-Hwan Kim, Sogang University, Seoul, South Korea

Ji-Hwan Kim received his B.E. and M.E. degrees in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 1996 and 1998, respectively, and his Ph.D. degree in Engineering from the University of Cambridge in 2001. From 2001 to 2007, he was a chief research engineer and a senior research engineer at the LG Electronics Institute of Technology, where he was engaged in the development of speech recognizers for mobile devices. In 2004, he was a visiting scientist at the MIT Media Lab. Since 2007, he has been a faculty member in the Department of Computer Science and Engineering, Sogang University, where he is currently a full professor. His research interests include spoken multimedia content search, speech recognition for embedded systems, and dialogue understanding.

Published

2023-04-20

How to Cite

Park, H., Chung, Y., & Kim, J.-H. (2023). Deep Neural Networks-based Classification Methodologies of Speech, Audio and Music, and its Integration for Audio Metadata Tagging. Journal of Web Engineering, 22(01), 1–26. https://doi.org/10.13052/jwe1540-9589.2211

Issue

Section

ECTI