Convolutional Neural Networks Using Log Mel-Spectrogram Separation for Audio Event Classification with Unknown Devices

Authors

Soonshin Seo, Changmin Kim, Ji-Hwan Kim

DOI:

https://doi.org/10.13052/jwe1540-9589.21216

Keywords:

Audio event classification, unknown device, log mel-spectrogram, log mel-spectrogram separation, convolutional neural networks

Abstract

Audio event classification refers to the detection and classification by a computer of non-verbal signals, such as dog barks and horn sounds, contained in audio data. Recently, deep neural network technology has been applied to audio event classification, achieving higher performance than existing models. Among these approaches, convolutional neural network (CNN)-based training methods that receive audio as a spectrogram, a two-dimensional image, have been widely used. However, audio event classification performs poorly on test data recorded by a device (unknown device) different from the one used to record the training data (known device). This is because each recording device emphasizes a different frequency range, so the spectrograms generated by known devices differ in shape from those generated by unknown devices. In this study, to improve the performance of the event classification system on unknown devices, a CNN based on a log mel-spectrogram separation technique was applied, and its performance on unknown devices was evaluated. The system classifies 16 types of audio signals. It receives audio data in 0.4-s segments, and accuracy is measured on test data generated by unknown devices using a model trained on data generated by known devices. The experiments showed a relative improvement over the baseline of up to 37.33%: from 63.63% to 73.33% on the Google Pixel, and from 47.42% to 65.12% on the LG V50.
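To illustrate the front end the abstract describes, the sketch below computes a log mel-spectrogram for a 0.4-s clip and separates it into low- and high-frequency mel-band halves, which could then feed separate CNN paths. This is not the authors' implementation: the sample rate (16 kHz), FFT size, hop length, and 64 mel bands are illustrative assumptions, not values from the paper.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2.0
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 equally spaced points on the mel scale -> FFT bin indices
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Frame + window + FFT -> power spectrogram -> mel warp -> log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10).T               # shape: (n_mels, n_frames)

sr = 16000
y = np.random.randn(int(0.4 * sr))             # a 0.4-s clip, as in the paper
S = log_mel_spectrogram(y, sr=sr)
# Separation: split the mel axis into low- and high-frequency halves,
# each of which would be fed to its own CNN path.
low, high = S[:32], S[32:]
print(S.shape, low.shape, high.shape)          # (64, 37) (32, 37) (32, 37)
```

Splitting along the mel (frequency) axis lets each CNN path specialize on a sub-band, which is the intuition behind the separation technique when device-dependent frequency emphasis distorts only part of the spectrum.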


Author Biographies

Soonshin Seo, Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea

Soonshin Seo received his B.A. degree in Linguistics and B.E. degree in Computer Science and Engineering from Hankuk University of Foreign Studies in 2018. He is currently pursuing a Ph.D. degree in Computer Science and Engineering at Sogang University. His research interests include speaker recognition and spoken multimedia content search.

Changmin Kim, LG Electronics, Seoul, Republic of Korea

Changmin Kim received his B.E. and M.E. degrees in Computer Science and Engineering from Sogang University in 2019 and 2021, respectively. He is a research engineer at the LG Electronics Institute of Technology, where he is engaged in the development of speech recognition and audio event classification for mobile devices. His research interests include speech recognition and spoken multimedia content.

Ji-Hwan Kim, Dept. of Computer Science and Engineering, Sogang University, Seoul, Republic of Korea

Ji-Hwan Kim received the B.E. and M.E. degrees in Computer Science from KAIST (Korea Advanced Institute of Science and Technology) in 1996 and 1998, respectively, and the Ph.D. degree in Engineering from the University of Cambridge in 2001. From 2001 to 2007, he was a chief research engineer and a senior research engineer at the LG Electronics Institute of Technology, where he was engaged in the development of speech recognizers for mobile devices. In 2004, he was a visiting scientist at the MIT Media Lab. Since 2007, he has been a faculty member in the Department of Computer Science and Engineering, Sogang University. Currently, he is a full professor. His research interests include spoken multimedia content search, speech recognition for embedded systems, and dialogue understanding.

References

G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649, 2012.

M. Lim, and J. Kim, “Audio event classification using deep neural networks,” Phonetics and Speech Sciences, vol. 7, no. 4, pp. 27–33, 2015.

M. Lim, D. Lee, H. Park, Y. Kang, J. Oh, and J. Kim, “Convolutional neural network based audio event classification,” KSII Transactions on Internet and Information Systems, vol. 12, no. 6, pp. 2748–2760, 2018.

H. Chen, Z. Liu, Z. Liu, P. Zhang, and Y. Yan, “Integrating the data augmentation scheme with various classifiers for acoustic scene modeling,” arXiv preprint arXiv:1907.06639, 2019.

M. Kosmider, “Calibrating neural networks for secondary recording devices,” in Proceedings of Detection and Classification of Acoustic Scenes and Events Workshop, pp. 25–26, 2019.

M. McDonnell, and W. Gao, “Acoustic scene classification using deep residual network with late fusion of separated high and low frequency paths,” Technical Report of Detection and Classification of Acoustic Scenes and Events Challenge, 2019.

H. Zeinali, L. Burget, and J. Cernocky, “Convolutional neural networks and x-vector embedding for DCASE2018 acoustic scene classification challenge,” arXiv preprint arXiv:1810.04273, 2018.

M. Dorfer, B. Lehner, H. Eghbal-zadeh, C. Heindl, F. Paischer, and G. Widmer, “Acoustic scene classification with fully convolutional neural networks and i-vectors,” Technical Report of Detection and Classification of Acoustic Scenes and Events Challenge, 2018.

Y. Sakashita, and M. Aono, “Acoustic scene classification by ensemble of spectrograms based on adaptive temporal divisions,” Technical Report of Detection and Classification of Acoustic Scenes and Events Challenge, 2018.

K. Piczak, “ESC: Dataset for environmental sound classification,” in Proceedings of the ACM International Conference on Multimedia, pp. 1015–1018, 2015.

J. Salamon, C. Jacoby, and J. Bello, “A dataset and taxonomy for urban sound research,” in Proceedings of ACM International Conference on Multimedia, pp. 1041–1044, 2014.

K. Piczak, “Environmental sound classification with convolutional neural networks,” in Proceedings of Machine Learning for Signal Processing Workshop, pp. 1–6, 2015.

K. Piczak, “The details that matter: Frequency resolution of spectrograms in acoustic scene classification,” Technical Report of Detection and Classification of Acoustic Scenes and Events Challenge, 2017.

S. Phaye, E. Benetos, and Y. Wang, “Subspectralnet using sub-spectrogram based convolutional neural networks for acoustic scene classification,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 825–829, 2019.

S. Suh, S. Park, Y. Jeong and T. Lee, “Designing acoustic scene classification models with CNN variants,” Technical Report, Detection and Classification of Acoustic Scenes and Events Challenge, 2020.

J. Portelo, M. Bugalho, I. Trancoso, J. Neto, A. Abad, and A. Serralheiro, “Non-speech audio event detection,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 1973–1976, 2009.

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2015.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.

C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.

G. Dahl, T. Sainath, and G. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 8609–8613, 2013.

G. Grice, R. Nullmeyer, and V. Spiker, “Human reaction time: toward a general theory,” Journal of Experimental Psychology: General, vol. 111, no. 1, pp. 135–153, 1982.

J. Gemmeke, D. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. Moore, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 776–780, 2017.

H. Eghbal-zadeh, B. Lehner, M. Dorfer, and G. Widmer, “A hybrid approach using binaural i-vectors and deep convolutional neural networks,” Technical Report of Detection and Classification of Acoustic Scenes and Events Challenge, 2016.

M. Slaney, “Semantic-audio retrieval,” in Proceedings of International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 1408–1411, 2002.

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, and S. Ghemawat, “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

Published

2022-01-22

Issue

Section

SPECIAL ISSUE ON Future Multimedia Contents and Technology on Web in the 5G Era