Convolutional Neural Networks Using Log Mel-Spectrogram Separation for Audio Event Classification with Unknown Devices
DOI: https://doi.org/10.13052/jwe1540-9589.21216

Keywords: Audio event classification, unknown device, log mel-spectrogram, log mel-spectrogram separation, convolutional neural networks

Abstract
Audio event classification refers to the detection and classification by a computer of non-verbal sounds contained in audio data, such as dog barks and car horns. Recently, deep neural network techniques have been applied to audio event classification and have outperformed existing models. Among these, convolutional neural network (CNN)-based methods, which receive audio in the form of a spectrogram (a two-dimensional image), are widely used. However, classification accuracy degrades when the test data are recorded by a device (unknown device) different from the one used to record the training data (known device). This is because each recording device emphasizes a different frequency range, so the spectrograms produced by known devices and those produced by unknown devices differ in shape. In this study, to improve the performance of the event classification system, a CNN based on a log mel-spectrogram separation technique was applied, and performance on unknown devices was evaluated. The system classifies 16 types of audio signals. It receives audio data in 0.4 s segments and measures the accuracy on test data generated from unknown devices using a model trained on data generated from known devices. Experiments showed a relative improvement over the baseline of up to 37.33%: from 63.63% to 73.33% on the Google Pixel, and from 47.42% to 65.12% on the LG V50.
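The separation technique named in the abstract splits the log mel-spectrogram along the frequency axis so that low- and high-band sub-spectrograms can feed separate CNN paths. The following is a minimal NumPy sketch of such a front end, not the paper's exact implementation; the sampling rate (16 kHz), FFT size, hop length, 64 mel bands, and an even low/high split are all assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr=16000, n_fft=512, hop=160, n_mels=64):
    # Frame the waveform, window, and take the power spectrum.
    frames = [y[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10).T      # shape: (n_mels, n_frames)

# A 0.4 s clip at 16 kHz, matching the paper's input length.
y = np.random.randn(int(0.4 * 16000))
S = log_mel_spectrogram(y)
low, high = S[:32], S[32:]            # separated low/high mel-band inputs
```

Each half (`low`, `high`) would then be passed to its own convolutional path, so that a device-specific emphasis confined to one frequency range does not distort the features learned for the other.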