Data Analytics on Eco-Conditional Factors Affecting Speech Recognition Rate of Modern Interaction Systems
DOI: https://doi.org/10.13052/jmm1550-4646.1849

Keywords: interaction system, eco-conditional factors, recognition rate, ambient noise, human noise, utterance speed, frequency

Abstract
Speech-based interaction systems belong to the rapidly growing class of contemporary human-computer interaction techniques that have emerged in recent years. Versatility, multi-channel synchronization, sensitivity, and timing are all notable characteristics of speech recognition. Several variables influence the accuracy of voice interaction recognition, yet few researchers have studied in depth the four eco-conditional factors that tend to affect the speech recognition rate (SRR): ambient noise, human noise, utterance speed, and frequency. The principal goal of this research is to analyze the influence of these four factors on SRR through several stages of experimentation on mixed-noise speech data. A sparse representation-based analysis technique is used to evaluate the effects. The experiments show that a speaker's usual speaking pace does not noticeably affect recognition, whereas high-frequency voice signals are recognized more reliably (~98.12%) than low-frequency speech signals in noisy environments. The test results may inform the design of distributed command-and-control systems.
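To illustrate the kind of sparse representation-based analysis the abstract refers to, the following Python sketch sparse-codes a synthetic noisy magnitude-spectrum frame over a two-class exemplar dictionary and assigns it to the class whose atoms carry more of the representation energy. This is a minimal sketch in the spirit of exemplar-based sparse representations (cf. Gemmeke et al., 2011, in the references below); the dictionary construction, the "high-frequency" vs. "low-frequency" class labels, and the choice of an orthogonal matching pursuit solver are illustrative assumptions, not the authors' exact pipeline.

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_bins, atoms_per_class = 64, 40

# Hypothetical exemplar dictionary: magnitude-spectrum atoms for two
# classes, one weighted toward high-frequency bins, one toward low.
tilt = np.linspace(0.2, 1.0, n_bins)[:, None]
D_high = np.abs(rng.normal(size=(n_bins, atoms_per_class))) * tilt
D_low = np.abs(rng.normal(size=(n_bins, atoms_per_class))) * tilt[::-1]
D = np.hstack([D_high, D_low])
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms

# A noisy test frame: a high-frequency-class spectrum plus ambient noise.
frame = D_high @ np.abs(rng.normal(size=atoms_per_class))
frame += 0.1 * np.abs(rng.normal(size=n_bins))

# Sparse coding: approximate the frame with a few dictionary atoms.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=8, fit_intercept=False)
coef = omp.fit(D, frame).coef_

# Decide the class by which block of atoms carries more representation
# energy, as in sparse-representation-based classification.
e_high = np.sum(coef[:atoms_per_class] ** 2)
e_low = np.sum(coef[atoms_per_class:] ** 2)
print("predicted:", "high-frequency" if e_high > e_low else "low-frequency")

In a real setup the dictionary would be built from clean speech and noise exemplars rather than random spectra, and the recognition rate would be tallied over many frames for each eco-conditional factor.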
References
“NOISEX-92,” http://www.speech.cs.cmu.edu/comp.speech/Section1/Data/noisex.html [Online]. Accessed: 2017-03-30.
A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
J. Aron, “How innovative is Apple’s new voice assistant, Siri?,” New Scientist, 2011.
B. Laperre, J. Amaya, and G. Lapenta, “Dynamic Time Warping as a New Evaluation for Dst Forecast With Machine Learning,” Frontiers in Astronomy and Space Sciences, vol. 7, Jul. 2020.
B. Raj, T. Virtanen, S. Chaudhuri, and R. Singh, “Non-negative matrix factorization based compensation of music for automatic speech recognition,” in INTERSPEECH, pp. 717–720, 2010.
J. R. Bellegarda, “Spoken language understanding for natural interaction: The Siri experience,” in Natural Interaction with Robots, Knowbots and Smartphones, pp. 3–14, Springer, 2014.
C. Couvreur, V. Fontaine, P. Gaunard, and C. G. Mubikangiey, “Automatic classification of environmental noise events by hidden Markov models,” Applied Acoustics, vol. 54, no. 3, pp. 187–206, 1998.
C. Joder and B. Schuller, “Exploring nonnegative matrix factorization for audio classification: Application to speaker recognition,” in Speech Communication; 10. ITG Symposium, pp. 1–4, VDE, 2012.
C. Müller, Speaker Classification II. Springer, 2007.
C. Tzagkarakis and A. Mouchtaris, “Sparsity based robust speaker identification using a discriminative dictionary learning approach,” in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), pp. 1–5, IEEE, 2013.
D. O’Shaughnessy, “Enhancing speech degraded by additive noise or interfering speakers,” IEEE Communications Magazine, pp. 46–52, Feb. 1989.
L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, and J. Williams, “Recent advances in deep learning for speech research at Microsoft,” in Proc. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
G. J. Mysore, P. Smaragdis, and B. Raj, “Non-negative hidden Markov modeling of audio with application to source separation,” in International Conference on Latent Variable Analysis and Signal Separation, pp. 140–148, Springer, 2010.
J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, “Exemplar-based sparse representations for noise robust automatic speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2067–2080, 2011.
J. Hernando and C. Nadeu, “Speech recognition in noisy car environment based on OSALPC representation and robust similarity measuring techniques,” in Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Adelaide, Australia, Apr. 1994, vol. II, pp. 69–72.
J. Laroche, “Frequency-domain techniques for high-quality voice modification,” in Proc. of the 6th Int. Conference on Digital Audio Effects, 2003.
J. Le Roux, F. Weninger, and J. R. Hershey, “Sparse NMF – half-baked or well done?,” Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep. TR2015-023, 2015.
J. Nikunen and T. Virtanen, “Object-based audio coding using non-negative matrix factorization for the spectrogram representation,” in Audio Engineering Society Convention 128, Audio Engineering Society, 2010.
J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA STI/Recon technical report, vol. 93, 1993.
J. M. Salavedra, E. Masgrau, A. Moreno, and X. Jove, “A speech enhancement system using higher order AR estimation in real environments,” in Proc. European Conf. Speech Technology, Berlin, 1993, vol. 1, pp. 223–226.
J. S. Lim and A. V. Oppenheim, “All-pole modeling of degraded speech,” IEEE Trans. Acoust. Speech Signal Process., vol. 26, pp. 197–210, 1978.
J. S. Lim and A. V. Oppenheim, “All-pole modeling of degraded speech,” in Speech Enhancement, J. Lim, Ed. Englewood Cliffs, NJ: Prentice-Hall, 1983, pp. 101–114.
K. S. Rao and B. Yegnanarayana, “Prosody modification using instants of significant excitation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 972–980, 2006.
K. V. V. Girish, T. V. Ananthapadmanabha, and A. G. Ramakrishnan, “Cosine similarity based dictionary learning and source recovery for classification of diverse audio sources,” in Proc. 2016 IEEE Annual India Conference (INDICON), 2016.
L. Lee and R. C. Rose, “Speaker normalization using efficient frequency warping procedures,” in Proc. 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 353–356, 1996.
L. M. Arslan and J. H. L. Hansen, “Minimum cost based phoneme class detection for improved iterative speech enhancement,” in Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Adelaide, Australia, Apr. 1994, vol. II, pp. 45–48.
M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies, “Sparse representations in audio and music: from coding to source separation,” Proceedings of the IEEE, vol. 98, no. 6, pp. 995–1005, 2010.
M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer Publishing Company, Incorporated, 1st ed., 2010.
M. Feder, A. V. Oppenheim, and E. Weinstein, “Maximum likelihood noise cancellation using the EM algorithm,” IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-37, no. 2, pp. 204–216, 1989.
N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
P. C. Loizou, “Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869, 2005.
S. Pandiyan, M. Ashwin, R. Manikandan, K. M. Karthick Raghunath, and G. R. Anantha Raman, “Heterogeneous Internet of Things organization predictive analysis platform for apple leaf diseases recognition,” Computer Communications, vol. 154, pp. 99–110, Mar. 2020.
R. G. Malkin, Machine listening for context-aware computing. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2006.
L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
S. Nandkumar and J. H. L. Hansen, “Speech enhancement based on a new set of auditory constrained parameters,” in Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Adelaide, Australia, Apr. 1994, vol. I, pp. 1–4.
S. Zubair, F. Yan, and W. Wang, “Dictionary learning based sparse coefficients for audio classification with max and average pooling,” Digital Signal Processing, vol. 23, no. 3, pp. 960–970, 2013.
S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-27, no. 2, pp. 113–120, Apr. 1979.
T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, 2007.
V. Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,” Speech Communication, vol. 9, no. 4, pp. 351–356, 1990.
W. B. Kleijn and K. K. Paliwal, eds., Speech Coding and Synthesis. New York, NY, USA: Elsevier Science Inc., 1995.
P. Wagner, Z. Malisz, and S. Kopp, “Gesture and speech in interaction: An overview,” Speech Communication, vol. 57, pp. 209–232, 2014.
Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-32, no. 6, pp. 1109–1121, 1984.
Y. Hu and P. C. Loizou, “Subjective comparison and evaluation of speech enhancement algorithms,” Speech Communication, vol. 49, no. 7, pp. 588–601, 2007.
Y.-C. Cho and S. Choi, “Nonnegative features of spectro-temporal sounds for classification,” Pattern Recognition Letters, vol. 26, no. 9, pp. 1327–1336, 2005.