Speech command recognition has become increasingly important in human-machine interaction systems in recent years. In this study, we propose a deep neural network-based system for recognizing both acoustic and throat-microphone speech commands. A preprocessing pipeline creates the input to the deep learning model. First, speech commands are decomposed into components using well-known signal decomposition techniques. Mel-frequency cepstral coefficient (MFCC) feature extraction is then applied to each component to obtain the feature inputs for the recognition system. At this stage, we compare the performance of several decomposition techniques, namely wavelet packet decomposition (WPD), continuous wavelet transform (CWT), and empirical mode decomposition (EMD), to identify the best choice for our model; WPD yields the best classification accuracy. This paper investigates a long short-term memory (LSTM)-based recurrent neural network (RNN) trained on the extracted MFCC features. The proposed network is trained and tested on acoustic speech commands, and additionally on speech commands recorded with a throat microphone. Finally, transfer learning is employed to increase test accuracy for throat speech recognition: the weights of the model trained on the acoustic signals are used to initialize the model for throat speech recognition. Overall, we obtain high classification accuracy for both acoustic and throat speech commands. The LSTM outperforms the GMM-HMM model, convolutional neural networks such as CNN-tpool2, and residual networks such as res15 and res26, with an accuracy above 97% on Google's Speech Commands dataset, and it achieves 95.35% accuracy on our throat speech dataset using transfer learning.
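As an illustration of the pipeline described above, the sketch below decomposes one command with WPD, stacks the MFCCs of each subband component, and initializes a throat-speech LSTM from the acoustic model's weights. This is a minimal, hypothetical rendering using PyWavelets, librosa, and PyTorch; the wavelet family (db4), decomposition level, number of MFCCs, FFT size, hidden size, and class count are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
import pywt                    # PyWavelets: wavelet packet decomposition
import librosa                 # MFCC feature extraction
import torch
import torch.nn as nn


def wpd_mfcc_features(signal: np.ndarray, sr: int,
                      wavelet: str = "db4", level: int = 3,
                      n_mfcc: int = 13) -> np.ndarray:
    """Decompose one command with WPD, then stack the MFCCs of each component."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    feats = []
    for node in wp.get_level(level, order="natural"):   # 2**level components
        # Coefficients at this level are downsampled by 2**level,
        # so pass the reduced effective sampling rate to librosa.
        mfcc = librosa.feature.mfcc(y=node.data.astype(np.float32),
                                    sr=sr // 2**level, n_mfcc=n_mfcc,
                                    n_fft=512)
        feats.append(mfcc.T)                            # (frames, n_mfcc)
    return np.concatenate(feats, axis=1)                # (frames, n_mfcc * 2**level)


class CommandLSTM(nn.Module):
    """LSTM-based recurrent classifier over sequences of MFCC frames."""

    def __init__(self, n_features: int, n_classes: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, features); classify from the last hidden state.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])


# Transfer learning: the throat model is initialized with the weights of the
# model already trained on acoustic commands, then fine-tuned on throat data.
# The 35-class setup matches Speech Commands v2 but is an assumption here.
acoustic_model = CommandLSTM(n_features=13 * 2**3, n_classes=35)
# ... train acoustic_model on acoustic speech commands ...
throat_model = CommandLSTM(n_features=13 * 2**3, n_classes=35)
throat_model.load_state_dict(acoustic_model.state_dict())
# ... fine-tune throat_model on throat-microphone speech commands ...
```

In practice, the throat model would then be fine-tuned on the throat-microphone recordings, typically with a smaller learning rate so the acoustic initialization is not overwritten too quickly.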

References

  1. McClelland JL, Elman JL. The TRACE model of speech perception. Cognitive Psychology. 1986; 18(1): 1-86.
  2. Bourlard H, Morgan N. Connectionist speech recognition: a hybrid approach. Kluwer Academic; 1994.
  3. Hinton G, Deng L, Yu D, Dahl G, Mohamed AR, Jaitly N, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine. 2012; 29: 82-97.
  4. Mesaros A, Heittola T, Eronen A, Virtanen T. Acoustic event detection in real-life recordings. Proceedings of 18th European Signal Processing Conference. 2010: 1267-1271.
  5. Jo J, Yoo H, Park IC. Energy-efficient floating-point MFCC extraction architecture for speech recognition systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2016; 24(2): 754-758.
  6. Sainath TN, Parada C. Convolutional neural networks for small-footprint keyword spotting. Proceedings of Interspeech. 2015: 1478-1482.
  7. Tang R, Lin J. Deep residual learning for small-footprint keyword spotting. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2018: 5484-5488.
  8. Warden P. Speech commands: a dataset for limited-vocabulary speech recognition. arXiv. 2018 Apr 9; 1-11. doi: 10.48550/arXiv.1804.03209.
  9. Mahfuz RA, Moni MA, Lio P, Islam SM, Berkovsky S, Khushi M, Quinn MJ. Deep convolutional neural networks based ECG beats classification to diagnose cardiovascular conditions. Biomedical Engineering Letters. 2021; 11: 1-16.
  10. Amiri GG, Asadi A. Comparison of different methods of wavelet and wavelet packet transform in processing ground motion records. International Journal of Civil Engineering. 2009; 7(4): 248-257.
  11. Zeiler A, Faltermeier R, Keck IR, Tome AM, Puntonet CG, Lang EW. Empirical mode decomposition: an introduction. Proceedings of the International Joint Conference on Neural Networks (IJCNN). 2010.
  12. Molla MKI, Das S, Hamid ME, Hirose K. Empirical mode decomposition for advanced speech signal processing. Journal of Signal Processing. 2013; 17: 215-229.
  13. Alim SA, Rashid NKA. Some commonly used speech feature extraction algorithms. IntechOpen. 2018.
  14. Dave N. Feature extraction methods LPC, PLP and MFCC in speech recognition. International Journal for Advance Research in Engineering and Technology. 2013; 1(6): 1-5.
  15. Gold B, Morgan N, Ellis D. Speech and audio signal processing: processing and perception of speech and music. John Wiley & Sons; 2002: 189-203.
  16. Memon S, Lech M, He L. Using information theoretic vector quantization for inverted MFCC based speaker verification. Proceedings of 2nd International Conference on Computer, Control and Communication. 2009.
  17. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997; 9(8): 1735-1780.
  18. Yu Y, Si X, Hu C, Zhang J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation. 2019; 31(7): 1235-1270.
  19. Sak H, Senior A, Beaufays F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Proceedings of Interspeech. 2014: 338-342.
  20. Li X, Wu X. Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015.
  21. Bengio Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning. 2009; 2(1): 1-127.