Volume 42, Issue 4, 2015, Pages 722-737

Audio-visual speech recognition using deep learning

Author keywords

Audio visual speech recognition; Deep learning; Feature extraction; Multi stream HMM

Indexed keywords

ACOUSTIC NOISE; AUDIO ACOUSTICS; FEATURE EXTRACTION; HIDDEN MARKOV MODELS; LEARNING SYSTEMS; NEURAL NETWORKS; SIGNAL TO NOISE RATIO; SPEECH ANALYSIS; VOCABULARY CONTROL;

EID: 84939956018     PISSN: 0924669X     EISSN: 15737497     Source Type: Journal    
DOI: 10.1007/s10489-014-0629-7     Document Type: Article
Times cited : (567)

References (52)
  • 2
    • 84867605836
    • Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition
    • Abdel-Hamid O, rahman Mohamed A, Jiang H, Penn G (2012) Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, pp 4277–4280
    • (2012) Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 4277-4280
    • Abdel-Hamid, O.1    rahman Mohamed, A.2    Jiang, H.3    Penn, G.4
  • 3
    • 4544329810
    • Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 5, Montreal
    • Aleksic PS, Katsaggelos AK (2004) Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 5, Montreal, pp 917–920
    • (2004) pp 917–920
    • Aleksic, P.S.1    Katsaggelos, A.K.2
  • 4
    • 84977800621
    • Evidence of correlation between acoustic and visual features of speech. In: Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco
    • Barker J, Berthommier F (1999) Evidence of correlation between acoustic and visual features of speech. In: Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, pp 5–9
    • (1999) pp 5–9
    • Barker, J.1    Berthommier, F.2
  • 5
    • 69349090197
    • Learning deep architectures for AI
    • Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127
    • (2009) Found Trends Mach Learn , vol.2 , Issue.1
    • Bengio, Y.1
  • 6
    • 0030355935
    • A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of the 4th International Conference on Spoken Language Processing, vol 1, Philadelphia
    • Bourlard H, Dupont S (1996) A new ASR approach based on independent processing and recombination of partial frequency bands. In: Proceedings of the 4th International Conference on Spoken Language Processing, vol 1, Philadelphia, pp 426–429
    • (1996) pp 426–429
    • Bourlard, H.1    Dupont, S.2
  • 7
    • 84939943424
    • Multi-stream speech recognition
    • Bourlard H, Dupont S, Ris C (1996) Multi-stream speech recognition. IDIAP research report
    • (1996) IDIAP research report
    • Bourlard, H.1    Dupont, S.2    Ris, C.3
  • 9
    • 0022920273
    • Seeing speech: Investigations into the synthesis and recognition of visible speech movements using automatic image processing and computer graphics
    • Brooke N, Petajan ED (1986) Seeing speech: Investigations into the synthesis and recognition of visible speech movements using automatic image processing and computer graphics. In: Proceedings of the International Conference on Speech Input and Output, Techniques and Applications, London, pp 104–109
    • (1986) Proceedings of the International Conference on Speech Input and Output, Techniques and Applications, pp. 104-109
    • Brooke, N.1    Petajan, E.D.2
  • 10
    • 84894294885
    • Deep learning with COTS HPC. In: Proceedings of the 30th international conference on machine learning, Atlanta
    • Coates A, Huval B, Wang T, Wu DJ, Ng AY, Catanzaro B (2013) Deep learning with COTS HPC. In: Proceedings of the 30th international conference on machine learning, Atlanta, pp 1337–1345
    • (2013) pp 1337–1345
    • Coates, A.1    Huval, B.2    Wang, T.3    Wu, D.J.4    Ng, A.Y.5    Catanzaro, B.6
  • 12
    • 84055222005
    • Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition
    • Dahl GE, Acero A (2012) Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process 20(1):30–42
    • (2012) IEEE Trans Audio Speech Lang Process , vol.20 , Issue.1 , pp. 30-42
    • Dahl, G.E.1    Acero, A.2
  • 13
    • 84905259759
    • Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition
    • Feng X, Zhang Y, Glass J (2014) Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Florence, pp 1759–1763
    • (2014) Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1759-1763
    • Feng, X.1    Zhang, Y.2    Glass, J.3
  • 14
    • 63449120701
    • Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition. In: Proceedings of the 10th International Conference on Multimodal Interfaces, Chania
    • Gurban M, Thiran JP, Drugman T, Dutoit T (2008) Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition. In: Proceedings of the 10th International Conference on Multimodal Interfaces, Chania, pp 237–240
    • (2008) pp 237–240
    • Gurban, M.1    Thiran, J.P.2    Drugman, T.3    Dutoit, T.4
  • 15
    • 85009284526
    • DCT-based video features for audio-visual speech recognition. In: Proceedings of the 7th International Conference on Spoken Language Processing, vol 3, Denver
    • Heckmann M, Kroschel K, Savariaux C (2002) DCT-based video features for audio-visual speech recognition. In: Proceedings of the 7th International Conference on Spoken Language Processing, vol 3, Denver, pp 1925–1928
    • (2002) pp 1925–1928
    • Heckmann, M.1    Kroschel, K.2    Savariaux, C.3
  • 16
    • 0033709098
    • Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 3, Istanbul
    • Hermansky H, Ellis D, Sharma S (2000) Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 3, Istanbul, pp 1635–1638
    • (2000) pp 1635–1638
    • Hermansky, H.1    Ellis, D.2    Sharma, S.3
  • 18
    • 33746600649
    • Reducing the dimensionality of data with neural networks
    • Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
    • (2006) Science , vol.313 , Issue.5786 , pp. 504-507
    • Hinton, G.E.1    Salakhutdinov, R.R.2
  • 19
    • 84890465549
    • Audio-visual deep learning for noise robust speech recognition
    • Huang J, Kingsbury B (2013) Audio-visual deep learning for noise robust speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Vancouver, pp 7596–7599
    • (2013) Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 7596-7599
    • Huang, J.1    Kingsbury, B.2
  • 22
    • 84939939168
    • Imagenet classification with deep convolutional neural networks
    • Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems
    • (2012) Advances in Neural Information Processing Systems
    • Krizhevsky, A.1    Sutskever, I.2    Hinton, G.3
  • 24
    • 82955182641
    • Improving visual features for lip-reading
    • Lan Y, Theobald BJ, Harvey R, Ong EJ, Bowden R (2010) Improving visual features for lip-reading. In: Proceedings of the International Conference on Auditory-Visual Speech Processing, Hakone, Japan
    • (2010) Proceedings of the International Conference on Auditory-Visual Speech Processing, Hakone, Japan
    • Lan, Y.1    Theobald, B.J.2    Harvey, R.3    Ong, E.J.4    Bowden, R.5
  • 25
    • 84867135575
    • Building high-level features using large scale unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh
    • Le QV, Ranzato M, Monga R, Devin M, Chen K, Corrado GS, Dean J, Ng AY (2012) Building high-level features using large scale unsupervised learning. In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, pp 81–88
    • (2012) pp 81–88
    • Le, Q.V.1    Ranzato, M.2    Monga, R.3    Devin, M.4    Chen, K.5    Corrado, G.S.6    Dean, J.7    Ng, A.Y.8
  • 26
    • 5044231640
    • Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2, Washington
    • LeCun Y, Bottou L (2004) Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol 2, Washington, pp 97–104
    • (2004) pp 97–104
    • LeCun, Y.1    Bottou, L.2
  • 27
    • 0032203257
    • Gradient-based learning applied to document recognition
    • LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
    • (1998) Proc IEEE , vol.86 , Issue.11 , pp. 2278-2324
    • LeCun, Y.1    Bottou, L.2    Bengio, Y.3    Haffner, P.4
  • 28
    • 71149119164
    • Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th International Conference on Machine Learning, Montreal
    • Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th International Conference on Machine Learning, Montreal, pp 609–616
    • (2009) pp 609–616
    • Lee, H.1    Grosse, R.2    Ranganath, R.3    Ng, A.Y.4
  • 29
    • 84863380535
    • Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Proceedings of the Advances in Neural Information Processing Systems 22, Vancouver
    • Lee H, Pham P, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Proceedings of the Advances in Neural Information Processing Systems 22, Vancouver, pp 1096–1104
    • (2009) pp 1096–1104
    • Lee, H.1    Pham, P.2    Largman, Y.3    Ng, A.Y.4
  • 30
    • 0032822143
    • A comparative study of neural network based feature extraction paradigms
    • Lerner B, Guterman H, Aladjem M, Dinstein I (1999) A comparative study of neural network based feature extraction paradigms. Pattern Recogn Lett 20(1):7–14
    • (1999) Pattern Recogn Lett , vol.20 , Issue.1 , pp. 7-14
    • Lerner, B.1    Guterman, H.2    Aladjem, M.3    Dinstein, I.4
  • 31
    • 0029765665
    • Visual speech recognition using active shape models and hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 2, Atlanta
    • Luettin J, Thacker N, Beet S (1996) Visual speech recognition using active shape models and hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol 2, Atlanta, pp 817–820
    • (1996) pp 817–820
    • Luettin, J.1    Thacker, N.2    Beet, S.3
  • 32
    • 84977820250
    • Recurrent neural network feature enhancement: The 2nd CHiME challenge. In: Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments, Vancouver
    • Maas AL, O’Neil TM, Hannun AY, Ng AY (2013) Recurrent neural network feature enhancement: The 2nd CHiME challenge. In: Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments, Vancouver, Canada
    • (2013) Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments
    • Maas, A.L.1    O’Neil, T.M.2    Hannun, A.Y.3    Ng, A.Y.4
  • 40
    • 0000255539
    • Fast exact multiplication by the Hessian
    • Pearlmutter B (1994) Fast exact multiplication by the Hessian. Neural Comput 6(1):147–160
    • (1994) Neural Comput , vol.6 , Issue.1 , pp. 147-160
    • Pearlmutter, B.1
  • 42
    • 0004762797
    • Exploiting sensor fusion architectures and stimuli complementarity in AV speech recognition
    • Stork D, Hennecke M (eds), Springer, Berlin Heidelberg
    • Robert-Ribes J, Piquemal M, Schwartz JL, Escudier P (1996) Exploiting sensor fusion architectures and stimuli complementarity in AV speech recognition. In: Stork D, Hennecke M (eds) Speechreading by Humans and Machines. Springer, Berlin Heidelberg, pp 193–210
    • (1996) Speechreading by Humans and Machines , pp. 193-210
    • Robert-Ribes, J.1    Piquemal, M.2    Schwartz, J.L.3    Escudier, P.4
  • 43
    • 84867593213
    • Auto-encoder bottleneck features using deep belief networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto
    • Sainath TN, Kingsbury B, Ramabhadran B (2012) Auto-encoder bottleneck features using deep belief networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Kyoto, pp 4153–4156
    • (2012) pp 4153–4156
    • Sainath, T.N.1    Kingsbury, B.2    Ramabhadran, B.3
  • 44
    • 0035791204
    • Feature analysis for automatic speechreading
    • Scanlon P, Reilly R (2001) Feature analysis for automatic speechreading. In: Proceedings of the IEEE 4th Workshop on Multimedia Signal Processing, Cannes, pp 625–630
    • (2001) Proceedings of the IEEE 4th Workshop on Multimedia Signal Processing, pp. 625-630
    • Scanlon, P.1    Reilly, R.2
  • 45
    • 0036631778
    • Fast curvature matrix-vector products for second-order gradient descent
    • Schraudolph NN (2002) Fast curvature matrix-vector products for second-order gradient descent. Neural Comput 14(7):1723–1738
    • (2002) Neural Comput , vol.14 , Issue.7 , pp. 1723-1738
    • Schraudolph, N.N.1
  • 46
    • 0004213132
    • Auditory toolbox: A MATLAB toolbox for auditory modeling work version 2
    • Slaney M (1998) Auditory toolbox: A MATLAB toolbox for auditory modeling work version 2. Interval Research Corporation
    • (1998) Interval Research Corporation
    • Slaney, M.1
  • 47
    • 80053459857
    • Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning, Bellevue
    • Sutskever I, Martens J, Hinton G (2011) Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning, Bellevue, pp 1017–1024
    • (2011) pp 1017–1024
    • Sutskever, I.1    Martens, J.2    Hinton, G.3
  • 48
    • 56449089103
    • Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on Machine learning, New York
    • Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on Machine learning, New York, pp 1096–1103
    • (2008) pp 1096–1103
    • Vincent, P.1    Larochelle, H.2    Bengio, Y.3    Manzagol, P.A.4
  • 49
    • 79551480483
    • Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
    • Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
    • (2010) J Mach Learn Res , vol.11 , pp. 3371-3408
    • Vincent, P.1    Larochelle, H.2    Lajoie, I.3    Bengio, Y.4    Manzagol, P.A.5
  • 50
    • 0032178592
    • Quantitative association of vocal-tract and facial behavior
    • Yehia H, Rubin P, Vatikiotis-Bateson E (1998) Quantitative association of vocal-tract and facial behavior. Speech Comm 26:23–43
    • (1998) Speech Comm , vol.26 , pp. 23-43
    • Yehia, H.1    Rubin, P.2    Vatikiotis-Bateson, E.3
  • 51
    • 77950563943
    • Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In: Proceedings of the 9th IEEE-RAS International Conference on Humanoid Robots, Paris
    • Yoshida T, Nakadai K, Okuno HG (2009) Automatic speech recognition improved by two-layered audio-visual integration for robot audition. In: Proceedings of the 9th IEEE-RAS International Conference on Humanoid Robots, Paris, pp 604–609
    • (2009) pp 604–609
    • Yoshida, T.1    Nakadai, K.2    Okuno, H.G.3
  • 52
    • 84939954092
    • The HTK Book (for HTK Version 3.4)
    • Young S, Evermann G, Gales M, Hain T, Liu XA, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2009) The HTK Book (for HTK Version 3.4). Cambridge University Engineering Department


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.