Volume 32, Issue 3, 2015, Pages 35-52

Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends

Author keywords

[No Author keywords available]

Indexed keywords

COMPLEX NETWORKS; HIDDEN MARKOV MODELS; MARKOV PROCESSES; SPEECH; TRELLIS CODES

EID: 85032750981     PISSN: 10535888     EISSN: None     Source Type: Journal    
DOI: 10.1109/MSP.2014.2359987     Document Type: Review
Times cited: 238

References (86)
  • 1
    • 84876687945
    • Speech synthesis based on hidden Markov models
    • K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proc. IEEE, vol. 101, no. 5, pp. 1234-1252, 2013.
    • (2013) Proc. IEEE , vol.101 , Issue.5 , pp. 1234-1252
    • Tokuda, K.1    Nankaku, Y.2    Toda, T.3    Zen, H.4    Yamagishi, J.5    Oura, K.6
  • 2
    • 57749193836
    • Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory
    • T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222-2235, 2007.
    • (2007) IEEE Trans. Audio Speech Lang. Process. , vol.15 , Issue.8 , pp. 2222-2235
    • Toda, T.1    Black, A.2    Tokuda, K.3
  • 3
    • 33846405723
    • Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005
    • H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. Inf. Syst., vol. E90-D, no. 1, pp. 325-333, 2007.
    • (2007) IEICE Trans. Inf. Syst. , vol.E90-D , Issue.1 , pp. 325-333
    • Zen, H.1    Toda, T.2    Nakamura, M.3    Tokuda, K.4
  • 5
    • 67651002140
    • Statistical parametric speech synthesis
    • H. Zen, K. Tokuda, and A. Black, "Statistical parametric speech synthesis," Speech Commun., vol. 51, no. 11, pp. 1039-1064, 2009.
    • (2009) Speech Commun. , vol.51 , Issue.11 , pp. 1039-1064
    • Zen, H.1    Tokuda, K.2    Black, A.3
  • 8
    • 33749573927
    • Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences
    • H. Zen, K. Tokuda, and T. Kitamura, "Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences," Comput. Speech Lang., vol. 21, no. 1, pp. 153-173, 2006.
    • (2006) Comput. Speech Lang. , vol.21 , Issue.1 , pp. 153-173
    • Zen, H.1    Tokuda, K.2    Kitamura, T.3
  • 10
    • 84897902941
    • Statistical parametric speech synthesis based on Gaussian process regression
    • T. Koriyama, T. Nose, and T. Kobayashi, "Statistical parametric speech synthesis based on Gaussian process regression," IEEE J. Select. Topics Signal Processing, vol. 8, no. 2, pp. 173-183, 2014.
    • (2014) IEEE J. Select. Topics Signal Processing , vol.8 , Issue.2 , pp. 173-183
    • Koriyama, T.1    Nose, T.2    Kobayashi, T.3
  • 12
    • 84867214032
    • Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis
    • Y.-J. Wu and K. Tokuda, "Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis," in Proc. Interspeech, 2008, pp. 577-580.
    • Proc. Interspeech, 2008 , pp. 577-580
    • Wu, Y.-J.1    Tokuda, K.2
  • 13
    • 38549096029
    • A speech parameter-generation algorithm considering global variance for HMM-based speech synthesis
    • T. Toda and K. Tokuda, "A speech parameter-generation algorithm considering global variance for HMM-based speech synthesis," IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816-824, 2007.
    • (2007) IEICE Trans. Inf. Syst. , vol.E90-D , Issue.5 , pp. 816-824
    • Toda, T.1    Tokuda, K.2
  • 14
    • 77953715694
    • Statistical text-to-speech synthesis based on segment-wise representation with a norm constraint
    • T. Tiomkin, D. Malah, and S. Shechtman, "Statistical text-to-speech synthesis based on segment-wise representation with a norm constraint," IEEE Trans. Audio Speech Lang. Processing, vol. 18, no. 5, pp. 1077-1082, 2010.
    • (2010) IEEE Trans. Audio Speech Lang. Processing , vol.18 , Issue.5 , pp. 1077-1082
    • Tiomkin, T.1    Malah, D.2    Shechtman, S.3
  • 15
    • 84901793334
    • Minimum Kullback-Leibler divergence parameter generation for HMM-based speech synthesis
    • Z.-H. Ling and L.-R. Dai, "Minimum Kullback-Leibler divergence parameter generation for HMM-based speech synthesis," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 5, pp. 1492-1502, 2012.
    • (2012) IEEE Trans. Audio Speech Lang. Processing , vol.20 , Issue.5 , pp. 1492-1502
    • Ling, Z.-H.1    Dai, L.-R.2
  • 16
    • 33745805403
    • A fast learning algorithm for deep belief nets
    • G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computat., vol. 18, no. 7, pp. 1527-1554, 2006.
    • (2006) Neural Computat. , vol.18 , Issue.7 , pp. 1527-1554
    • Hinton, G.1    Osindero, S.2    Teh, Y.-W.3
  • 17
    • 33746600649
    • Reducing the dimensionality of data with neural networks
    • G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
    • (2006) Science , vol.313 , Issue.5786 , pp. 504-507
    • Hinton, G.1    Salakhutdinov, R.2
  • 19
    • 0000329993
    • Information processing in dynamical systems: Foundations of harmony theory
    • P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, vol. 1, ch. 6, pp. 194-281.
    • (1986) Parallel Distributed Processing , vol.1 , pp. 194-281
    • Smolensky, P.1
  • 21
    • 79551480483
    • Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion
    • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371-3408, Dec. 2010.
    • (2010) J. Mach. Learn. Res. , vol.11 , pp. 3371-3408
    • Vincent, P.1    Larochelle, H.2    Lajoie, I.3    Bengio, Y.4    Manzagol, P.5
  • 24
    • 84901237776
    • Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis
    • Z.-H. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," IEEE Trans. Audio Speech Lang. Processing, vol. 21, no. 10, pp. 2129-2139, 2013.
    • (2013) IEEE Trans. Audio Speech Lang. Processing , vol.21 , Issue.10 , pp. 2129-2139
    • Ling, Z.-H.1    Deng, L.2    Yu, D.3
  • 27
    • 84906225084
    • Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion
    • L.-H. Chen, Z.-H. Ling, Y. Song, and L.-R. Dai, "Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion," in Proc. Interspeech, 2013, pp. 3052-3056.
    • Proc. Interspeech, 2013 , pp. 3052-3056
    • Chen, L.-H.1    Ling, Z.-H.2    Song, Y.3    Dai, L.-R.4
  • 29
    • 84889579519
    • Conditional restricted Boltzmann machine for voice conversion
    • Z.-Z. Wu, E.S. Chng, and H.-Z. Li, "Conditional restricted Boltzmann machine for voice conversion," in Proc. ChinaSIP, 2013, pp. 104-108.
    • Proc. ChinaSIP, 2013 , pp. 104-108
    • Wu, Z.-Z.1    Chng, E.S.2    Li, H.-Z.3
  • 30
  • 31
    • 84906279378
    • Speech enhancement with weighted denoising autoencoder
    • B.-Y. Xia and C.-C. Bao, "Speech enhancement with weighted denoising autoencoder," in Proc. Interspeech, 2013, pp. 3444-3448.
    • Proc. Interspeech, 2013 , pp. 3444-3448
    • Xia, B.-Y.1    Bao, C.-C.2
  • 32
    • 84889257121
    • An experimental study on speech enhancement based on deep neural networks
    • Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Lett., vol. 21, no. 1, pp. 65-68, 2014.
    • (2014) IEEE Signal Processing Lett. , vol.21 , Issue.1 , pp. 65-68
    • Xu, Y.1    Du, J.2    Dai, L.-R.3    Lee, C.-H.4
  • 34
    • 84929157442
    • Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
    • H. Lu, S. King, and O. Watts, "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis," in Proc. ISCA SSW8, 2013, pp. 261-265.
    • Proc. ISCA SSW8, 2013 , pp. 261-265
    • Lu, H.1    King, S.2    Watts, O.3
  • 35
    • 84910030421
    • Statistical parametric speech synthesis using weighted multi-distribution deep belief network
    • S.-Y. Kang and H. Meng, "Statistical parametric speech synthesis using weighted multi-distribution deep belief network," in Proc. Interspeech, 2014, pp. 1959-1963.
    • Proc. Interspeech, 2014 , pp. 1959-1963
    • Kang, S.-Y.1    Meng, H.2
  • 40
    • 85032764981
    • Dynamic noise aware training for speech enhancement based on deep neural networks
    • Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "Dynamic noise aware training for speech enhancement based on deep neural networks," in Proc. Interspeech (to be published).
    • Proc. Interspeech
    • Xu, Y.1    Du, J.2    Dai, L.-R.3    Lee, C.-H.4
  • 43
    • 33847129573
    • Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training
    • J. Yamagishi and T. Kobayashi, "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training," IEICE Trans. Inf. Syst., vol. E90-D, no. 2, pp. 533-543, 2007.
    • (2007) IEICE Trans. Inf. Syst. , vol.E90-D , Issue.2 , pp. 533-543
    • Yamagishi, J.1    Kobayashi, T.2
  • 44
    • 24144497811
    • Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis
    • J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, "Acoustic modeling of speaking styles and emotional expressions in HMM-based speech synthesis," IEICE Trans. Inf. Syst., vol. E88-D, no. 3, pp. 503-509, 2005.
    • (2005) IEICE Trans. Inf. Syst. , vol.E88-D , Issue.3 , pp. 503-509
    • Yamagishi, J.1    Onishi, K.2    Masuko, T.3    Kobayashi, T.4
  • 45
    • 29144475179
    • Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing
    • M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi, "Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing," IEICE Trans. Inf. Syst., vol. E88-D, no. 11, pp. 2484-2491, 2005.
    • (2005) IEICE Trans. Inf. Syst. , vol.E88-D , Issue.11 , pp. 2484-2491
    • Tachibana, M.1    Yamagishi, J.2    Masuko, T.3    Kobayashi, T.4
  • 46
    • 51449114529
    • A style control technique for HMM-based expressive speech synthesis
    • T. Nose, J. Yamagishi, and T. Kobayashi, "A style control technique for HMM-based expressive speech synthesis," IEICE Trans. Inf. Syst., vol. E90-D, no. 9, pp. 1406-1413, 2007.
    • (2007) IEICE Trans. Inf. Syst. , vol.E90-D , Issue.9 , pp. 1406-1413
    • Nose, T.1    Yamagishi, J.2    Kobayashi, T.3
  • 47
    • 84862291337
    • Vocal tract length normalization for statistical parametric speech synthesis
    • L. Saheer, J. Dines, and P. N. Garner, "Vocal tract length normalization for statistical parametric speech synthesis," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 7, pp. 2134-2148, 2012.
    • (2012) IEEE Trans. Audio Speech Lang. Processing , vol.20 , Issue.7 , pp. 2134-2148
    • Saheer, L.1    Dines, J.2    Garner, P.N.3
  • 49
    • 84869440340
    • Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression
    • Z.-H. Ling, K. Richmond, and J. Yamagishi, "Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression," IEEE Trans. Audio Speech Lang. Processing, vol. 21, no. 1, pp. 207-219, 2013.
    • (2013) IEEE Trans. Audio Speech Lang. Processing , vol.21 , Issue.1 , pp. 207-219
    • Ling, Z.-H.1    Richmond, K.2    Yamagishi, J.3
  • 51
    • 79955538498
    • Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis
    • K. Yu, H. Zen, F. Mairesse, and S. Young, "Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis," Speech Commun., vol. 53, no. 6, pp. 914-923, 2011.
    • (2011) Speech Commun. , vol.53 , Issue.6 , pp. 914-923
    • Yu, K.1    Zen, H.2    Mairesse, F.3    Young, S.4
  • 53
  • 63
    • 84865698185
    • Statistical voice conversion techniques for body-conducted unvoiced speech enhancement
    • T. Toda, M. Nakagiri, and K. Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Trans. Audio Speech Lang. Processing, vol. 20, no. 9, pp. 2505-2517, 2012.
    • (2012) IEEE Trans. Audio Speech Lang. Processing , vol.20 , Issue.9 , pp. 2505-2517
    • Toda, T.1    Nakagiri, M.2    Shikano, K.3
  • 64
    • 38649140222
    • Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model
    • T. Toda, A. Black, and K. Tokuda, "Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model," Speech Commun., vol. 50, pp. 215-227, 2008.
    • (2008) Speech Commun. , vol.50 , pp. 215-227
    • Toda, T.1    Black, A.2    Tokuda, K.3
  • 65
    • 84994214710
    • Deep learning in speech synthesis
    • H. Zen. (2013). Deep learning in speech synthesis. Keynote speech given at ISCA SSW8. [Online]. Available: http://research.google.com/pubs/archive/41539.pdf
    • (2013) Keynote Speech Given at ISCA SSW8. [Online]
    • Zen, H.1
  • 67
    • 0036165806
    • An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition
    • J. Sun and L. Deng, "An overlapping-feature based phonological model incorporating linguistic constraints: Applications to speech recognition," J. Acoust. Soc. Am., vol. 111, pp. 1086-1101, 2002.
    • (2002) J. Acoust. Soc. Am. , vol.111 , pp. 1086-1101
    • Sun, J.1    Deng, L.2
  • 68
    • 0031198059
    • Production models as a structural basis for automatic speech recognition
    • L. Deng, G. Ramsay, and D. Sun, "Production models as a structural basis for automatic speech recognition," Speech Commun., vol. 33, nos. 2-3, pp. 93-111, Aug. 1997.
    • (1997) Speech Commun. , vol.33 , Issue.2-3 , pp. 93-111
    • Deng, L.1    Ramsay, G.2    Sun, D.3
  • 69
    • 33744966595
    • Switching dynamic system models for speech articulation and acoustics
    • L. Deng, "Switching dynamic system models for speech articulation and acoustics," in Mathematical Foundations of Speech and Language Processing. New York: Springer-Verlag, 2003, pp. 115-134.
    • (2003) Mathematical Foundations of Speech and Language Processing , pp. 115-134
    • Deng, L.1
  • 72
    • 84055222005
    • Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition
    • G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Speech Audio Processing, vol. 20, no. 1, pp. 30-42, 2012.
    • (2012) IEEE Trans. Speech Audio Processing , vol.20 , Issue.1 , pp. 30-42
    • Dahl, G.1    Yu, D.2    Deng, L.3    Acero, A.4
  • 74
    • 84886829539
    • Optimization techniques to improve training speed of deep neural networks for large speech tasks
    • T.N. Sainath, B. Kingsbury, H. Soltau, and B. Ramabhadran, "Optimization techniques to improve training speed of deep neural networks for large speech tasks," IEEE Trans. Audio Speech Lang. Processing, vol. 21, no. 11, pp. 2267-2276, 2013.
    • (2013) IEEE Trans. Audio Speech Lang. Processing , vol.21 , Issue.11 , pp. 2267-2276
    • Sainath, T.N.1    Kingsbury, B.2    Soltau, H.3    Ramabhadran, B.4
  • 75
    • 84872300403
    • Deep belief networks based voice activity detection
    • X.-L. Zhang and J. Wu, "Deep belief networks based voice activity detection," IEEE Trans. Audio Speech Lang. Processing, vol. 21, no. 4, pp. 697-710, 2013.
    • (2013) IEEE Trans. Audio Speech Lang. Processing , vol.21 , Issue.4 , pp. 697-710
    • Zhang, X.-L.1    Wu, J.2
  • 77
    • 0013344078
    • Training products of experts by minimizing contrastive divergence
    • G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computat., vol. 14, no. 8, pp. 1771-1800, 2002.
    • (2002) Neural Computat. , vol.14 , Issue.8 , pp. 1771-1800
    • Hinton, G.1
  • 80
    • 44049116681
    • Connectionist learning of belief networks
    • R. Neal, "Connectionist learning of belief networks," Artificial Intell., vol. 56, no. 1, pp. 71-113, 1992.
    • (1992) Artificial Intell. , vol.56 , Issue.1 , pp. 71-113
    • Neal, R.1
  • 81
    • 0022471098
    • Learning representations by backpropagating errors
    • D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by backpropagating errors," Nature, vol. 323, no. 6088, pp. 533-536, 1986.
    • (1986) Nature , vol.323 , Issue.6088 , pp. 533-536
    • Rumelhart, D.1    Hinton, G.2    Williams, R.3
  • 82
    • 0041914606
    • Gradient flow in recurrent nets: The difficulty of learning long-term dependencies
    • S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. Kremer and J. Kolen, Eds. Piscataway, NJ: IEEE Press, 2001, pp. 237-244.
    • (2001) A Field Guide to Dynamical Recurrent Neural Networks , pp. 237-244
    • Hochreiter, S.1    Bengio, Y.2    Frasconi, P.3    Schmidhuber, J.4
  • 84
    • 0032673049
    • Restructuring speech representations using pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds
    • H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., vol. 27, nos. 3-4, pp. 187-207, 1999.
    • (1999) Speech Commun. , vol.27 , Issue.3-4 , pp. 187-207
    • Kawahara, H.1    Masuda-Katsuse, I.2    De Cheveigne, A.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.