1. O. Abdel-Hamid and H. Jiang. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code. In IEEE ICASSP, 2013.
2. Y. Agiomyrgiannakis and Z. Roupakia. Voice morphing that improves TTS quality using an optimal dynamic frequency warping-and-weighting transform. In IEEE ICASSP, 2016.
3. D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173-182, 2016.
4. S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi. Deep Voice: Real-time neural text-to-speech. In ICML, 2017a.
5. S. Ö. Arik, G. F. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In NIPS, pages 2966-2974, 2017b.
6. S. Azadi, M. Fisher, V. Kim, Z. Wang, E. Shechtman, and T. Darrell. Multi-content GAN for few-shot font style transfer. CoRR, abs/1708.02182, 2017.
7. L. H. Chen, Z. H. Ling, L. J. Liu, and L. R. Dai. Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
9. S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad. Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing, 2010.
10. H. T. Hwang, Y. Tsao, H. M. Wang, Y. R. Wang, and S. H. Chen. A probabilistic interpretation for artificial neural network-based voice conversion. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2015.
11. R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
12. T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
13. B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS, 2013.
15. B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015.
16. X. Li and X. Wu. Modeling speaker variability using long short-term memory networks for speech recognition. In INTERSPEECH, 2015.
17. S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
19. Y. Miao, H. Zhang, and F. Metze. Speaker adaptive training of deep neural network acoustic models using i-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015.
20. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016a.
21. A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, 2016b.
23. W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR, 2018.
24. S. Prince and J. Elder. Probabilistic linear discriminant analysis for inferences about identity. In ICCV, 2007.
25. S. E. Reed, Y. Chen, T. Paine, A. van den Oord, S. M. A. Eslami, D. J. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. CoRR, 2017.
27. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. arXiv preprint arXiv:1712.05884, 2017.
28. D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur. Deep neural network-based speaker embeddings for end-to-end speaker verification. In IEEE Spoken Language Technology Workshop (SLT), pages 165-170, 2016.
29. J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio. Char2Wav: End-to-end speech synthesis. 2017.
30. Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. In ICLR, 2018.
31. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
33. Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous. Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR, abs/1703.10135, 2017.
34. M. Wester, Z. Wu, and J. Yamagishi. Analysis of the voice conversion challenge 2016 evaluation results. In INTERSPEECH, pages 1637-1641, 2016.
35. Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang. Locally linear embedding for exemplar-based spectral conversion. In INTERSPEECH, pages 1652-1656, 2016.
36. Z. Wu, P. Swietojanski, C. Veaux, S. Renals, and S. King. A study of speaker adaptation for DNN-based speech synthesis. In INTERSPEECH, 2015.
37. S. Xue, O. Abdel-Hamid, H. Jiang, L. Dai, and Q. Liu. Fast adaptation of deep neural network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
38. J. Yamagishi, T. Kobayashi, Y. Nakano, K. Ogata, and J. Isogai. Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 2009.
39. D. Yu, K. Yao, H. Su, G. Li, and F. Seide. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In IEEE ICASSP, 2013.