SCOPUS Information Search Platform

6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings

Volume , Issue , 2018, Pages

Deep Voice 3: Scaling text-to-speech with convolutional sequence learning

(8) Ping, Wei (a); Peng, Kainan (a); Gibiansky, Andrew (a); Arık, Sercan Ö (a); Kannan, Ajay (a); Narang, Sharan (a); Raiman, Jonathan (b); Miller, John (c)

a BAIDU INC (China)

b OpenAI LLC (United States)

c UNIVERSITY OF CALIFORNIA (United States)

Author keywords

[No Author keywords available]

Indexed keywords

CONVOLUTION; SPEECH SYNTHESIS;

DATA SET SIZE; ERROR MODE; SEQUENCE LEARNING; SPEECH SYNTHESIS SYSTEM; STATE OF THE ART; TEXT TO SPEECH; TEXT-TO-SPEECH SYSTEM; WAVEFORM SYNTHESIS;

DEEP LEARNING;

EID: 85083953940
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: None
Document Type: Conference Paper

Times cited : (286)

References (30)

1
- 84930664922
- Vocaine the vocoder and applications in speech synthesis
- Yannis Agiomyrgiannakis. Vocaine the vocoder and applications in speech synthesis. In ICASSP, 2015.
- (2015) ICASSP
- Agiomyrgiannakis, Y.¹

2
- 85039156048
- Deep voice: Real-time neural text-to-speech
- Sercan Ö. Arık, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep Voice: Real-time neural text-to-speech. In ICML, 2017.
- (2017) ICML
- Arık, S.Ö.¹ Chrzanowski, M.² Coates, A.³ Diamos, G.⁴ Gibiansky, A.⁵ Kang, Y.⁶ Li, X.⁷ Miller, J.⁸ Raiman, J.⁹ Sengupta, S.¹⁰ Shoeybi, M.¹¹

3
- 85046637415
- Deep Voice 2: Multi-speaker neural text-to-speech
- Sercan Ö. Arık, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In NIPS, 2017b.
- (2017) NIPS
- Arık, S.Ö.¹ Diamos, G.² Gibiansky, A.³ Miller, J.⁴ Peng, K.⁵ Ping, W.⁶ Raiman, J.⁷ Zhou, Y.⁸

4
- 84943262548
- Chris Bagwell. Sox - sound exchange. https://sourceforge.net/p/sox/code/ci/master/tree/, 2017.
- (2017) Sox - Sound Exchange
- Bagwell, C.¹

5
- 85083953689
- Neural machine translation by jointly learning to align and translate
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- (2015) ICLR
- Bahdanau, D.¹ Cho, K.² Bengio, Y.³

6
- 85039170210
- Siri on-device deep learning-guided unit selection text-to-speech system
- Tim Capes, Paul Coles, Alistair Conkie, Ladan Golipour, Abie Hadjitarkhani, Qiong Hu, Nancy Huddleston, Melvyn Hunt, Jiangchuan Li, Matthias Neeracher, et al. Siri on-device deep learning-guided unit selection text-to-speech system. In Interspeech, 2017.
- (2017) Interspeech
- Capes, T.¹ Coles, P.² Conkie, A.³ Golipour, L.⁴ Hadjitarkhani, A.⁵ Hu, Q.⁶ Huddleston, N.⁷ Hunt, M.⁸ Li, J.⁹ Neeracher, M.¹⁰

7
- 84961291190
- Learning phrase representations using RNN encoder-decoder for statistical machine translation
- Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
- (2014) EMNLP
- Cho, K.¹ Van Merriënboer, B.² Gulcehre, C.³ Bahdanau, D.⁴ Bougares, F.⁵ Schwenk, H.⁶ Bengio, Y.⁷

8
- 84965139600
- Attention-based models for speech recognition
- Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In NIPS, 2015.
- (2015) NIPS
- Chorowski, J.K.¹ Bahdanau, D.² Serdyuk, D.³ Cho, K.⁴ Bengio, Y.⁵

9
- 85048443641
- Language modeling with gated convolutional networks
- Yann Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, 2017.
- (2017) ICML
- Dauphin, Y.¹ Fan, A.² Auli, M.³ Grangier, D.⁴

10
- 85046994169
- Convolutional sequence to sequence learning
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.
- (2017) ICML
- Gehring, J.¹ Auli, M.² Grangier, D.³ Yarats, D.⁴ Dauphin, Y.⁵

11
- 84994309294
- Recent advances in Google real-time HMM-driven unit selection synthesizer
- Xavi Gonzalvo, Siamak Tazari, Chun-an Chan, Markus Becker, Alexander Gutkin, and Hanna Silen. Recent advances in Google real-time HMM-driven unit selection synthesizer. In Interspeech, 2016.
- (2016) Interspeech
- Gonzalvo, X.¹ Tazari, S.² Chan, C.-A.³ Becker, M.⁴ Gutkin, A.⁵ Silen, H.⁶

12
- 0021407831
- Signal estimation from modified short-time Fourier transform
- Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984.
- (1984) IEEE Transactions on Acoustics, Speech, and Signal Processing
- Griffin, D.¹ Lim, J.²

13
- 0032673049
- Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds
- Hideki Kawahara, Ikuyo Masuda-Katsuse, and Alain De Cheveigne. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds. Speech communication, 1999.
- (1999) Speech Communication
- Kawahara, H.¹ Masuda-Katsuse, I.² De Cheveigne, A.³

14
- 85088227413
- SampleRNN: An unconditional end-to-end neural audio generation model
- Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. In ICLR, 2017.
- (2017) ICLR
- Mehri, S.¹ Kumar, K.² Gulrajani, I.³ Kumar, R.⁴ Jain, S.⁵ Sotelo, J.⁶ Courville, A.⁷ Bengio, Y.⁸

15
- 84976902575
- WORLD: A vocoder-based high-quality speech synthesis system for real-time applications
- Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 2016.
- (2016) IEICE Transactions on Information and Systems
- Morise, M.¹ Yokomori, F.² Ozawa, K.³

16
- 85054969439
- Robert Ochshorn and Max Hawkins. Gentle. https://github.com/lowerquality/gentle, 2017.
- (2017) Gentle
- Ochshorn, R.¹ Hawkins, M.²

17
- 85011070895
- Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
- (2016) WaveNet: A Generative Model for Raw Audio
- Van den Oord, A.¹ Dieleman, S.² Zen, H.³ Simonyan, K.⁴ Vinyals, O.⁵ Graves, A.⁶ Kalchbrenner, N.⁷ Senior, A.⁸ Kavukcuoglu, K.⁹

18
- 84946015916
- Librispeech: An ASR corpus based on public domain audio books
- Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 5206–5210. IEEE, 2015.
- (2015) Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on , pp. 5206-5210
- Panayotov, V.¹ Chen, G.² Povey, D.³ Khudanpur, S.⁴

19
- 85048524283
- Online and linear-time attention by enforcing monotonic alignments
- Colin Raffel, Thang Luong, Peter J Liu, Ron J Weiss, and Douglas Eck. Online and linear-time attention by enforcing monotonic alignments. In ICML, 2017.
- (2017) ICML
- Raffel, C.¹ Luong, T.² Liu, P.J.³ Weiss, R.J.⁴ Eck, D.⁵

20
- 85047003030
- CrowDMOS: An approach for crowdsourcing mean opinion score studies
- Flávio Ribeiro, Dinei Florêncio, Cha Zhang, and Michael Seltzer. Crowdmos: An approach for crowdsourcing mean opinion score studies. In IEEE ICASSP, 2011.
- (2011) IEEE ICASSP
- Ribeiro, F.¹ Florêncio, D.² Zhang, C.³ Seltzer, M.⁴

21
- 85011836388
- A neural attention model for abstractive sentence summarization
- Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. In EMNLP, 2015.
- (2015) EMNLP
- Rush, A.M.¹ Chopra, S.² Weston, J.³

22
- 85017457992
- Weight normalization: A simple reparameterization to accelerate training of deep neural networks
- Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
- (2016) NIPS
- Salimans, T.¹ Kingma, D.P.²

23
- 85122685393
- Char2Wav: End-to-end speech synthesis
- Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2wav: End-to-end speech synthesis. In ICLR workshop, 2017.
- (2017) ICLR Workshop
- Sotelo, J.¹ Mehri, S.² Kumar, K.³ Santos, J.F.⁴ Kastner, K.⁵ Courville, A.⁶ Bengio, Y.⁷

24
- 84928547704
- Sequence to sequence learning with neural networks
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- (2014) NIPS
- Sutskever, I.¹ Vinyals, O.² Le, Q.V.³

25
- 85047513277
- Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. Voice synthesis for in-the-wild speakers via a phonological loop. arXiv:1707.06588, 2017.
- (2017) Voice Synthesis for in-The-Wild Speakers Via A Phonological Loop
- Taigman, Y.¹ Wolf, L.² Polyak, A.³ Nachmani, E.⁴

26
- 84925160976
- Text-to-Speech Synthesis
- Paul Taylor. Text-to-Speech Synthesis. Cambridge University Press, New York, NY, USA, 1st edition, 2009. ISBN 0521899273, 9780521899277.
- (2009) Text-to-Speech Synthesis
- Taylor, P.¹

27
- 85038368581
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
- (2017) Attention Is All You Need
- Vaswani, A.¹ Shazeer, N.² Parmar, N.³ Uszkoreit, J.⁴ Jones, L.⁵ Gomez, A.N.⁶ Kaiser, L.⁷ Polosukhin, I.⁸

28
- 85038442478
- Tacotron: Towards end-to-end speech synthesis
- Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.
- (2017) Interspeech
- Wang, Y.¹ Skerry-Ryan, R.J.² Stanton, D.³ Wu, Y.⁴ Weiss, R.⁵ Jaitly, N.⁶ Yang, Z.⁷ Xiao, Y.⁸ Chen, Z.⁹ Bengio, S.¹⁰ Le, Q.¹¹ Agiomyrgiannakis, Y.¹² Clark, R.¹³ Saurous, R.A.¹⁴

29
- 85008006694
- Robust speaker-adaptive HMM-based text-to-speech synthesis
- Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals. Robust speaker-adaptive HMM-based text-to-speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 2009.
- (2009) IEEE Transactions on Audio, Speech, and Language Processing
- Yamagishi, J.¹ Nose, T.² Zen, H.³ Ling, Z.-H.⁴ Toda, T.⁵ Tokuda, K.⁶ King, S.⁷ Renals, S.⁸

30
- 77953708096
- Thousands of voices for HMM-based speech synthesis–analysis and application of TTS systems built on various ASR corpora
- Junichi Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Yong Guan, Rile Hu, Keiichiro Oura, Yi-Jian Wu, et al. Thousands of voices for HMM-based speech synthesis–analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 2010.
- (2010) IEEE Transactions on Audio, Speech, and Language Processing
- Yamagishi, J.¹ Usabaev, B.² King, S.³ Watts, O.⁴ Dines, J.⁵ Tian, J.⁶ Guan, Y.⁷ Hu, R.⁸ Oura, K.⁹ Wu, Y.-J.¹⁰

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.