SCOPUS Information Retrieval Platform

EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings

Volume , Issue , 2016, Pages 2122-2132

How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation

(6) Liu, Chia Wei a Lowe, Ryan a Serban, Iulian V b Noseworthy, Michael a Charlin, Laurent a Pineau, Joelle a

a MCGILL UNIVERSITY (Canada)

b UNIVERSITÉ DE MONTRÉAL (Canada)

Author keywords

[No Author keywords available]

Indexed keywords

SPEECH PROCESSING;

AUTOMATIC EVALUATION; DIALOGUE SYSTEMS; EMPIRICAL STUDIES; EVALUATION METRICS; MACHINE TRANSLATIONS; RESPONSE GENERATION; TARGET RESPONSE;

NATURAL LANGUAGE PROCESSING SYSTEMS;

EID: 85072827450 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.18653/v1/d16-1230 Document Type: Conference Paper

Times cited : (1271)

References (45)

1
- 70350666834
- Semi-formal evaluation of conversational characters
- Springer
- R. Artstein, S. Gandhe, J. Gerten, A. Leuski, and D. Traum. 2009. Semi-formal evaluation of conversational characters. In Languages: From Formal to Natural, pages 22-35. Springer.
- (2009) Languages: From Formal to Natural , pp. 22-35
- Artstein, R.¹ Gandhe, S.² Gerten, J.³ Leuski, A.⁴ Traum, D.⁵

2
- 85116156579
- Meteor: An automatic metric for mt evaluation with improved correlation with human judgments
- S. Banerjee and A. Lavie. 2005. METEOR: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization.
- (2005) Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
- Banerjee, S.¹ Lavie, A.²

3
- 85122640794
- Findings of the 2014 workshop on statistical machine translation
- Association for Computational Linguistics Baltimore, MD, USA
- O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12-58. Association for Computational Linguistics Baltimore, MD, USA.
- (2014) Proceedings of the Ninth Workshop on Statistical Machine Translation , pp. 12-58
- Bojar, O.¹ Buck, C.² Federmann, C.³ Haddow, B.⁴ Koehn, P.⁵ Leveling, J.⁶ Monz, C.⁷ Pecina, P.⁸ Post, M.⁹ Saint-Amand, H.¹⁰

4
- 84859906372
- Correlating human and automatic evaluation of a German surface realiser
- Association for Computational Linguistics
- A. Cahill. 2009. Correlating human and automatic evaluation of a german surface realiser. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 97-100. Association for Computational Linguistics.
- (2009) Proceedings of the ACL-IJCNLP 2009 Conference Short Papers , pp. 97-100
- Cahill, A.¹

5
- 84893361786
- Re-evaluating the role of bleu in machine translation research
- C. Callison-Burch, M. Osborne, and P. Koehn. 2006. Re-evaluating the role of bleu in machine translation research. In EACL, volume 6, pages 249-256.
- (2006) EACL , vol.6 , pp. 249-256
- Callison-Burch, C.¹ Osborne, M.² Koehn, P.³

6
- 84926313951
- Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation
- Association for Computational Linguistics
- C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O. F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17-53. Association for Computational Linguistics.
- (2010) Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR , pp. 17-53
- Callison-Burch, C.¹ Koehn, P.² Monz, C.³ Peterson, K.⁴ Przybocki, M.⁵ Zaidan, O.F.⁶

7
- 85122015378
- Findings of the 2011 workshop on statistical machine translation
- Association for Computational Linguistics
- C. Callison-Burch, P. Koehn, C. Monz, and O. F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22-64. Association for Computational Linguistics.
- (2011) Proceedings of the Sixth Workshop on Statistical Machine Translation , pp. 22-64
- Callison-Burch, C.¹ Koehn, P.² Monz, C.³ Zaidan, O.F.⁴

8
- 85122610414
- A systematic comparison of smoothing techniques for sentence-level bleu
- B. Chen and C. Cherry. 2014. A systematic comparison of smoothing techniques for sentence-level bleu. ACL 2014, page 362.
- (2014) ACL , vol.2014 , pp. 362
- Chen, B.¹ Cherry, C.²

9
- 58149412516
- Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit
- J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213.
- (1968) Psychological Bulletin , vol.70 , Issue.4 , pp. 213
- Cohen, J.¹

10
- 80053285631
- Further meta-evaluation of broad-coverage surface realization
- Association for Computational Linguistics
- D. Espinosa, R. Rajkumar, M. White, and S. Berleant. 2010. Further meta-evaluation of broad-coverage surface realization. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 564-574. Association for Computational Linguistics.
- (2010) Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing , pp. 564-574
- Espinosa, D.¹ Rajkumar, R.² White, M.³ Berleant, S.⁴

11
- 80053431219
- The measurement of textual coherence with latent semantic analysis
- P. W. Foltz, W. Kintsch, and T. K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse processes, 25(2-3):285-307.
- (1998) Discourse Processes , vol.25 , Issue.2-3 , pp. 285-307
- Foltz, P.W.¹ Kintsch, W.² Landauer, T.K.³

12
- 85030462162
- G. Forgues, J. Pineau, J.-M. Larcheveque, and R. Tremblay. 2014. Bootstrapping dialog systems with word embeddings.
- (2014) Bootstrapping Dialog Systems with Word Embeddings
- Forgues, G.¹ Pineau, J.² Larcheveque, J.-M.³ Tremblay, R.⁴

13
- 84944036795
- Deltableu: A discriminative metric for generation tasks with intrinsically diverse targets
- M. Galley, C. Brockett, A. Sordoni, Y. Ji, M. Auli, C. Quirk, M. Mitchell, J. Gao, and B. Dolan. 2015a. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (Short Papers).
- (2015) Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (Short Papers)
- Galley, M.¹ Brockett, C.² Sordoni, A.³ Ji, Y.⁴ Auli, M.⁵ Quirk, C.⁶ Mitchell, M.⁷ Gao, J.⁸ Dolan, B.⁹

14
- 84988438299
- arXiv preprint
- M. Galley, C. Brockett, A. Sordoni, Y. Ji, M. Auli, C. Quirk, M. Mitchell, J. Gao, and B. Dolan. 2015b. deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv preprint arXiv:1506.06863.
- (2015) Deltableu: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
- Galley, M.¹ Brockett, C.² Sordoni, A.³ Ji, Y.⁴ Auli, M.⁵ Quirk, C.⁶ Mitchell, M.⁷ Gao, J.⁸ Dolan, B.⁹

15
- 84944091060
- Accurate evaluation of segment-level machine translation metrics
- Citeseer
- Y. Graham, N. Mathur, and T. Baldwin. 2015. Accurate evaluation of segment-level machine translation metrics. In Proc. of NAACL-HLT, pages 1183-1191. Citeseer.
- (2015) Proc. Of NAACL-HLT , pp. 1183-1191
- Graham, Y.¹ Mathur, N.² Baldwin, T.³

16
- 84906979661
- arXiv preprint
- A. Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- (2013) Generating Sequences with Recurrent Neural Networks
- Graves, A.¹

17
- 0031573117
- Long short-term memory
- S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.
- (1997) Neural Computation , vol.9 , Issue.8 , pp. 1735-1780
- Hochreiter, S.¹ Schmidhuber, J.²

18
- 22944467425
- Toward finely differentiated evaluation metrics for machine translation
- E. Hovy. 1999. Toward finely differentiated evaluation metrics for machine translation. In Proceedings of the Eagles Workshop on Standards and Evaluation.
- (1999) Proceedings of the Eagles Workshop on Standards and Evaluation
- Hovy, E.¹

19
- 79951776486
- Morgan Claypool
- K. Jokinen and M. McTear. 2009. Spoken Dialogue Systems. Morgan Claypool.
- (2009) Spoken Dialogue Systems
- Jokinen, K.¹ McTear, M.²

20
- 0028839533
- User interfaces for voice applications
- C. Kamm. 1995. User interfaces for voice applications. Proceedings of the National Academy of Sciences, 92(22):10031-10037.
- (1995) Proceedings of the National Academy of Sciences , vol.92 , Issue.22 , pp. 10031-10037
- Kamm, C.¹

21
- 84965153327
- Skip-thought vectors
- R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3276-3284.
- (2015) Advances in Neural Information Processing Systems , pp. 3276-3284
- Kiros, R.¹ Zhu, Y.² Salakhutdinov, R.R.³ Zemel, R.⁴ Urtasun, R.⁵ Torralba, A.⁶ Fidler, S.⁷

22
- 0000600219
- A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge
- T. K. Landauer and S. T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211.
- (1997) Psychological Review , vol.104 , Issue.2 , pp. 211
- Landauer, T.K.¹ Dumais, S.T.²

23
- 84901809785
- Utilizing human-to-human conversation examples for a multi domain chat-oriented dialog system
- N. Lasguido, S. Sakti, G. Neubig, T. Tomoki, and S. Nakamura. 2014. Utilizing human-to-human conversation examples for a multi domain chat-oriented dialog system. IEICE TRANSACTIONS on Information and Systems, 97(6):1497-1505.
- (2014) IEICE TRANSACTIONS on Information and Systems , vol.97 , Issue.6 , pp. 1497-1505
- Lasguido, N.¹ Sakti, S.² Neubig, G.³ Tomoki, T.⁴ Nakamura, S.⁵

24
- 84980339123
- arXiv preprint
- J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.
- (2015) A Diversity-Promoting Objective Function for Neural Conversation Models
- Li, J.¹ Galley, M.² Brockett, C.³ Gao, J.⁴ Dolan, B.⁵

25
- 85020039916
- arXiv preprint
- J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.
- (2016) A Persona-Based Neural Conversation Model
- Li, J.¹ Galley, M.² Brockett, C.³ Gao, J.⁴ Dolan, B.⁵

26
- 26944501715
- Rouge: A package for automatic evaluation of summaries
- C.-Y. Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, volume 8.
- (2004) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop , vol.8
- Lin, C.-Y.¹

27
- 84988430909
- The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems
- R. Lowe, N. Pow, I. V. Serban, and J. Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In SIGDIAL.
- (2015) SIGDIAL
- Lowe, R.¹ Pow, N.² Serban, I.V.³ Pineau, J.⁴

28
- 84898956512
- Distributed representations of words and phrases and their compositionality
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111-3119.
- (2013) Advances in Neural Information Processing Systems , pp. 3111-3119
- Mikolov, T.¹ Sutskever, I.² Chen, K.³ Corrado, G.S.⁴ Dean, J.⁵

29
- 84859927665
- Vector-based models of semantic composition
- J. Mitchell and M. Lapata. 2008. Vector-based models of semantic composition. In ACL, pages 236-244.
- (2008) ACL , pp. 236-244
- Mitchell, J.¹ Lapata, M.²

30
- 38349073645
- MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations
- S. Möller, R. Englert, K. Engelbrecht, V. Hafner, A. Jameson, A. Oulasvirta, A. Raake, and N. Reithinger. 2006. MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In INTERSPEECH.
- (2006) INTERSPEECH
- Möller, S.¹ Englert, R.² Engelbrecht, K.³ Hafner, V.⁴ Jameson, A.⁵ Oulasvirta, A.⁶ Raake, A.⁷ Reithinger, N.⁸

31
- 85133336275
- BLEU: A method for automatic evaluation of machine translation
- K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002a. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on Association for Computational Linguistics (ACL).
- (2002) Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL)
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.⁴

32
- 1242352804
- Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results
- K. Papineni, S. Roukos, T. Ward, J. Henderson, and F. Reeder. 2002b. Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In Proceedings of the second international conference on Human Language Technology Research, pages 132-137.
- (2002) Proceedings of the Second International Conference on Human Language Technology Research , pp. 132-137
- Papineni, K.¹ Roukos, S.² Ward, T.³ Henderson, J.⁴ Reeder, F.⁵

33
- 71749094730
- An investigation into the validity of some metrics for automatically evaluating natural language generation systems
- E. Reiter and A. Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529-558.
- (2009) Computational Linguistics , vol.35 , Issue.4 , pp. 529-558
- Reiter, E.¹ Belz, A.²

34
- 84858376012
- Unsupervised modeling of twitter conversations
- A. Ritter, C. Cherry, and B. Dolan. 2010. Unsupervised modeling of twitter conversations. In North American Chapter of the Association for Computational Linguistics (NAACL).
- (2010) North American Chapter of the Association for Computational Linguistics (NAACL)
- Ritter, A.¹ Cherry, C.² Dolan, B.³

35
- 80053292690
- Data-driven response generation in social media
- Association for Computational Linguistics
- A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing, pages 583-593. Association for Computational Linguistics.
- (2011) Proceedings of the Conference on Empirical Methods in Natural Language Processing , pp. 583-593
- Ritter, A.¹ Cherry, C.² Dolan, W.B.³

36
- 85036049150
- A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics
- Stroudsburg, PA, USA. Association for Computational Linguistics
- V. Rus and M. Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157-162, Stroudsburg, PA, USA. Association for Computational Linguistics.
- (2012) Proceedings of the Seventh Workshop on Building Educational Applications Using NLP , pp. 157-162
- Rus, V.¹ Lintean, M.²

37
- 84857755459
- Quantitative evaluation of user simulation techniques for spoken dialogue systems
- J. Schatzmann, K. Georgila, and S. Young. 2005. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In 6th Special Interest Group on Discourse and Dialogue (SIGDIAL).
- (2005) 6th Special Interest Group on Discourse and Dialogue (SIGDIAL)
- Schatzmann, J.¹ Georgila, K.² Young, S.³

38
- 84994160039
- Building end-to-end dialogue systems using generative hierarchical neural networks
- I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2015. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Networks. In AAAI Conference on Artificial Intelligence.
- (2015) AAAI Conference on Artificial Intelligence
- Serban, I.V.¹ Sordoni, A.² Bengio, Y.³ Courville, A.⁴ Pineau, J.⁵

39
- 85030483080
- arXiv preprint
- I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. 2016. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.
- (2016) A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues
- Serban, I.V.¹ Sordoni, A.² Lowe, R.³ Charlin, L.⁴ Pineau, J.⁵ Courville, A.⁶ Bengio, Y.⁷

40
- 84960121226
- A neural network approach to context-sensitive generation of conversational responses
- A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015).
- (2015) Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015)
- Sordoni, A.¹ Galley, M.² Auli, M.³ Brockett, C.⁴ Ji, Y.⁵ Mitchell, M.⁶ Nie, J.⁷ Gao, J.⁸ Dolan, B.⁹

41
- 24344465910
- Evaluating evaluation methods for generation in the presence of variation
- Springer
- A. Stent, M. Marge, and M. Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 341-351. Springer.
- (2005) International Conference on Intelligent Text Processing and Computational Linguistics , pp. 341-351
- Stent, A.¹ Marge, M.² Singhai, M.³

42
- 84980377939
- arXiv preprint
- O. Vinyals and Q. Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- (2015) A Neural Conversational Model
- Vinyals, O.¹ Le, Q.²

43
- 85065183198
- Paradise: A framework for evaluating spoken dialogue agents
- M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. Paradise: A framework for evaluating spoken dialogue agents. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 271-280. ACL.
- (1997) Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics , pp. 271-280
- Walker, M.¹ Litman, D.² Kamm, C.³ Abella, A.⁴

44
- 84959897734
- arXiv preprint
- T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and S. Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv preprint arXiv:1508.01745.
- (2015) Semantically Conditioned LSTM-Based Natural Language Generation for Spoken Dialogue Systems
- Wen, T.-H.¹ Gasic, M.² Mrksic, N.³ Su, P.-H.⁴ Vandyke, D.⁵ Young, S.⁶

45
- 84976623253
- CoRR, abs/1511.08198
- J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. 2015. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198.
- (2015) Towards Universal Paraphrastic Sentence Embeddings
- Wieting, J.¹ Bansal, M.² Gimpel, K.³ Livescu, K.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.