SCOPUS Information Retrieval Platform

IEEE Transactions on Pattern Analysis and Machine Intelligence

Volume 39, Issue 4, 2017, Pages 652-663

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

(4) Vinyals, Oriolᵃ Toshev, Alexanderᵃ Bengio, Samyᵃ Erhan, Dumitruᵃ

ᵃ GOOGLE INC (United States)

Author keywords

Image captioning; language model; recurrent neural network; sequence to sequence

Indexed keywords

COMPUTATIONAL LINGUISTICS; NATURAL LANGUAGE PROCESSING SYSTEMS; RECURRENT NEURAL NETWORKS;

IMAGE CAPTIONING; IMAGE DESCRIPTIONS; LANGUAGE MODEL; MACHINE TRANSLATIONS; MICROSOFT RESEARCHES; NATURAL LANGUAGE PROCESSING; SEQUENCE-TO-SEQUENCE; TARGET DESCRIPTIONS;

COMPUTER VISION;

EID: 85015770940 PISSN: 01628828 EISSN: None Source Type: Journal
DOI: 10.1109/TPAMI.2016.2587640 Document Type: Article

Times cited : (846)

References (49)

1
- 84947041871
- ImageNet large scale visual recognition challenge
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis. (IJCV), vol. 115, no. 3, pp. 211-252, 2015.
- (2015) Int. J. Comput. Vis. (IJCV) , vol.115 , Issue.3 , pp. 211-252
- Russakovsky, O.¹ Deng, J.² Su, H.³ Krause, J.⁴ Satheesh, S.⁵ Ma, S.⁶ Huang, Z.⁷ Karpathy, A.⁸ Khosla, A.⁹ Bernstein, M.¹⁰ Berg, A.C.¹¹ Fei-Fei, L.¹²

2
- 78149311145
- Every picture tells a story: Generating sentences from images
- A. Farhadi, "Every picture tells a story: Generating sentences from images," in Proc. 11th Eur. Conf. Comput. Vis.: Part IV, 2010, pp. 15-29.
- (2010) Proc. 11th Eur. Conf. Comput. Vis.: Part IV , pp. 15-29
- Farhadi, A.¹

3
- 80052901011
- Baby talk: Understanding and generating simple image descriptions
- G. Kulkarni, "Baby talk: Understanding and generating simple image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2011, pp. 1601-1608.
- (2011) Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pp. 1601-1608
- Kulkarni, G.¹

4
- 84961291190
- Learning phrase representations using RNN encoder-decoder for statistical machine translation
- K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proc. Empirical Methods Natural Lang. Process., 2014.
- (2014) Proc. Empirical Methods Natural Lang. Process
- Cho, K.¹ Van Merrienboer, B.² Gulcehre, C.³ Bougares, F.⁴ Schwenk, H.⁵ Bengio, Y.⁶

5
- 84922389693
- arXiv:1409.0473
- D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv:1409.0473, 2014.
- (2014) Neural Machine Translation by Jointly Learning to Align and Translate
- Bahdanau, D.¹ Cho, K.² Bengio, Y.³

6
- 84928547704
- Sequence to sequence learning with neural networks
- I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Neural Inf. Process. Syst., 2014.
- (2014) Proc. Neural Inf. Process. Syst.
- Sutskever, I.¹ Vinyals, O.² Le, Q.V.³

7
- 84906347546
- arXiv:1312.6229
- P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv:1312.6229, 2013.
- (2013) Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks
- Sermanet, P.¹ Eigen, D.² Zhang, X.³ Mathieu, M.⁴ Fergus, R.⁵ LeCun, Y.⁶

8
- 0030397830
- Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences
- R. Gerber and H.-H. Nagel, "Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences," in Proc. Int. Conf. Image Process., 1996, pp. 805-808.
- (1996) Proc. Int. Conf. Image Process , pp. 805-808
- Gerber, R.¹ Nagel, H.-H.²

9
- 77954862144
- I2t: Image parsing to text description
- B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, "I2t: Image parsing to text description," in Proc. IEEE, vol. 98, no. 8, pp. 1485-1508, Aug. 2010.
- (2010) Proc. IEEE , vol.98 , Issue.8 , pp. 1485-1508
- Yao, B.Z.¹ Yang, X.² Lin, L.³ Lee, M.W.⁴ Zhu, S.-C.⁵

10
- 84862279067
- Composing simple image descriptions using web-scale n-grams
- S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proc. Conf. Comput. Natural Lang. Learn., 2011, pp. 220-228.
- (2011) Proc. Conf. Comput. Natural Lang. Learn. , pp. 220-228
- Li, S.¹ Kulkarni, G.² Berg, T.L.³ Berg, A.C.⁴ Choi, Y.⁵

11
- 85034832841
- Midge: Generating image descriptions from computer vision detections
- M. Mitchell, "Midge: Generating image descriptions from computer vision detections," in Proc. 13th Conf. Eur. Chapter Assoc. Comput. Linguistics, 2012, pp. 747-756.
- (2012) Proc. 13th Conf. Eur. Chapter Assoc. Comput. Linguistics , pp. 747-756
- Mitchell, M.¹

12
- 80052886947
- Generating image descriptions using dependency relational patterns
- A. Aker and R. Gaizauskas, "Generating image descriptions using dependency relational patterns," in Proc. 48th Annu. Meet. Assoc. Comput. Linguistics, 2010, pp. 1250-1258.
- (2010) Proc. 48th Annu. Meet. Assoc. Comput. Linguistics , pp. 1250-1258
- Aker, A.¹ Gaizauskas, R.²

13
- 84878189119
- Collective generation of natural image descriptions
- P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, "Collective generation of natural image descriptions," in Proc. 50th Annu. Meet. Assoc. Comput. Linguistics: Long Papers-Vol. 1, 2012, pp. 359-368.
- (2012) Proc. 50th Annu. Meet. Assoc. Comput. Linguistics: Long Papers-Vol. 1 , pp. 359-368
- Kuznetsova, P.¹ Ordonez, V.² Berg, A.C.³ Berg, T.L.⁴ Choi, Y.⁵

14
- 84934873221
- Treetalk: Composition and compression of trees for image descriptions
- P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, "Treetalk: Composition and compression of trees for image descriptions," in Proc. Assoc. Comput. Linguistics, vol. 2, no. 10, 2014.
- (2014) Proc. Assoc. Comput. Linguistics , vol.2 , Issue.10
- Kuznetsova, P.¹ Ordonez, V.² Berg, T.³ Choi, Y.⁴

15
- 84906929591
- Image description using visual dependency representations
- D. Elliott and F. Keller, "Image description using visual dependency representations," in Proc. Conf. Empirical Methods Natural Lang. Process., 2013, pp. 1292-1302.
- (2013) Proc. Conf. Empirical Methods Natural Lang. Process , pp. 1292-1302
- Elliott, D.¹ Keller, F.²

16
- 84883394520
- Framing image description as a ranking task: Data, models and evaluation metrics
- M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," J. Artif. Intell. Res., vol. 47, pp. 853-899, 2013.
- (2013) J. Artif. Intell. Res. , vol.47 , pp. 853-899
- Hodosh, M.¹ Young, P.² Hockenmaier, J.³

17
- 84906484732
- Improving image-sentence embeddings using large weakly annotated photo collections
- Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, "Improving image-sentence embeddings using large weakly annotated photo collections," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 529-545.
- (2014) Proc. Eur. Conf. Comput. Vis. , pp. 529-545
- Gong, Y.¹ Wang, L.² Hodosh, M.³ Hockenmaier, J.⁴ Lazebnik, S.⁵

18
- 85162522202
- Im2text: Describing images using 1 million captioned photographs
- V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2text: Describing images using 1 million captioned photographs," in Proc. Neural Inf. Process. Syst., 2011.
- (2011) Proc. Neural Inf. Process. Syst.
- Ordonez, V.¹ Kulkarni, G.² Berg, T.L.³

19
- 84965102873
- arXiv:1505.04467
- J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, "Exploring nearest neighbor approaches for image captioning," arXiv:1505.04467, 2015.
- (2015) Exploring Nearest Neighbor Approaches for Image Captioning
- Devlin, J.¹ Gupta, S.² Girshick, R.³ Mitchell, M.⁴ Zitnick, C.L.⁵

20
- 84990070245
- arXiv:1506.03995
- M. Kolář, M. Hradiš, and P. Zemčík, "Technical report: Image captioning with semantically similar images," arXiv:1506.03995, 2015.
- (2015) Technical Report: Image Captioning with Semantically Similar Images
- Kolář, M.¹ Hradiš, M.² Zemčík, P.³

21
- 84906925854
- Grounded compositional semantics for finding and describing images with sentences
- R. Socher, A. Karpathy, Q. V. Le, C. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," in Proc. Assoc. Comput. Linguistics, 2014.
- (2014) Proc. Assoc. Comput. Linguistics
- Socher, R.¹ Karpathy, A.² Le, Q.V.³ Manning, C.⁴ Ng, A.Y.⁵

22
- 84937843643
- Deep fragment embeddings for bidirectional image sentence mapping
- A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in Proc. Neural Inf. Process. Syst., 2014.
- (2014) Proc. Neural Inf. Process. Syst.
- Karpathy, A.¹ Joulin, A.² Fei-Fei, L.³

23
- 84959250180
- From captions to visual concepts and back
- H. Fang, "From captions to visual concepts and back," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
- (2015) Proc. IEEE Conf. Comput. Vis. Pattern Recog.
- Fang, H.¹

24
- 84969584486
- Batch normalization: Accelerating deep network training by reducing internal covariate shift
- S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015.
- (2015) Proc. Int. Conf. Mach. Learn.
- Ioffe, S.¹ Szegedy, C.²

25
- 0031573117
- Long short-term memory
- S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
- (1997) Neural Comput. , vol.9 , Issue.8 , pp. 1735-1780
- Hochreiter, S.¹ Schmidhuber, J.²

26
- 84906924453
- Multimodal neural language models
- R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Multimodal neural language models," in Proc. Neural Inf. Process. Syst. Deep Learn. Workshop, 2013.
- (2013) Proc. Neural Inf. Process. Syst. Deep Learn. Workshop
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

27
- 84951072975
- arXiv:1410.1090
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, "Explain images with multimodal recurrent neural networks," arXiv:1410.1090, 2014.
- (2014) Explain Images with Multimodal Recurrent Neural Networks
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Yuille, A.⁵

28
- 85083950512
- Deep captioning with multimodal recurrent neural networks (m-RNN)
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," Int. Conf. Learn. Representations, 2015.
- (2015) Int. Conf. Learn. Representations
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Yuille, A.⁵

29
- 84952349298
- Unifying visual-semantic embeddings with multimodal neural language models
- R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," Trans. Assoc. Comput. Linguistics, 2015.
- (2015) Trans. Assoc. Comput. Linguistics
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

30
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
- (2015) Proc. IEEE Conf. Comput. Vis. Pattern Recog.
- Donahue, J.¹

31
- 84970002232
- Show, attend and tell: Neural image caption generation with visual attention
- K. Xu, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Mach. Learn., 2015.
- (2015) Proc. Int. Conf. Mach. Learn.
- Xu, K.¹

32
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- A. Karpathy and F.-F. Li, "Deep visual-semantic alignments for generating image descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
- (2015) Proc. IEEE Conf. Comput. Vis. Pattern Recog.
- Karpathy, A.¹ Li, F.-F.²

33
- 85015796277
- Mind's eye: A recurrent visual representation for image caption generation
- X. Chen and C. L. Zitnick, "Mind's eye: A recurrent visual representation for image caption generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015.
- (2015) Proc. IEEE Conf. Comput. Vis. Pattern Recog.
- Chen, X.¹ Zitnick, C.L.²

34
- 84944096380
- Language models for image captioning: The quirks and what works
- J. Devlin, "Language models for image captioning: The quirks and what works," in Proc. Assoc. Comput. Linguistics, 2015.
- (2015) Proc. Assoc. Comput. Linguistics
- Devlin, J.¹

35
- 84919881041
- Decaf: A deep convolutional activation feature for generic visual recognition
- J. Donahue, et al., "Decaf: A deep convolutional activation feature for generic visual recognition," in Proc. Int. Conf. Mach. Learn., 2014.
- (2014) Proc. Int. Conf. Mach. Learn.
- Donahue, J.¹

36
- 85083951332
- Efficient estimation of word representations in vector space
- T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Int. Conf. Learn. Representations, 2013.
- (2013) Int. Conf. Learn. Representations
- Mikolov, T.¹ Chen, K.² Corrado, G.³ Dean, J.⁴

37
- 84906979661
- arXiv:1308.0850
- A. Graves, "Generating sequences with recurrent neural networks," arXiv:1308.0850, 2013.
- (2013) Generating Sequences with Recurrent Neural Networks
- Graves, A.¹

38
- 85133336275
- BLEU: A method for automatic evaluation of machine translation
- K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, 2002, pp. 311-318.
- (2002) Proc. 40th Annu. Meeting Assoc. Comput. Linguistics , pp. 311-318
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.J.⁴

39
- 84959197551
- arXiv:1411.5726
- R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," arXiv:1411.5726, 2015.
- (2015) CIDEr: Consensus-based Image Description Evaluation
- Vedantam, R.¹ Zitnick, C.L.² Parikh, D.³

40
- 85116156579
- Meteor: An automatic metric for MT evaluation with improved correlation with human judgments
- S. Banerjee and A. Lavie, "Meteor: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization, vol. 29, 2005, pp. 65-72.
- (2005) Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization , vol.29 , pp. 65-72
- Banerjee, S.¹ Lavie, A.²

41
- 26944501715
- Rouge: A package for automatic evaluation of summaries
- C.-Y. Lin, "Rouge: A package for automatic evaluation of summaries," in Proc. ACL Workshop Text Summarization Branches Out, vol. 8, 2004.
- (2004) Proc. ACL Workshop Text Summarization Branches Out , vol.8
- Lin, C.-Y.¹

42
- 85090348677
- Collecting image annotations using Amazon's Mechanical Turk
- C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in Proc. NAACL HLT Workshop Creating Speech Lang. Data Amazon's Mech. Turk, 2010, pp. 139-147.
- (2010) Proc. NAACL HLT Workshop Creating Speech Lang. Data Amazon's Mech. Turk , pp. 139-147
- Rashtchian, C.¹ Young, P.² Hodosh, M.³ Hockenmaier, J.⁴

43
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," Trans. Assoc. Comput. Linguistics, vol. 2, pp. 67-78, 2014.
- (2014) Trans. Assoc. Comput. Linguistics , vol.2 , pp. 67-78
- Young, P.¹ Lai, A.² Hodosh, M.³ Hockenmaier, J.⁴

44
- 84937834115
- arXiv:1405.0312
- T.-Y. Lin, et al., "Microsoft COCO: Common objects in context," arXiv:1405.0312, 2014.
- (2014) Microsoft COCO: Common Objects in Context
- Lin, T.-Y.¹

45
- 84944053926
- arXiv:1409.2329
- W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv:1409.2329, 2014.
- (2014) Recurrent Neural Network Regularization
- Zaremba, W.¹ Sutskever, I.² Vinyals, O.³

46
- 84946747440
- Show and tell: A neural image caption generator
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2015, pp. 3156-3164.
- (2015) Proc. IEEE Conf. Comput. Vis. Pattern Recog. , pp. 3156-3164
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

47
- 84965179228
- Scheduled sampling for sequence prediction with recurrent neural networks
- S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, "Scheduled sampling for sequence prediction with recurrent neural networks," Adv. Neural Inf. Process. Syst., pp. 1171-1179, 2015.
- (2015) Adv. Neural Inf. Process. Syst. , pp. 1171-1179
- Bengio, S.¹ Vinyals, O.² Jaitly, N.³ Shazeer, N.⁴

48
- 84964983441
- arXiv:1409.4842
- C. Szegedy, et al., "Going deeper with convolutions," arXiv:1409.4842, 2014.
- (2014) Going Deeper with Convolutions
- Szegedy, C.¹

49
- 0030211964
- Bagging predictors
- L. Breiman, "Bagging predictors," Mach. Learn., vol. 24, pp. 123-140, 1996.
- (1996) Mach. Learn. , vol.24 , pp. 123-140
- Breiman, L.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.