[2] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137-186. Springer, 2006.
[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[4] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255. IEEE, 2009.
[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[9] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.
[10] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[11] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15-29. Springer, 2010.
[12] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121-2129, 2013.
[14] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, pages 529-545, 2014.
[16] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47:853-899, 2013.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] M. A. Just, S. D. Newman, T. A. Keller, A. McEleney, and P. A. Carpenter. Imagery in sentence comprehension: An fMRI study. NeuroImage, 21(1):112-124, 2004.
[19] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[24] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, pages 1601-1608. IEEE, 2011.
[26] L. R. Lieberman and J. T. Culpepper. Words versus objects: Comparison of free verbal recall. Psychological Reports, 17(3):983-988, 1965.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[28] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
[31] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In ASRU, pages 196-201. IEEE, 2011.
[32] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. In SLT, pages 234-239, 2012.
[33] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, pages 747-756. Association for Computational Linguistics, 2012.
[34] A. Paivio, T. B. Rogers, and P. C. Smythe. Why are pictures easier to recall than words? Psychonomic Science, 11(4):137-138, 1968.
[35] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, 2002.
[38] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In NIPS Deep Learning Workshop, 2013.
[40] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[41] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM, 2008.
[43] R. J. Williams and D. Zipser. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87-111, 1989.
[44] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[45] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
[46] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485-1508, 2010.