SCOPUS Information Search Platform

3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings

Volume , Issue , 2015, Pages

Deep captioning with multimodal recurrent neural networks (m-RNN)

(6) Mao, Junhua a Xu, Wei b Yang, Yi b Wang, Jiang b Huang, Zhiheng b Yuille, Alan a

a UNIVERSITY OF CALIFORNIA (United States)

b BAIDU INC (China)

Author keywords

[No Author keywords available]

Indexed keywords

IMAGE ENHANCEMENT; PROBABILITY DISTRIBUTIONS;

BENCHMARK DATASETS; CONVOLUTIONAL NETWORKS; DIRECTLY MODEL; IMAGE CAPTION; MULTI-MODAL; OBJECTIVE FUNCTIONS; STATE-OF-THE-ART METHODS; SUB-NETWORK;

RECURRENT NEURAL NETWORKS;

EID: 85083950512 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: None Document Type: Conference Paper

Times cited : (628)

References (48)

1
- 0041876117
- Matching words and pictures
- Barnard, Kobus, Duygulu, Pinar, Forsyth, David, De Freitas, Nando, Blei, David M, and Jordan, Michael I. Matching words and pictures. JMLR, 3:1107–1135, 2003.
- (2003) JMLR , vol.3 , pp. 1107-1135
- Barnard, K.¹ Duygulu, P.² Forsyth, D.³ De Freitas, N.⁴ Blei, D.M.⁵ Jordan, M.I.⁶

2
- 84952349295
- arXiv preprint
- Chen, X., Fang, H., Lin, TY, Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- (2015) Microsoft Coco Captions: Data Collection and Evaluation Server
- Chen, X.¹ Fang, H.² Lin, T.Y.³ Vedantam, R.⁴ Gupta, S.⁵ Dollár, P.⁶ Zitnick, C.L.⁷

3
- 84944115859
- arXiv preprint
- Chen, Xinlei and Zitnick, C Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
- (2014) Learning A Recurrent Visual Representation for Image Caption Generation
- Chen, X.¹ Zitnick, C.L.²

4
- 84919728106
- arXiv preprint
- Cho, Kyunghyun, van Merrienboer, Bart, Gulcehre, Caglar, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- (2014) Learning Phrase Representations Using Rnn Encoder-Decoder for Statistical Machine Translation
- Cho, K.¹ van Merrienboer, B.² Gulcehre, C.³ Bougares, F.⁴ Schwenk, H.⁵ Bengio, Y.⁶

5
- 84952349296
- arXiv preprint
- Devlin, Jacob, Cheng, Hao, Fang, Hao, Gupta, Saurabh, Deng, Li, He, Xiaodong, Zweig, Geoffrey, and Mitchell, Margaret. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015a.
- (2015) Language Models for Image Captioning: The Quirks and What Works
- Devlin, J.¹ Cheng, H.² Fang, H.³ Gupta, S.⁴ Deng, L.⁵ He, X.⁶ Zweig, G.⁷ Mitchell, M.⁸

6
- 84965102873
- arXiv preprint
- Devlin, Jacob, Gupta, Saurabh, Girshick, Ross, Mitchell, Margaret, and Zitnick, C Lawrence. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015b.
- (2015) Exploring Nearest Neighbor Approaches for Image Captioning
- Devlin, J.¹ Gupta, S.² Girshick, R.³ Mitchell, M.⁴ Zitnick, C.L.⁵

7
- 84944046597
- arXiv preprint
- Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
- (2014) Long-Term Recurrent Convolutional Networks for Visual Recognition and Description
- Donahue, J.¹ Hendricks, L.A.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

8
- 26444565569
- Finding structure in time
- Elman, Jeffrey L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
- (1990) Cognitive Science , vol.14 , Issue.2 , pp. 179-211
- Elman, J.L.¹

9
- 84944115860
- arXiv preprint
- Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
- (2014) From Captions to Visual Concepts and Back
- Fang, H.¹ Gupta, S.² Iandola, F.³ Srivastava, R.⁴ Deng, L.⁵ Dollár, P.⁶ Gao, J.⁷ He, X.⁸ Mitchell, M.⁹ Platt, J.¹⁰

10
- 78149311145
- Every picture tells a story: Generating sentences from images
- Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: Generating sentences from images. In ECCV, pp. 15–29. 2010.
- (2010) ECCV , pp. 15-29
- Farhadi, A.¹ Hejrati, M.² Sadeghi, M.A.³ Young, P.⁴ Rashtchian, C.⁵ Hockenmaier, J.⁶ Forsyth, D.⁷

11
- 84898958665
- Devise: A deep visual-semantic embedding model
- Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. Devise: A deep visual-semantic embedding model. In NIPS, pp. 2121–2129, 2013.
- (2013) NIPS , pp. 2121-2129
- Frome, A.¹ Corrado, G.S.² Shlens, J.³ Bengio, S.⁴ Dean, J.⁵ Mikolov, T.⁶

12
- 84911400494
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- (2014) CVPR
- Girshick, R.¹ Donahue, J.² Darrell, T.³ Malik, J.⁴

13
- 38049183286
- The iapr tc-12 benchmark: A new evaluation resource for visual information systems
- Grubinger, Michael, Clough, Paul, Müller, Henning, and Deselaers, Thomas. The iapr tc-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pp. 13–23, 2006.
- (2006) International Workshop OntoImage , pp. 13-23
- Grubinger, M.¹ Clough, P.² Müller, H.³ Deselaers, T.⁴

14
- 78149341381
- Multiple instance metric learning from automatically labeled bags of faces
- Guillaumin, Matthieu, Verbeek, Jakob, and Schmid, Cordelia. Multiple instance metric learning from automatically labeled bags of faces. In ECCV, pp. 634–647, 2010.
- (2010) ECCV , pp. 634-647
- Guillaumin, M.¹ Verbeek, J.² Schmid, C.³

15
- 84973931408
- From image annotation to image description
- Gupta, Ankush and Mannem, Prashanth. From image annotation to image description. In ICONIP, 2012.
- (2012) ICONIP
- Gupta, A.¹ Mannem, P.²

16
- 85059866463
- Choosing linguistics over vision to describe images
- Gupta, Ankush, Verma, Yashaswi, and Jawahar, CV. Choosing linguistics over vision to describe images. In AAAI, 2012.
- (2012) AAAI
- Gupta, A.¹ Verma, Y.² Jawahar, C.V.³

17
- 0031573117
- Long short-term memory
- Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997.
- (1997) Neural Computation , vol.9 , Issue.8 , pp. 1735-1780
- Hochreiter, S.¹ Schmidhuber, J.²

18
- 84883394520
- Framing image description as a ranking task: Data, models and evaluation metrics
- Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899, 2013.
- (2013) JAIR , vol.47 , pp. 853-899
- Hodosh, M.¹ Young, P.² Hockenmaier, J.³

19
- 84856653718
- Learning cross-modality similarity for multinomial data
- Jia, Yangqing, Salzmann, Mathieu, and Darrell, Trevor. Learning cross-modality similarity for multinomial data. In ICCV, pp. 2407–2414, 2011.
- (2011) ICCV , pp. 2407-2414
- Jia, Y.¹ Salzmann, M.² Darrell, T.³

20
- 84926283798
- Recurrent continuous translation models
- Kalchbrenner, Nal and Blunsom, Phil. Recurrent continuous translation models. In EMNLP, pp. 1700–1709, 2013.
- (2013) EMNLP , pp. 1700-1709
- Kalchbrenner, N.¹ Blunsom, P.²

21
- 84942676733
- arXiv preprint
- Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306, 2014.
- (2014) Deep Visual-Semantic Alignments for Generating Image Descriptions
- Karpathy, A.¹ Fei-Fei, L.²

22
- 84959252592
- Karpathy, Andrej, Joulin, Armand, and Fei-Fei, Li. Deep fragment embeddings for bidirectional image sentence mapping. In arXiv:1406.5679, 2014.
- (2014) Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
- Karpathy, A.¹ Joulin, A.² Fei-Fei, L.³

23
- 84944113729
- arXiv preprint
- Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014a.
- (2014) Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

24
- 84919921461
- Multimodal neural language models
- Kiros, Ryan, Zemel, R, and Salakhutdinov, Ruslan. Multimodal neural language models. In ICML, 2014b.
- (2014) ICML
- Kiros, R.¹ Zemel, R.² Salakhutdinov, R.³

25
- 84876231242
- Imagenet classification with deep convolutional neural networks
- Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
- (2012) NIPS , pp. 1097-1105
- Krizhevsky, A.¹ Sutskever, I.² Hinton, G.E.³

26
- 80052901011
- Baby talk: Understanding and generating image descriptions
- Kulkarni, Girish, Premraj, Visruth, Dhar, Sagnik, Li, Siming, Choi, Yejin, Berg, Alexander C, and Berg, Tamara L. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
- (2011) CVPR
- Kulkarni, G.¹ Premraj, V.² Dhar, S.³ Li, S.⁴ Choi, Y.⁵ Berg, A.C.⁶ Berg, T.L.⁷

27
- 84934873221
- TreeTalk: Composition and compression of trees for image descriptions
- Kuznetsova, Polina, Ordonez, Vicente, Berg, Tamara L, and Choi, Yejin. Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014.
- (2014) Transactions of the Association for Computational Linguistics , vol.2 , Issue.10 , pp. 351-362
- Kuznetsova, P.¹ Ordonez, V.² Berg, T.L.³ Choi, Y.⁴

28
- 84872543023
- Efficient backprop
- Springer
- LeCun, Yann A, Bottou, Léon, Orr, Genevieve B, and Müller, Klaus-Robert. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9–48. Springer, 2012.
- (2012) Neural Networks: Tricks of the Trade , pp. 9-48
- LeCun, Y.A.¹ Bottou, L.² Orr, G.B.³ Müller, K.-R.⁴

29
- 84937834115
- arXiv preprint
- Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C Lawrence. Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
- (2014) Microsoft Coco: Common Objects in Context
- Lin, T.-Y.¹ Maire, M.² Belongie, S.³ Hays, J.⁴ Perona, P.⁵ Ramanan, D.⁶ Dollár, P.⁷ Zitnick, C.L.⁸

30
- 84951072975
- Explain images with multimodal recurrent neural networks
- Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan L. Explain images with multimodal recurrent neural networks. NIPS DeepLearning Workshop, 2014.
- (2014) NIPS DeepLearning Workshop
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Yuille, A.L.⁵

31
- 84965160495
- arXiv preprint
- Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, Huang, Zhiheng, and Yuille, Alan. Learning like a child: Fast novel visual concept learning from sentence descriptions of images. arXiv preprint arXiv:1504.06692, 2015.
- (2015) Learning like A Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Huang, Z.⁵ Yuille, A.⁶

32
- 79959829092
- Recurrent neural network based language model
- Mikolov, Tomas, Karafiát, Martin, Burget, Lukas, Cernocky, Jan, and Khudanpur, Sanjeev. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.
- (2010) INTERSPEECH , pp. 1045-1048
- Mikolov, T.¹ Karafiát, M.² Burget, L.³ Cernocky, J.⁴ Khudanpur, S.⁵

33
- 80051643236
- Extensions of recurrent neural network language model
- Mikolov, Tomas, Kombrink, Stefan, Burget, Lukas, Cernocky, JH, and Khudanpur, Sanjeev. Extensions of recurrent neural network language model. In ICASSP, pp. 5528–5531, 2011.
- (2011) ICASSP , pp. 5528-5531
- Mikolov, T.¹ Kombrink, S.² Burget, L.³ Cernocky, J.H.⁴ Khudanpur, S.⁵

34
- 84898956512
- Distributed representations of words and phrases and their compositionality
- Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
- (2013) NIPS , pp. 3111-3119
- Mikolov, T.¹ Sutskever, I.² Chen, K.³ Corrado, G.S.⁴ Dean, J.⁵

35
- 85034832841
- Midge: Generating image descriptions from computer vision detections
- Mitchell, Margaret, Han, Xufeng, Dodge, Jesse, Mensch, Alyssa, Goyal, Amit, Berg, Alex, Yamaguchi, Kota, Berg, Tamara, Stratos, Karl, and Daumé III, Hal. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
- (2012) EACL
- Mitchell, M.¹ Han, X.² Dodge, J.³ Mensch, A.⁴ Goyal, A.⁵ Berg, A.⁶ Yamaguchi, K.⁷ Berg, T.⁸ Stratos, K.⁹ Daumé, H.¹⁰

36
- 34547970628
- Three new graphical models for statistical language modelling
- ACM
- Mnih, Andriy and Hinton, Geoffrey. Three new graphical models for statistical language modelling. In ICML, pp. 641–648. ACM, 2007.
- (2007) ICML , pp. 641-648
- Mnih, A.¹ Hinton, G.²

37
- 77956509090
- Rectified linear units improve restricted boltzmann machines
- Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In ICML, pp. 807–814, 2010.
- (2010) ICML , pp. 807-814
- Nair, V.¹ Hinton, G.E.²

38
- 85133336275
- BLEU: A method for automatic evaluation of machine translation
- Papineni, Kishore, Roukos, Salim, Ward, Todd, and Zhu, Wei-Jing. Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318, 2002.
- (2002) ACL , pp. 311-318
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.-J.⁴

39
- 85090348677
- Collecting image annotations using amazon’s mechanical turk
- Rashtchian, Cyrus, Young, Peter, Hodosh, Micah, and Hockenmaier, Julia. Collecting image annotations using amazon’s mechanical turk. In NAACL-HLT workshop 2010, pp. 139–147, 2010.
- (2010) NAACL-HLT Workshop 2010 , pp. 139-147
- Rashtchian, C.¹ Young, P.² Hodosh, M.³ Hockenmaier, J.⁴

40
- 84921817164
- Learning representations by back-propagating errors
- Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Cognitive modeling, 1988.
- (1988) Cognitive Modeling
- Rumelhart, D.E.¹ Hinton, G.E.² Williams, R.J.³

41
- 84909978410
- Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge, 2014.
- (2014) ImageNet Large Scale Visual Recognition Challenge
- Russakovsky, O.¹ Deng, J.² Su, H.³ Krause, J.⁴ Satheesh, S.⁵ Ma, S.⁶ Huang, Z.⁷ Karpathy, A.⁸ Khosla, A.⁹ Bernstein, M.¹⁰ Berg, A.C.¹¹ Fei-Fei, L.¹²

42
- 84925410541
- arXiv preprint
- Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition
- Simonyan, K.¹ Zisserman, A.²

43
- 84906925854
- Grounded compositional semantics for finding and describing images with sentences
- Socher, Richard, Le, Q, Manning, C, and Ng, A. Grounded compositional semantics for finding and describing images with sentences. In TACL, 2014.
- (2014) TACL
- Socher, R.¹ Le, Q.² Manning, C.³ Ng, A.⁴

44
- 84877724347
- Multimodal learning with deep boltzmann machines
- Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep boltzmann machines. In NIPS, pp. 2222–2230, 2012.
- (2012) NIPS , pp. 2222-2230
- Srivastava, N.¹ Salakhutdinov, R.²

45
- 84928547704
- Sequence to sequence learning with neural networks
- Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112, 2014.
- (2014) NIPS , pp. 3104-3112
- Sutskever, I.¹ Vinyals, O.² Le, Q.V.³

46
- 84959197551
- arXiv preprint
- Vedantam, Ramakrishna, Zitnick, C Lawrence, and Parikh, Devi. Cider: Consensus-based image description evaluation. arXiv preprint arXiv:1411.5726, 2014.
- (2014) Cider: Consensus-Based Image Description Evaluation
- Vedantam, R.¹ Zitnick, C.L.² Parikh, D.³

47
- 84939821075
- arXiv preprint
- Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
- (2014) Show and Tell: A Neural Image Caption Generator
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

48
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In ACL, pp. 479–488, 2014.
- (2014) ACL , pp. 479-488
- Young, P.¹ Lai, A.² Hodosh, M.³ Hockenmaier, J.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.