[1] L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[3] N. Ballas, L. Yao, C. Pal, and A. Courville. Delving deeper into convolutional networks for learning video representations. In ICLR, 2016.
[4] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
[5] D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
[6] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
[7] X. Chen and C. Lawrence Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[8] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[9] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. In ACL, 2015.
[10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[11] J. Dong, X. Li, W. Lan, Y. Huo, and C. G. Snoek. Early embedding and late reranking for video captioning. In ACMMM, 2016.
[13] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[14] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. StyleNet: Generating attractive visual captions with styles. In CVPR, 2017.
[15] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generalization. In CVPR, 2016.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[18] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars. Guiding long-short term memory for image caption generation. In ICCV, 2015.
[19] J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv:1506.06272, 2015.
[20] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[25] R. Kiros, R. Zemel, and R. R. Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In NIPS, 2014.
[26] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop, 2004.
[27] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[30] R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.
[31] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[32] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
[33] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
[34] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In NIPS, 2016.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[37] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
[38] J. Song, Z. Gan, and L. Carin. Factored temporal sigmoid belief networks for sequence learning. In ICML, 2016.
[39] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[40] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[41] G. W. Taylor and G. E. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.
[43] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[44] K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz. Rich image captioning in the wild. In CVPR Workshops, 2016.
[46] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV, 2015.
[47] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, 2015.
[49] Q. Wu, C. Shen, L. Liu, A. Dick, and A. v. d. Hengel. What value do explicit high level concepts have in vision to language problems? In CVPR, 2016.
[50] Y. Wu, S. Zhang, Y. Zhang, Y. Bengio, and R. Salakhutdinov. On multiplicative integration with recurrent neural networks. In NIPS, 2016.
[51] J. Xu, T. Mei, T. Yao, and Y. Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
[52] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[53] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. Review networks for caption generation. In NIPS, 2016.
[54] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
[55] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
[56] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR, 2016.