SCOPUS Information Retrieval Platform

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Volume 2016-December, 2016, Pages 5288-5296

MSR-VTT: A large video description dataset for bridging video and language

Authors (4): Xu, Jun (a); Mei, Tao (a); Yao, Ting (a); Rui, Yong (a)

(a) Microsoft Research (United States)

Author keywords

[No Author keywords available]

Indexed keywords

MULTIMEDIA SYSTEMS; NATURAL LANGUAGE PROCESSING SYSTEMS; PATTERN RECOGNITION; SEARCH ENGINES;

BENCHMARK DATASETS; COMPUTER VISION ALGORITHMS; GENERALIZATION CAPABILITY; MOTION REPRESENTATION; NATURAL LANGUAGES; NETWORK-BASED APPROACH; STATE OF THE ART; VIDEO UNDERSTANDING;

COMPUTER VISION;

EID: 84986260127
PISSN: 10636919
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1109/CVPR.2016.571
Document Type: Conference Paper

Times cited: 2111

References (42)

1
- 85116156579
- METEOR: An automatic metric for MT evaluation with improved correlation with human judgments
- S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of ACL Workshop, pages 65-72, 2005.
- (2005) Proceedings of ACL Workshop , pp. 65-72
- Banerjee, S.¹ Lavie, A.²

2
- 84885996388
- Video in sentences out
- A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. Proceedings of UAI, 2012.
- (2012) Proceedings of UAI
- Barbu, A.¹ Bridge, A.² Burchill, Z.³ Coroian, D.⁴ Dickinson, S.⁵ Fidler, S.⁶ Michaux, A.⁷ Mussman, S.⁸ Narayanaswamy, S.⁹ Salvi, D.¹⁰

3
- 84859089502
- Collecting highly parallel data for paraphrase evaluation
- D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL, pages 190-200, 2011.
- (2011) Proceedings of ACL , pp. 190-200
- Chen, D.L.¹ Dolan, W.B.²

4
- 84957029470
- Mind's Eye: A recurrent visual representation for image caption generation
- X. Chen and C. L. Zitnick. Mind's Eye: A recurrent visual representation for image caption generation. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Chen, X.¹ Zitnick, C.L.²

5
- 84887345951
- A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching
- P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of CVPR, pages 2634-2641, 2013.
- (2013) Proceedings of CVPR , pp. 2634-2641
- Das, P.¹ Xu, C.² Doell, R.F.³ Corso, J.J.⁴

6
- 72249100259
- ImageNet: A large-scale hierarchical image database
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248-255, 2009.
- (2009) Proceedings of CVPR , pp. 248-255
- Deng, J.¹ Dong, W.² Socher, R.³ Li, L.-J.⁴ Li, K.⁵ Fei-Fei, L.⁶

7
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Donahue, J.¹ Hendricks, L.A.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

8
- 84959250180
- From captions to visual concepts and back
- H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Fang, H.¹ Gupta, S.² Iandola, F.³ Srivastava, R.⁴ Deng, L.⁵ Dollár, P.⁶ Gao, J.⁷ He, X.⁸ Mitchell, M.⁹ Platt, J.¹⁰

9
- 78149311145
- Every picture tells a story: Generating sentences from images
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of ECCV, pages 15-29, 2010.
- (2010) Proceedings of ECCV , pp. 15-29
- Farhadi, A.¹ Hejrati, M.² Sadeghi, M.A.³ Young, P.⁴ Rashtchian, C.⁵ Hockenmaier, J.⁶ Forsyth, D.⁷

10
- 84870183903
- 3D convolutional neural networks for human action recognition
- S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(1):221-231, 2013.
- (2013) IEEE Trans. on Pattern Analysis and Machine Intelligence , vol.35 , Issue.1 , pp. 221-231
- Ji, S.¹ Xu, W.² Yang, M.³ Yu, K.⁴

11
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Karpathy, A.¹ Fei-Fei, L.²

12
- 84911364368
- Large-scale video classification with convolutional neural networks
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of CVPR, pages 1725-1732, 2014.
- (2014) Proceedings of CVPR , pp. 1725-1732
- Karpathy, A.¹ Toderici, G.² Shetty, S.³ Leung, T.⁴ Sukthankar, R.⁵ Fei-Fei, L.⁶

13
- 84929363334
- Multimodal neural language models
- R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In Proceedings of ICML, pages 595-603, 2014.
- (2014) Proceedings of ICML , pp. 595-603
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.³

14
- 84952349298
- Unifying visual-semantic embeddings with multimodal neural language models
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
- (2015) TACL
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

15
- 0036843382
- Natural language description of human activities from video images based on concept hierarchy of actions
- A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2):171-184, 2002.
- (2002) International Journal of Computer Vision , vol.50 , Issue.2 , pp. 171-184
- Kojima, A.¹ Tamura, T.² Fukunaga, K.³

16
- 84893398951
- Generating natural-language video descriptions using text-mined knowledge
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In Proceedings of AAAI, 2013.
- (2013) Proceedings of AAAI
- Krishnamoorthy, N.¹ Malkarnenkar, G.² Mooney, R.J.³ Saenko, K.⁴ Guadarrama, S.⁵

17
- 84876231242
- ImageNet classification with deep convolutional neural networks
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1097-1105, 2012.
- (2012) Proceedings of NIPS , pp. 1097-1105
- Krizhevsky, A.¹ Sutskever, I.² Hinton, G.E.³

18
- 84887601544
- Babytalk: Understanding and generating simple image descriptions
- G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(12):2891-2903, 2013.
- (2013) IEEE Trans. on Pattern Analysis and Machine Intelligence , vol.35 , Issue.12 , pp. 2891-2903
- Kulkarni, G.¹ Premraj, V.² Ordonez, V.³ Dhar, S.⁴ Li, S.⁵ Choi, Y.⁶ Berg, A.C.⁷ Berg, T.⁸

19
- 84970028761
- Phrase-based image captioning
- R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. Proceedings of ICML, 2015.
- (2015) Proceedings of ICML
- Lebret, R.¹ Pinheiro, P.O.² Collobert, R.³

20
- 84862279067
- Composing simple image descriptions using web-scale N-grams
- S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale N-grams. In Proceedings of International Conference on Computational Natural Language Learning, pages 220-228, 2011.
- (2011) Proceedings of International Conference on Computational Natural Language Learning , pp. 220-228
- Li, S.¹ Kulkarni, G.² Berg, T.L.³ Berg, A.C.⁴ Choi, Y.⁵

21
- 84906493406
- Microsoft COCO: Common objects in context
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of ECCV, pages 740-755, 2014.
- (2014) Proceedings of ECCV , pp. 740-755
- Lin, T.-Y.¹ Maire, M.² Belongie, S.³ Hays, J.⁴ Perona, P.⁵ Ramanan, D.⁶ Dollár, P.⁷ Zitnick, C.L.⁸

22
- 85083950512
- Deep captioning with multimodal recurrent neural networks (m-RNN)
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proceedings of ICLR, 2015.
- (2015) Proceedings of ICLR
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Yuille, A.L.⁵

23
- 84893956152
- Multimedia search reranking: A literature survey
- T. Mei, Y. Rui, S. Li, and Q. Tian. Multimedia search reranking: A literature survey. ACM Computing Surveys (CSUR), 46(3):38, 2014.
- (2014) ACM Computing Surveys (CSUR) , vol.46 , Issue.3 , art. 38
- Mei, T.¹ Rui, Y.² Li, S.³ Tian, Q.⁴

24
- 85133336275
- BLEU: A method for automatic evaluation of machine translation
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318, 2002.
- (2002) Proceedings of ACL , pp. 311-318
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.-J.⁴

25
- 84898785648
- Grounding action descriptions in videos
- M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25-36, 2013.
- (2013) Transactions of the Association for Computational Linguistics , vol.1 , pp. 25-36
- Regneri, M.¹ Rohrbach, M.² Wetzel, D.³ Thater, S.⁴ Schiele, B.⁵ Pinkal, M.⁶

26
- 84908670256
- Coherent multi-sentence video description with variable level of detail
- A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. Pattern Recognition, pages 184-195, 2014.
- (2014) Pattern Recognition , pp. 184-195
- Rohrbach, A.¹ Rohrbach, M.² Qiu, W.³ Friedrich, A.⁴ Pinkal, M.⁵ Schiele, B.⁶

27
- 84959211977
- A dataset for movie description
- A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Rohrbach, A.¹ Rohrbach, M.² Tandon, N.³ Schiele, B.⁴

28
- 84898775239
- Translating video content to natural language descriptions
- M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of ICCV, pages 433-440, 2013.
- (2013) Proceedings of ICCV , pp. 433-440
- Rohrbach, M.¹ Qiu, W.² Titov, I.³ Thater, S.⁴ Pinkal, M.⁵ Schiele, B.⁶

29
- 85083953063
- Very deep convolutional networks for large-scale image recognition
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR, 2015.
- (2015) Proceedings of ICLR
- Simonyan, K.¹ Zisserman, A.²

30
- 84973888835
- Automatic concept discovery from parallel text and visual corpora
- C. Sun, C. Gan, and R. Nevatia. Automatic concept discovery from parallel text and visual corpora. In ICCV, pages 2596-2604, 2015.
- (2015) ICCV , pp. 2596-2604
- Sun, C.¹ Gan, C.² Nevatia, R.³

31
- 84937522268
- Going deeper with convolutions
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Szegedy, C.¹ Liu, W.² Jia, Y.³ Sermanet, P.⁴ Reed, S.⁵ Anguelov, D.⁶ Erhan, D.⁷ Vanhoucke, V.⁸ Rabinovich, A.⁹

32
- 84959246420
- Using descriptive video services to create a large data source for video annotation research
- A. Torabi, C. J. Pal, H. Larochelle, and A. C. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070, 2015.
- (2015) arXiv:1503.01070
- Torabi, A.¹ Pal, C.J.² Larochelle, H.³ Courville, A.C.⁴

33
- 84969504307
- C3D: Generic features for video analysis
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: generic features for video analysis. In Proceedings of ICCV, 2015.
- (2015) Proceedings of ICCV
- Tran, D.¹ Bourdev, L.² Fergus, R.³ Torresani, L.⁴ Paluri, M.⁵

34
- 84973882730
- Sequence to sequence - video to text
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In Proceedings of ICCV, 2015.
- (2015) Proceedings of ICCV
- Venugopalan, S.¹ Rohrbach, M.² Donahue, J.³ Mooney, R.⁴ Darrell, T.⁵ Saenko, K.⁶

35
- 84959876769
- Translating videos to natural language using deep recurrent neural networks
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In Proceedings of ACL, 2015.
- (2015) Proceedings of ACL
- Venugopalan, S.¹ Xu, H.² Donahue, J.³ Rohrbach, M.⁴ Mooney, R.⁵ Saenko, K.⁶

36
- 84946747440
- Show and Tell: A neural image caption generator
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A neural image caption generator. In Proceedings of CVPR, 2015.
- (2015) Proceedings of CVPR
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

37
- 84970002232
- Show, attend and tell: Neural image caption generation with visual attention
- K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, 2015.
- (2015) Proceedings of ICML
- Xu, K.¹ Ba, J.² Kiros, R.³ Courville, A.⁴ Salakhutdinov, R.⁵ Zemel, R.⁶ Bengio, Y.⁷

38
- 80053258778
- Corpus-guided sentence generation of natural images
- Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of Intl Conference on Empirical Methods in Natural Language Processing, pages 444-454, 2011.
- (2011) Proceedings of Intl Conference on Empirical Methods in Natural Language Processing , pp. 444-454
- Yang, Y.¹ Teo, C.L.² Daumé, H.³ Aloimonos, Y.⁴

39
- 84973884896
- Describing videos by exploiting temporal structure
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of ICCV, 2015.
- (2015) Proceedings of ICCV
- Yao, L.¹ Torabi, A.² Cho, K.³ Ballas, N.⁴ Pal, C.⁵ Larochelle, H.⁶ Courville, A.⁷

40
- 84887428782
- Annotation for free: Video tagging by mining user search behavior
- T. Yao, T. Mei, C.-W. Ngo, and S. Li. Annotation for free: Video tagging by mining user search behavior. In Proceedings of the 21st ACM international conference on Multimedia, pages 977-986. ACM, 2013.
- (2013) Proceedings of the 21st ACM International Conference on Multimedia , pp. 977-986
- Yao, T.¹ Mei, T.² Ngo, C.-W.³ Li, S.⁴

41
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.
- (2014) Transactions of the Association for Computational Linguistics , vol.2 , pp. 67-78
- Young, P.¹ Lai, A.² Hodosh, M.³ Hockenmaier, J.⁴

42
- 37848999908
- Building a comprehensive ontology to refine video concept detection
- Z.-J. Zha, T. Mei, Z. Wang, and X.-S. Hua. Building a comprehensive ontology to refine video concept detection. In Proceedings of the International Workshop on Multimedia Information Retrieval, pages 227-236. ACM, 2007.
- (2007) Proceedings of the International Workshop on Multimedia Information Retrieval , pp. 227-236
- Zha, Z.-J.¹ Mei, T.² Wang, Z.³ Hua, X.-S.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.