-
1
-
-
85116156579
-
METEOR: An automatic metric for MT evaluation with improved correlation with human judgments
-
S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of ACL Workshop, pages 65-72, 2005.
-
(2005)
Proceedings of ACL Workshop
, pp. 65-72
-
-
Banerjee, S.1
Lavie, A.2
-
2
-
-
84885996388
-
Video in sentences out
-
A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. Proceedings of UAI, 2012.
-
(2012)
Proceedings of UAI
-
-
Barbu, A.1
Bridge, A.2
Burchill, Z.3
Coroian, D.4
Dickinson, S.5
Fidler, S.6
Michaux, A.7
Mussman, S.8
Narayanaswamy, S.9
Salvi, D.10
-
3
-
-
84859089502
-
Collecting highly parallel data for paraphrase evaluation
-
D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL, pages 190-200, 2011.
-
(2011)
Proceedings of ACL
, pp. 190-200
-
-
Chen, D.L.1
Dolan, W.B.2
-
4
-
-
84957029470
-
Mind's Eye: A recurrent visual representation for image caption generation
-
X. Chen and C. L. Zitnick. Mind's Eye: A recurrent visual representation for image caption generation. In Proceedings of CVPR, 2015.
-
(2015)
Proceedings of CVPR
-
-
Chen, X.1
Zitnick, C.L.2
-
5
-
-
84887345951
-
A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching
-
P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of CVPR, pages 2634-2641, 2013.
-
(2013)
Proceedings of CVPR
, pp. 2634-2641
-
-
Das, P.1
Xu, C.2
Doell, R.F.3
Corso, J.J.4
-
6
-
-
72249100259
-
ImageNet: A large-scale hierarchical image database
-
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of CVPR, pages 248-255, 2009.
-
(2009)
Proceedings of CVPR
, pp. 248-255
-
-
Deng, J.1
Dong, W.2
Socher, R.3
Li, L.-J.4
Li, K.5
Fei-Fei, L.6
-
7
-
-
84959236502
-
Long-term recurrent convolutional networks for visual recognition and description
-
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of CVPR, 2015.
-
(2015)
Proceedings of CVPR
-
-
Donahue, J.1
Hendricks, L.A.2
Guadarrama, S.3
Rohrbach, M.4
Venugopalan, S.5
Saenko, K.6
Darrell, T.7
-
8
-
-
84959250180
-
From captions to visual concepts and back
-
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In Proceedings of CVPR, 2015.
-
(2015)
Proceedings of CVPR
-
-
Fang, H.1
Gupta, S.2
Iandola, F.3
Srivastava, R.4
Deng, L.5
Dollár, P.6
Gao, J.7
He, X.8
Mitchell, M.9
Platt, J.10
-
9
-
-
78149311145
-
Every picture tells a story: Generating sentences from images
-
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In Proceedings of ECCV, pages 15-29, 2010.
-
(2010)
Proceedings of ECCV
, pp. 15-29
-
-
Farhadi, A.1
Hejrati, M.2
Sadeghi, M.A.3
Young, P.4
Rashtchian, C.5
Hockenmaier, J.6
Forsyth, D.7
-
10
-
-
84870183903
-
3D convolutional neural networks for human action recognition
-
S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(1):221-231, 2013.
-
(2013)
IEEE Trans. on Pattern Analysis and Machine Intelligence
, vol.35
, Issue.1
, pp. 221-231
-
-
Ji, S.1
Xu, W.2
Yang, M.3
Yu, K.4
-
11
-
-
84946734827
-
Deep visual-semantic alignments for generating image descriptions
-
A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, 2015.
-
(2015)
Proceedings of CVPR
-
-
Karpathy, A.1
Fei-Fei, L.2
-
12
-
-
84911364368
-
Large-scale video classification with convolutional neural networks
-
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of CVPR, pages 1725-1732, 2014.
-
(2014)
Proceedings of CVPR
, pp. 1725-1732
-
-
Karpathy, A.1
Toderici, G.2
Shetty, S.3
Leung, T.4
Sukthankar, R.5
Fei-Fei, L.6
-
14
-
-
84952349298
-
Unifying visual-semantic embeddings with multimodal neural language models
-
R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. TACL, 2015.
-
(2015)
TACL
-
-
Kiros, R.1
Salakhutdinov, R.2
Zemel, R.S.3
-
15
-
-
0036843382
-
Natural language description of human activities from video images based on concept hierarchy of actions
-
A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2):171-184, 2002.
-
(2002)
International Journal of Computer Vision
, vol.50
, Issue.2
, pp. 171-184
-
-
Kojima, A.1
Tamura, T.2
Fukunaga, K.3
-
17
-
-
84876231242
-
ImageNet classification with deep convolutional neural networks
-
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1097-1105, 2012.
-
(2012)
Proceedings of NIPS
, pp. 1097-1105
-
-
Krizhevsky, A.1
Sutskever, I.2
Hinton, G.E.3
-
18
-
-
84887601544
-
BabyTalk: Understanding and generating simple image descriptions
-
G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. Berg. BabyTalk: Understanding and generating simple image descriptions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(12):2891-2903, 2013.
-
(2013)
IEEE Trans. on Pattern Analysis and Machine Intelligence
, vol.35
, Issue.12
, pp. 2891-2903
-
-
Kulkarni, G.1
Premraj, V.2
Ordonez, V.3
Dhar, S.4
Li, S.5
Choi, Y.6
Berg, A.C.7
Berg, T.8
-
20
-
-
84862279067
-
Composing simple image descriptions using web-scale N-grams
-
S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale N-grams. In Proceedings of International Conference on Computational Natural Language Learning, pages 220-228, 2011.
-
(2011)
Proceedings of International Conference on Computational Natural Language Learning
, pp. 220-228
-
-
Li, S.1
Kulkarni, G.2
Berg, T.L.3
Berg, A.C.4
Choi, Y.5
-
21
-
-
84906493406
-
Microsoft COCO: Common objects in context
-
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of ECCV, pages 740-755, 2014.
-
(2014)
Proceedings of ECCV
, pp. 740-755
-
-
Lin, T.-Y.1
Maire, M.2
Belongie, S.3
Hays, J.4
Perona, P.5
Ramanan, D.6
Dollár, P.7
Zitnick, C.L.8
-
22
-
-
85083950512
-
Deep captioning with multimodal recurrent neural networks (m-RNN)
-
J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In Proceedings of ICLR, 2015.
-
(2015)
Proceedings of ICLR
-
-
Mao, J.1
Xu, W.2
Yang, Y.3
Wang, J.4
Yuille, A.L.5
-
23
-
-
84893956152
-
Multimedia search reranking: A literature survey
-
T. Mei, Y. Rui, S. Li, and Q. Tian. Multimedia search reranking: A literature survey. ACM Computing Surveys (CSUR), 46(3):38, 2014.
-
(2014)
ACM Computing Surveys (CSUR)
, vol.46
, Issue.3
, pp. 38
-
-
Mei, T.1
Rui, Y.2
Li, S.3
Tian, Q.4
-
24
-
-
85133336275
-
BLEU: A method for automatic evaluation of machine translation
-
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318, 2002.
-
(2002)
Proceedings of ACL
, pp. 311-318
-
-
Papineni, K.1
Roukos, S.2
Ward, T.3
Zhu, W.-J.4
-
25
-
-
84898785648
-
Grounding action descriptions in videos
-
M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25-36, 2013.
-
(2013)
Transactions of the Association for Computational Linguistics
, vol.1
, pp. 25-36
-
-
Regneri, M.1
Rohrbach, M.2
Wetzel, D.3
Thater, S.4
Schiele, B.5
Pinkal, M.6
-
26
-
-
84908670256
-
Coherent multi-sentence video description with variable level of detail
-
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. Pattern Recognition, pages 184-195, 2014.
-
(2014)
Pattern Recognition
, pp. 184-195
-
-
Rohrbach, A.1
Rohrbach, M.2
Qiu, W.3
Friedrich, A.4
Pinkal, M.5
Schiele, B.6
-
28
-
-
84898775239
-
Translating video content to natural language descriptions
-
M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proceedings of ICCV, pages 433-440, 2013.
-
(2013)
Proceedings of ICCV
, pp. 433-440
-
-
Rohrbach, M.1
Qiu, W.2
Titov, I.3
Thater, S.4
Pinkal, M.5
Schiele, B.6
-
29
-
-
85083953063
-
Very deep convolutional networks for large-scale image recognition
-
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR, 2015.
-
(2015)
Proceedings of ICLR
-
-
Simonyan, K.1
Zisserman, A.2
-
30
-
-
84973888835
-
Automatic concept discovery from parallel text and visual corpora
-
C. Sun, C. Gan, and R. Nevatia. Automatic concept discovery from parallel text and visual corpora. In ICCV, pages 2596-2604, 2015.
-
(2015)
ICCV
, pp. 2596-2604
-
-
Sun, C.1
Gan, C.2
Nevatia, R.3
-
31
-
-
84937522268
-
Going deeper with convolutions
-
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of CVPR, 2015.
-
(2015)
Proceedings of CVPR
-
-
Szegedy, C.1
Liu, W.2
Jia, Y.3
Sermanet, P.4
Reed, S.5
Anguelov, D.6
Erhan, D.7
Vanhoucke, V.8
Rabinovich, A.9
-
33
-
-
84969504307
-
C3D: Generic features for video analysis
-
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features for video analysis. In Proceedings of ICCV, 2015.
-
(2015)
Proceedings of ICCV
-
-
Tran, D.1
Bourdev, L.2
Fergus, R.3
Torresani, L.4
Paluri, M.5
-
34
-
-
84973882730
-
Sequence to sequence - video to text
-
S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In Proceedings of ICCV, 2015.
-
(2015)
Proceedings of ICCV
-
-
Venugopalan, S.1
Rohrbach, M.2
Donahue, J.3
Mooney, R.4
Darrell, T.5
Saenko, K.6
-
35
-
-
84959876769
-
Translating videos to natural language using deep recurrent neural networks
-
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In Proceedings of ACL, 2015.
-
(2015)
Proceedings of ACL
-
-
Venugopalan, S.1
Xu, H.2
Donahue, J.3
Rohrbach, M.4
Mooney, R.5
Saenko, K.6
-
37
-
-
84970002232
-
Show, attend and tell: Neural image caption generation with visual attention
-
K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML, 2015.
-
(2015)
Proceedings of ICML
-
-
Xu, K.1
Ba, J.2
Kiros, R.3
Courville, A.4
Salakhutdinov, R.5
Zemel, R.6
Bengio, Y.7
-
38
-
-
80053258778
-
Corpus-guided sentence generation of natural images
-
Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In Proceedings of Intl Conference on Empirical Methods in Natural Language Processing, pages 444-454, 2011.
-
(2011)
Proceedings of Intl Conference on Empirical Methods in Natural Language Processing
, pp. 444-454
-
-
Yang, Y.1
Teo, C.L.2
Daumé, H.3
Aloimonos, Y.4
-
39
-
-
84973884896
-
Describing videos by exploiting temporal structure
-
L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of ICCV, 2015.
-
(2015)
Proceedings of ICCV
-
-
Yao, L.1
Torabi, A.2
Cho, K.3
Ballas, N.4
Pal, C.5
Larochelle, H.6
Courville, A.7
-
40
-
-
84887428782
-
Annotation for free: Video tagging by mining user search behavior
-
T. Yao, T. Mei, C.-W. Ngo, and S. Li. Annotation for free: Video tagging by mining user search behavior. In Proceedings of the 21st ACM international conference on Multimedia, pages 977-986. ACM, 2013.
-
(2013)
Proceedings of the 21st ACM International Conference on Multimedia
, pp. 977-986
-
-
Yao, T.1
Mei, T.2
Ngo, C.-W.3
Li, S.4
-
41
-
-
84906494296
-
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
-
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67-78, 2014.
-
(2014)
Transactions of the Association for Computational Linguistics
, vol.2
, pp. 67-78
-
-
Young, P.1
Lai, A.2
Hodosh, M.3
Hockenmaier, J.4
-
42
-
-
37848999908
-
Building a comprehensive ontology to refine video concept detection
-
Z.-J. Zha, T. Mei, Z. Wang, and X.-S. Hua. Building a comprehensive ontology to refine video concept detection. In Proceedings of the International Workshop on Multimedia Information Retrieval, pages 227-236. ACM, 2007.
-
(2007)
Proceedings of the International Workshop on Multimedia Information Retrieval
, pp. 227-236
-
-
Zha, Z.-J.1
Mei, T.2
Wang, Z.3
Hua, X.-S.4