-
2
-
-
84973890960
-
VQA: Visual question answering
-
2, 3, 7
-
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
-
(2015)
Proc. IEEE Int. Conf. Comp. Vis.
-
-
Antol, S.1
Agrawal, A.2
Lu, J.3
Mitchell, M.4
Batra, D.5
Zitnick, C.L.6
Parikh, D.7
-
5
-
-
84952349295
-
-
arXiv:1504.00325, 6
-
X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
-
(2015)
Microsoft COCO Captions: Data Collection and Evaluation Server
-
-
Chen, X.1
Fang, H.2
Lin, T.-Y.3
Vedantam, R.4
Gupta, S.5
Dollár, P.6
Zitnick, C.L.7
-
7
-
-
84961291190
-
Learning phrase representations using RNN encoder-decoder for statistical machine translation
-
1
-
K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conf. Empirical Methods in Natural Language Processing, 2014.
-
(2014)
Proc. Conf. Empirical Methods in Natural Language Processing
-
-
Cho, K.1
Van Merrienboer, B.2
Gulcehre, C.3
Bougares, F.4
Schwenk, H.5
Bengio, Y.6
-
8
-
-
85198028989
-
ImageNet: A large-scale hierarchical image database
-
4
-
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
-
(2009)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
Deng, J.1
Dong, W.2
Socher, R.3
Li, L.-J.4
Li, K.5
Fei-Fei, L.6
-
9
-
-
84944096380
-
Language models for image captioning: The quirks and what works
-
2
-
J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
-
(2015)
Proc. IEEE Int. Conf. Comp. Vis.
-
-
Devlin, J.1
Cheng, H.2
Fang, H.3
Gupta, S.4
Deng, L.5
He, X.6
Zweig, G.7
Mitchell, M.8
-
10
-
-
84959236502
-
Long-term recurrent convolutional networks for visual recognition and description
-
1, 3, 5, 6, 7
-
J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-
(2015)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
Donahue, J.1
Hendricks, L.A.2
Guadarrama, S.3
Rohrbach, M.4
Venugopalan, S.5
Saenko, K.6
Darrell, T.7
-
11
-
-
84959250180
-
From captions to visual concepts and back
-
2, 3, 4, 5, 6
-
H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-
(2015)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
Fang, H.1
Gupta, S.2
Iandola, F.3
Srivastava, R.4
Deng, L.5
Dollár, P.6
Gao, J.7
He, X.8
Mitchell, M.9
Platt, J.10
-
12
-
-
80052017343
-
Every picture tells a story: Generating sentences from images
-
2, 3
-
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In Proc. Eur. Conf. Comp. Vis., 2010.
-
(2010)
Proc. Eur. Conf. Comp. Vis.
-
-
Farhadi, A.1
Hejrati, M.2
Sadeghi, M.A.3
Young, P.4
Rashtchian, C.5
Hockenmaier, J.6
Forsyth, D.7
-
13
-
-
84965148420
-
Are you talking to a machine? Dataset and methods for multilingual image question answering
-
2, 3, 5
-
H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In Proc. Advances in Neural Inf. Process. Syst., 2015.
-
(2015)
Proc. Advances in Neural Inf. Process. Syst.
-
-
Gao, H.1
Mao, J.2
Zhou, J.3
Huang, Z.4
Wang, L.5
Xu, W.6
-
14
-
-
84862277874
-
Understanding the difficulty of training deep feedforward neural networks
-
4
-
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. Int. Conf. Artificial Intell. & Stat., pages 249-256, 2010.
-
(2010)
Proc. Int. Conf. Artificial Intell. & Stat
, pp. 249-256
-
-
Glorot, X.1
Bengio, Y.2
-
17
-
-
84883394520
-
Framing image description as a ranking task: Data, models and evaluation metrics
-
2, 5
-
M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, pages 853-899, 2013.
-
(2013)
JAIR
, pp. 853-899
-
-
Hodosh, M.1
Young, P.2
Hockenmaier, J.3
-
20
-
-
85009867858
-
-
arXiv:1408.5093, 6
-
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093, 2014.
-
(2014)
Caffe: Convolutional Architecture for Fast Feature Embedding
-
-
Jia, Y.1
Shelhamer, E.2
Donahue, J.3
Karayev, S.4
Long, J.5
Girshick, R.6
Guadarrama, S.7
Darrell, T.8
-
21
-
-
84986312327
-
-
arXiv:1506.06272, 6
-
J. Jin, K. Fu, R. Cui, F. Sha, and C. Zhang. Aligning where to see and what to tell: Image caption with region-based attention and scene factorization. arXiv:1506.06272, 2015.
-
(2015)
Aligning Where to See and What to Tell: Image Caption with Region-based Attention and Scene Factorization
-
-
Jin, J.1
Fu, K.2
Cui, R.3
Sha, F.4
Zhang, C.5
-
26
-
-
85009854844
-
-
2, 3
-
G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. IEEE Trans. Pattern Anal. Mach. Intell.
-
IEEE Trans. Pattern Anal. Mach. Intell.
-
-
Kulkarni, G.1
Premraj, V.2
Ordonez, V.3
Dhar, S.4
Li, S.5
Choi, Y.6
Berg, A.C.7
Berg, T.L.8
-
27
-
-
84878189119
-
Collective generation of natural image descriptions
-
2
-
P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In Proc. Conf. Association for Computational Linguistics, 2012.
-
(2012)
Proc. Conf. Association for Computational Linguistics
-
-
Kuznetsova, P.1
Ordonez, V.2
Berg, A.C.3
Berg, T.L.4
Choi, Y.5
-
29
-
-
0032203257
-
Gradient-based learning applied to document recognition
-
1
-
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278-2324, 1998.
-
(1998)
Proc. IEEE
, vol.86
, Issue.11
, pp. 2278-2324
-
-
LeCun, Y.1
Bottou, L.2
Bengio, Y.3
Haffner, P.4
-
30
-
-
84862279067
-
Composing simple image descriptions using web-scale n-grams
-
2, 3
-
S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011.
-
(2011)
CoNLL
-
-
Li, S.1
Kulkarni, G.2
Berg, T.L.3
Berg, A.C.4
Choi, Y.5
-
31
-
-
85009838903
-
Microsoft COCO: Common objects in context
-
5
-
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., 2014.
-
(2014)
Proc. Eur. Conf. Comp. Vis.
-
-
Lin, T.-Y.1
Maire, M.2
Belongie, S.3
Hays, J.4
Perona, P.5
Ramanan, D.6
Dollár, P.7
Zitnick, C.L.8
-
32
-
-
85007153677
-
Learning to answer questions from image using convolutional neural network
-
3, 7
-
L. Ma, Z. Lu, and H. Li. Learning to Answer Questions From Image using Convolutional Neural Network. In AAAI, 2016.
-
(2016)
AAAI
-
-
Ma, L.1
Lu, Z.2
Li, H.3
-
33
-
-
84937822746
-
A multi-world approach to question answering about real-world scenes based on uncertain input
-
3
-
M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Proc. Advances in Neural Inf. Process. Syst., pages 1682-1690, 2014.
-
(2014)
Proc. Advances in Neural Inf. Process. Syst
, pp. 1682-1690
-
-
Malinowski, M.1
Fritz, M.2
-
36
-
-
85083950512
-
Deep captioning with multimodal recurrent neural networks (m-RNN)
-
1, 2, 4, 5, 6, 7
-
J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In Proc. Int. Conf. Learn. Representations, 2015.
-
(2015)
Proc. Int. Conf. Learn. Representations
-
-
Mao, J.1
Xu, W.2
Yang, Y.3
Wang, J.4
Yuille, A.5
-
37
-
-
84898956512
-
Distributed representations of words and phrases and their compositionality
-
8
-
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Inf. Process. Syst., pages 3111-3119, 2013.
-
(2013)
Proc. Advances in Neural Inf. Process. Syst
, pp. 3111-3119
-
-
Mikolov, T.1
Sutskever, I.2
Chen, K.3
Corrado, G.S.4
Dean, J.5
-
38
-
-
84976702763
-
WordNet: A lexical database for English
-
8
-
G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38 (11): 39-41, 1995.
-
(1995)
Communications of the ACM
, vol.38
, Issue.11
, pp. 39-41
-
-
Miller, G.A.1
-
39
-
-
85034832841
-
Midge: Generating image descriptions from computer vision detections
-
2
-
M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
-
(2012)
EACL
-
-
Mitchell, M.1
Han, X.2
Dodge, J.3
Mensch, A.4
Goyal, A.5
Berg, A.6
Yamaguchi, K.7
Berg, T.8
Stratos, K.9
Daumé, H.10
-
42
-
-
84973900209
-
-
arXiv:1503.00848, March.
-
J. Pont-Tuset, P. Arbeláez, J. Barron, F. Marques, and J. Malik. Multiscale Combinatorial Grouping for Image Segmentation and Object Proposal Generation. arXiv:1503.00848, March 2015.
-
(2015)
Multiscale Combinatorial Grouping for Image Segmentation and Object Proposal Generation
-
-
Pont-Tuset, J.1
Arbeláez, P.2
Barron, J.3
Marques, F.4
Malik, J.5
-
44
-
-
84898775239
-
Translating video content to natural language descriptions
-
3
-
M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In Proc. IEEE Int. Conf. Comp. Vis., 2013.
-
(2013)
Proc. IEEE Int. Conf. Comp. Vis.
-
-
Rohrbach, M.1
Qiu, W.2
Titov, I.3
Thater, S.4
Pinkal, M.5
Schiele, B.6
-
46
-
-
84906925854
-
Grounded compositional semantics for finding and describing images with sentences
-
2
-
R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. Proc. Conf. Association for Computational Linguistics, 2014.
-
(2014)
Proc. Conf. Association for Computational Linguistics
-
-
Socher, R.1
Karpathy, A.2
Le, Q.V.3
Manning, C.D.4
Ng, A.Y.5
-
48
-
-
84937522268
-
Going deeper with convolutions
-
1, 6
-
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
-
(2015)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
Szegedy, C.1
Liu, W.2
Jia, Y.3
Sermanet, P.4
Reed, S.5
Anguelov, D.6
Erhan, D.7
Vanhoucke, V.8
Rabinovich, A.9
-
50
-
-
84939821075
-
Show and tell: A neural image caption generator
-
1, 2, 3, 4, 5, 6, 7
-
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2014.
-
(2014)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
Vinyals, O.1
Toshev, A.2
Bengio, S.3
Erhan, D.4
-
51
-
-
84938908409
-
-
arXiv:1406.5726, 4
-
Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan. CNN: Single-label to multi-label. arXiv:1406.5726, 2014.
-
(2014)
CNN: Single-label to Multi-label
-
-
Wei, Y.1
Xia, W.2
Huang, J.3
Ni, B.4
Dong, J.5
Zhao, Y.6
Yan, S.7
-
53
-
-
84970002232
-
Show, Attend and tell: Neural image caption generation with visual attention
-
2, 5, 6
-
K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proc. Int. Conf. Mach. Learn., 2015.
-
(2015)
Proc. Int. Conf. Mach. Learn.
-
-
Xu, K.1
Ba, J.2
Kiros, R.3
Courville, A.4
Salakhutdinov, R.5
Zemel, R.6
Bengio, Y.7
-
55
-
-
84973884896
-
Describing videos by exploiting temporal structure
-
1
-
L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
-
(2015)
Proc. IEEE Int. Conf. Comp. Vis.
-
-
Yao, L.1
Torabi, A.2
Cho, K.3
Ballas, N.4
Pal, C.5
Larochelle, H.6
Courville, A.7
-
56
-
-
84986317307
-
Image captioning with semantic attention
-
June.
-
Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
-
(2016)
Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
-
-
You, Q.1
Jin, H.2
Wang, Z.3
Fang, C.4
Luo, J.5
-
57
-
-
84906494296
-
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
-
5
-
P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Proc. Conf. Association for Computational Linguistics, 2, 2014.
-
(2014)
Proc. Conf. Association for Computational Linguistics
, vol.2
-
-
Young, P.1
Lai, A.2
Hodosh, M.3
Hockenmaier, J.4
-
59
-
-
84986248327
-
-
arXiv:1507.05670, 3, 8
-
Y. Zhu, C. Zhang, C. Ré, and L. Fei-Fei. Building a Large-scale Multimodal Knowledge Base for Visual Question Answering. arXiv:1507.05670, 2015.
-
(2015)
Building a Large-scale Multimodal Knowledge Base for Visual Question Answering
-
-
Zhu, Y.1
Zhang, C.2
Ré, C.3
Fei-Fei, L.4