[2] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv:1505.01809, 2015. 1
[3] J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, K. Stratos, K. Yamaguchi, Y. Choi, H. Daumé III, A. C. Berg, and T. L. Berg. Detecting visual text. In NAACL, 2012. 8
[4] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, 2014. 1
[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303-338, 2010. 7
[6] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. arXiv:1411.4952, 2014. 1
[7] A. Farhadi, S. Hejrati, A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010. 1, 8
[8] S. Fidler, A. Sharma, and R. Urtasun. A sentence is worth a thousand pixels. In CVPR, 2013. 8
[9] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 106(2):210-233, 2014. 6
[10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014. 1, 5, 6, 7
[11] M. Grubinger, P. Clough, H. Müller, and T. Deselaers. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pages 13-23, 2006. 1, 2
[12] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013. 1
[13] M. Hodosh, P. Young, C. Rashtchian, and J. Hockenmaier. Cross-caption coreference resolution for automatic image understanding. In CoNLL, pages 162-171. ACL, 2010. 3, 8
[14] H. Hotelling. Relations between two sets of variates. Biometrika, pages 321-377, 1936. 5
[15] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, 2014. 1, 5, 6, 7, 8
[17] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014. 1
[18] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 2
[19] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014. 1, 5, 7, 8
[20] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Fisher vectors derived from hybrid Gaussian-Laplacian mixture models for image annotation. In CVPR, 2015. 1, 5, 6, 7, 8
[21] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In CVPR, 2014. 5
[22] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011. 1, 8
[24] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 1
[25] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014. 1, 5, 6, 7, 8
[27] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013. 5
[28] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011. 1
[29] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010. 5
[30] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with "their" names using coreference resolution. In ECCV, 2014. 3
[31] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 139-147. ACL, 2010. 1, 4
[34] W. M. Soon, H. T. Ng, and D. C. Y. Lim. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521-544, 2001. 3
[38] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, 2015. 1
[39] B. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proc. IEEE, 98(8):1485-1508, 2010. 1
[40] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67-78, 2014. 1, 3
[41] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014. 6