[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. arXiv, 2015.
[3] L. Bahl, P. Brown, P. V. de Souza, and R. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In ICASSP, volume 11, pages 49-52, Apr. 1986.
[4] D. P. Barrett, S. A. Bronikowski, H. Yu, and J. M. Siskind. Robot language learning, generation, and comprehension. arXiv preprint arXiv:1508.06161, 2015.
[5] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[6] M.-M. Cheng, S. Zheng, W.-Y. Lin, V. Vineet, P. Sturgess, N. Crook, N. J. Mitra, and P. Torr. ImageSpirit: Verbal guided image parsing. ACM Trans. Graphics, 2014.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[8] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
[9] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[10] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, pages 2155-2162, 2014.
[11] H. J. Escalante, C. A. Hernandez, J. A. Gonzalez, A. Lopez-Lopez, M. Montes, E. F. Morales, L. E. Sucar, L. Villasenor, and M. Grubinger. The segmented and annotated IAPR TC-12 benchmark. CVIU, 2010.
[12] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[13] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15-29, 2010.
[14] N. FitzGerald, Y. Artzi, and L. S. Zettlemoyer. Learning distributions over logical forms for referring expression generation. In EMNLP, pages 1914-1925, 2013.
[15] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. In NIPS, 2015.
[16] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual Turing test for computer vision systems. PNAS, 112(12):3618-3623, 2015.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[18] D. Gkatzia, V. Rieser, P. Bartie, and W. Mackaness. From the virtual to the real world: Referring to objects in real-world spatial scenes. In EMNLP, 2015.
[19] D. Golland, P. Liang, and D. Klein. A game-theoretic approach to generating spatial descriptions. In EMNLP, pages 410-419, 2010.
[20] N. D. Goodman and D. Lassiter. Probabilistic semantics and pragmatics: Uncertainty in language and thought. Handbook of Contemporary Semantic Theory. Wiley-Blackwell, 2014.
[21] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. In ICML, 2015.
[22] H. P. Grice. Logic and conversation. 1970.
[23] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853-899, 2013.
[24] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[27] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, pages 787-798, 2014.
[31] E. Krahmer and K. van Deemter. Computational generation of referring expressions: A survey. Computational Linguistics, 38, 2012.
[32] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[34] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[35] A. Lavie and A. Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Workshop on Statistical Machine Translation, pages 228-231, 2007.
[36] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, pages 220-228, 2011.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[38] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, pages 1682-1690, 2014.
[39] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In NIPS, 2015.
[40] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.
[41] M. Mitchell, K. van Deemter, and E. Reiter. Natural reference to objects in a visual domain. In INLG, pages 95-104, 2010.
[42] M. Mitchell, K. van Deemter, and E. Reiter. Generating expressions that refer to visible objects. In HLT-NAACL, pages 1174-1184, 2013.
[43] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[44] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311-318, 2002.
[45] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[46] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
[47] A. Sadovnik, Y.-I. Chiu, N. Snavely, S. Edelman, and T. Chen. Image description with a goal: Building efficient discriminating expressions for images. In CVPR, 2012.
[48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[49] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. In TACL, 2014.
[50] K. van Deemter, I. van der Sluis, and A. Gatt. Building a semantically transparent corpus for the generation of referring expressions. In INLG, pages 130-132, 2006.
[52] J. Viethen and R. Dale. The use of spatial relations in referring expression generation. In INLG, pages 59-67. Association for Computational Linguistics, 2008.
[53] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[54] T. Winograd. Understanding natural language. Cognitive Psychology, 3(1):1-191, 1972.
[55] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[56] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, pages 444-454, 2011.