SCOPUS Information Retrieval Platform

Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017

Volume 2017-January, 2017, Pages 7244-7253

ViP-CNN: Visual phrase guided convolutional neural network

(4) Li, Yikang a; Ouyang, Wanli a,b; Wang, Xiaogang a; Tang, Xiao'ou a,c

a CHINESE UNIVERSITY OF HONG KONG (Hong Kong)

b UNIVERSITY OF SYDNEY (Australia)

c SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY (China)

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER VISION; CONVOLUTION; MESSAGE PASSING; NEURAL NETWORKS;

CONVOLUTIONAL NEURAL NETWORK; IMAGE CAPTIONING; INTERMEDIATE LEVEL; MODEL TRAINING; NON-MAXIMUM SUPPRESSION; PAIRWISE INTERACTION; STATE-OF-ART METHODS; THREE COMPONENT;

OBJECT DETECTION;

EID: 85041906062 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/CVPR.2017.766 Document Type: Conference Paper

Times cited (224)

References (51)

1
- 84866688216
- Measuring the objectness of image windows
- B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. TPAMI, 2012.
- (2012) TPAMI
- Alexe, B.¹ Deselaers, T.² Ferrari, V.³

2
- 84973890960
- Vqa: Visual question answering
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In ICCV, 2015.
- (2015) ICCV
- Antol, S.¹ Agrawal, A.² Lu, J.³ Mitchell, M.⁴ Batra, D.⁵ Lawrence Zitnick, C.⁶ Parikh, D.⁷

3
- 85040312797
- Y. Atzmon, J. Berant, V. Kezami, A. Globerson, and G. Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016.
- (2016) Learning to Generalize to New Compositions in Image Understanding
- Atzmon, Y.¹ Berant, J.² Kezami, V.³ Globerson, A.⁴ Chechik, G.⁵

4
- 0041876117
- Matching words and pictures
- K. Barnard, P. Duygulu, D. Forsyth, N. d. Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. JMLR, 2003.
- (2003) JMLR
- Barnard, K.¹ Duygulu, P.² Forsyth, D.³ Freitas, N.D.⁴ Blei, D.M.⁵ Jordan, M.I.⁶

5
- 84944115859
- X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
- (2014) Learning a Recurrent Visual Representation for Image Caption Generation
- Chen, X.¹ Zitnick, C.L.²

6
- 85006017265
- X. Chu, W. Ouyang, H. Li, and X. Wang. Structured feature learning for pose estimation. arXiv preprint arXiv:1603.09065, 2016.
- (2016) Structured Feature Learning for Pose Estimation
- Chu, X.¹ Ouyang, W.² Li, H.³ Wang, X.⁴

7
- 85018891340
- Crf-cnn: Modeling structured information in human pose estimation
- X. Chu, W. Ouyang, X. Wang, et al. Crf-cnn: Modeling structured information in human pose estimation. In NIPS, 2016.
- (2016) NIPS
- Chu, X.¹ Ouyang, W.² Wang, X.³

8
- 84877748784
- Detecting actions, poses, and objects with relational phraselets
- C. Desai and D. Ramanan. Detecting actions, poses, and objects with relational phraselets. In ECCV, 2012.
- (2012) ECCV
- Desai, C.¹ Ramanan, D.²

9
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
- (2015) CVPR
- Donahue, J.¹ Anne Hendricks, L.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

10
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
- (2015) CVPR
- Donahue, J.¹ Anne Hendricks, L.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

11
- 84959250180
- From captions to visual concepts and back
- H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
- (2015) CVPR
- Fang, H.¹ Gupta, S.² Iandola, F.³ Srivastava, R.K.⁴ Deng, L.⁵ Dollár, P.⁶ Gao, J.⁷ He, X.⁸ Mitchell, M.⁹ Platt, J.C.¹⁰

12
- 80052017343
- Every picture tells a story: Generating sentences from images
- A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
- (2010) ECCV
- Farhadi, A.¹ Hejrati, M.² Sadeghi, M.A.³ Young, P.⁴ Rashtchian, C.⁵ Hockenmaier, J.⁶ Forsyth, D.⁷

13
- 85029359197
- Fast r-cnn
- R. Girshick. Fast r-cnn. In ICCV, 2015.
- (2015) ICCV
- Girshick, R.¹

14
- 84911400494
- Rich feature hierarchies for accurate object detection and semantic segmentation
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- (2014) CVPR
- Girshick, R.¹ Donahue, J.² Darrell, T.³ Malik, J.⁴

15
- 84959195179
- Deformable part models are convolutional neural networks
- R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
- (2015) CVPR
- Girshick, R.¹ Iandola, F.² Darrell, T.³ Malik, J.⁴

16
- 70450155469
- Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers
- A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
- (2008) ECCV
- Gupta, A.¹ Davis, L.S.²

17
- 84959229874
- Spatial pyramid pooling in deep convolutional networks for visual recognition
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In CVPR, 2014.
- (2014) CVPR
- He, K.¹ Zhang, X.² Ren, S.³ Sun, J.⁴

18
- 0031573117
- Long short-term memory
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
- (1997) Neural Computation
- Hochreiter, S.¹ Schmidhuber, J.²

19
- 84856653718
- Learning cross-modality similarity for multinomial data
- Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In ICCV, 2011.
- (2011) ICCV
- Jia, Y.¹ Salzmann, M.² Darrell, T.³

20
- 85009867858
- Caffe: Convolutional architecture for fast feature embedding
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
- (2014) ACM MM
- Jia, Y.¹ Shelhamer, E.² Donahue, J.³ Karayev, S.⁴ Long, J.⁵ Girshick, R.⁶ Guadarrama, S.⁷ Darrell, T.⁸

21
- 84986278097
- J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.
- (2015) Densecap: Fully Convolutional Localization Networks for Dense Captioning
- Johnson, J.¹ Karpathy, A.² Fei-Fei, L.³

22
- 85041925966
- Object detection in videos with tubelet proposal networks
- K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang. Object detection in videos with tubelet proposal networks. In CVPR, 2017.
- (2017) CVPR
- Kang, K.¹ Li, H.² Xiao, T.³ Ouyang, W.⁴ Yan, J.⁵ Liu, X.⁶ Wang, X.⁷

23
- 84986301354
- K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532, 2016.
- (2016) T-cnn: Tubelets with Convolutional Neural Networks for Object Detection from Videos
- Kang, K.¹ Li, H.² Yan, J.³ Zeng, X.⁴ Yang, B.⁵ Xiao, T.⁶ Zhang, C.⁷ Wang, Z.⁸ Wang, R.⁹ Wang, X.¹⁰

24
- 84986331475
- Object detection from video tubelets with convolutional neural networks
- K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016.
- (2016) CVPR
- Kang, K.¹ Ouyang, W.² Li, H.³ Wang, X.⁴

25
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- (2015) CVPR
- Karpathy, A.¹ Fei-Fei, L.²

26
- 85162351107
- Efficient inference in fully connected crfs with Gaussian edge potentials
- V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. NIPS, 2011.
- (2011) NIPS
- Koltun, V.¹

27
- 84978730111
- R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
- (2016) Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
- Krishna, R.¹ Zhu, Y.² Groth, O.³ Johnson, J.⁴ Hata, K.⁵ Kravitz, J.⁶ Chen, S.⁷ Kalantidis, Y.⁸ Li, L.-J.⁹ Shamma, D.A.¹⁰

28
- 84876231242
- Imagenet classification with deep convolutional neural networks
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
- (2012) NIPS , pp. 1097-1105
- Krizhevsky, A.¹ Sutskever, I.² Hinton, G.E.³

29
- 84887601544
- Babytalk: Understanding and generating simple image descriptions
- G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. TPAMI, 2013.
- (2013) TPAMI
- Kulkarni, G.¹ Premraj, V.² Ordonez, V.³ Dhar, S.⁴ Li, S.⁵ Choi, Y.⁶ Berg, A.C.⁷ Berg, T.L.⁸

30
- 77955997860
- Efficiently selecting regions for scene understanding
- M. P. Kumar and D. Koller. Efficiently selecting regions for scene understanding. In CVPR, 2010.
- (2010) CVPR
- Kumar, M.P.¹ Koller, D.²

31
- 84907331257
- Generalizing image captions for image-text parallel corpus
- P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Generalizing image captions for image-text parallel corpus. In ACL, 2013.
- (2013) ACL
- Kuznetsova, P.¹ Ordonez, V.² Berg, A.C.³ Berg, T.L.⁴ Choi, Y.⁵

32
- 85011302702
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. Ssd: Single shot multibox detector. arXiv preprint arXiv:1512.02325, 2015.
- (2015) Ssd: Single Shot Multibox Detector
- Liu, W.¹ Anguelov, D.² Erhan, D.³ Szegedy, C.⁴ Reed, S.⁵

33
- 85035234967
- Visual relationship detection with language priors
- C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
- (2016) ECCV
- Lu, C.¹ Krishna, R.² Bernstein, M.³ Fei-Fei, L.⁴

34
- 85083951332
- T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- (2013) Efficient Estimation of Word Representations in Vector Space
- Mikolov, T.¹ Chen, K.² Corrado, G.³ Dean, J.⁴

35
- 84948382785
- Deepid-net: Deformable deep convolutional neural networks for object detection
- W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al. Deepid-net: Deformable deep convolutional neural networks for object detection. In CVPR, 2015.
- (2015) CVPR
- Ouyang, W.¹ Wang, X.² Zeng, X.³ Qiu, S.⁴ Luo, P.⁵ Tian, Y.⁶ Li, H.⁷ Yang, S.⁸ Wang, Z.⁹ Loy, C.-C.¹⁰

36
- 84961917629
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
- (2015) You Only Look Once: Unified, Real-time Object Detection
- Redmon, J.¹ Divvala, S.² Girshick, R.³ Farhadi, A.⁴

37
- 84960980241
- Faster r-cnn: Towards real-time object detection with region proposal networks
- S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- (2015) NIPS
- Ren, S.¹ He, K.² Girshick, R.³ Sun, J.⁴

38
- 84986327251
- A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. arXiv preprint arXiv:1511.03745, 2015.
- (2015) Grounding of Textual Phrases in Images by Reconstruction
- Rohrbach, A.¹ Rohrbach, M.² Hu, R.³ Darrell, T.⁴ Schiele, B.⁵

39
- 84947041871
- Imagenet large scale visual recognition challenge
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 2015.
- (2015) IJCV
- Russakovsky, O.¹ Deng, J.² Su, H.³ Krause, J.⁴ Satheesh, S.⁵ Ma, S.⁶ Huang, Z.⁷ Karpathy, A.⁸ Khosla, A.⁹ Bernstein, M.¹⁰

40
- 33845596932
- Using multiple segmentations to discover objects and their extent in image collections
- B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In CVPR, 2006.
- (2006) CVPR
- Russell, B.C.¹ Freeman, W.T.² Efros, A.A.³ Sivic, J.⁴ Zisserman, A.⁵

41
- 80052889458
- Recognition using visual phrases
- M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
- (2011) CVPR
- Sadeghi, M.A.¹ Farhadi, A.²

42
- 84925410541
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (2014) Very Deep Convolutional Networks for Large-scale Image Recognition
- Simonyan, K.¹ Zisserman, A.²

43
- 84925410541
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (2014) Very Deep Convolutional Networks for Large-scale Image Recognition
- Simonyan, K.¹ Zisserman, A.²

44
- 77955998009
- Connecting modalities: Semisupervised segmentation and annotation of images using unaligned text corpora
- R. Socher and L. Fei-Fei. Connecting modalities: Semisupervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010.
- (2010) CVPR
- Socher, R.¹ Fei-Fei, L.²

45
- 0000903748
- Generalization of backpropagation with application to a recurrent gas market model
- P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1988.
- (1988) Neural Networks
- Werbos, P.J.¹

46
- 84939821074
- K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
- (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Xu, K.¹ Ba, J.² Kiros, R.³ Cho, K.⁴ Courville, A.⁵ Salakhutdinov, R.⁶ Zemel, R.S.⁷ Bengio, Y.⁸

47
- 85041926785
- L. Yang, K. Tang, J. Yang, and L.-J. Li. Dense captioning with joint inference and visual context. arXiv preprint arXiv:1611.06949, 2016.
- (2016) Dense Captioning with Joint Inference and Visual Context
- Yang, L.¹ Tang, K.² Yang, J.³ Li, L.-J.⁴

48
- 84998809208
- Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.
- (2015) Stacked Attention Networks for Image Question Answering
- Yang, Z.¹ He, X.² Gao, J.³ Deng, L.⁴ Smola, A.⁵

49
- 84995439884
- Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. arXiv preprint arXiv:1603.03925, 2016.
- (2016) Image Captioning with Semantic Attention
- You, Q.¹ Jin, H.² Wang, Z.³ Fang, C.⁴ Luo, J.⁵

50
- 84973861983
- Conditional random fields as recurrent neural networks
- S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
- (2015) ICCV
- Zheng, S.¹ Jayasumana, S.² Romera-Paredes, B.³ Vineet, V.⁴ Su, Z.⁵ Du, D.⁶ Huang, C.⁷ Torr, P.H.⁸

51
- 84990038229
- Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. arXiv preprint arXiv:1511.03416, 2015.
- (2015) Visual7w: Grounded Question Answering in Images
- Zhu, Y.¹ Groth, O.² Bernstein, M.³ Fei-Fei, L.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.