SCOPUS Information Retrieval Platform

Proceedings of the IEEE International Conference on Computer Vision

Volume 2015 International Conference on Computer Vision, ICCV 2015, 2015, Pages 4534-4542

Sequence to sequence - Video to text

Authors (6): Venugopalan, Subhashini (a); Rohrbach, Marcus (b, d); Donahue, Jeffrey (b); Mooney, Raymond (a); Darrell, Trevor (b); Saenko, Kate (c)

a UNIVERSITY OF TEXAS AT AUSTIN (United States)

b UNIVERSITY OF CALIFORNIA (United States)

c UNIVERSITY OF MASSACHUSETTS (United States)

d INTERNATIONAL COMPUTER SCIENCE INSTITUTE (United States)

Author keywords

[No Author keywords available]

Indexed keywords

COMPLEX NETWORKS; RECURRENT NEURAL NETWORKS;

COMPLEX DYNAMICS; LANGUAGE MODEL; REAL WORLD VIDEOS; SEQUENCE MODELING; STATE-OF-THE-ART PERFORMANCE; TEMPORAL STRUCTURES; VARIABLE LENGTH; VISUAL FEATURE;

COMPUTER VISION;

EID: 84973882730; PISSN: 15505499; EISSN: None; Source Type: Conference Proceeding
DOI: 10.1109/ICCV.2015.515; Document Type: Conference Paper

Times cited: 1483

References (44)

1
- 77951155435
- Video2text: Learning to annotate video content
- H. Aradhye, G. Toderici, and J. Yagnik. Video2text: Learning to annotate video content. In ICDMW, 2009.
- (2009) ICDMW
- Aradhye, H.¹ Toderici, G.² Yagnik, J.³

2
- 35048833329
- High accuracy optical flow estimation based on a theory for warping
- T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, pages 25-36, 2004.
- (2004) ECCV , pp. 25-36
- Brox, T.¹ Bruhn, A.² Papenberg, N.³ Weickert, J.⁴

3
- 84859089502
- Collecting highly parallel data for paraphrase evaluation
- D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
- (2011) ACL
- Chen, D.L.¹ Dolan, W.B.²

4
- 84952349295
- arXiv:1504.00325
- X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv:1504.00325, 2015.
- (2015) Microsoft COCO Captions: Data Collection and Evaluation Server
- Chen, X.¹ Fang, H.² Lin, T.-Y.³ Vedantam, R.⁴ Gupta, S.⁵ Dollár, P.⁶ Zitnick, C.L.⁷

5
- 84957029470
- Learning a recurrent visual representation for image caption generation
- X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. CVPR, 2015.
- (2015) CVPR
- Chen, X.¹ Zitnick, C.L.²

6
- 84943799837
- arXiv:1409.1259
- K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259, 2014.
- (2014) On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
- Cho, K.¹ Van Merriënboer, B.² Bahdanau, D.³ Bengio, Y.⁴

7
- 85107661995
- Meteor universal: Language specific translation evaluation for any target language
- M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In EACL, 2014.
- (2014) EACL
- Denkowski, M.¹ Lavie, A.²

8
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
- (2015) CVPR
- Donahue, J.¹ Hendricks, L.A.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

9
- 84977644905
- G. Gkioxari and J. Malik. Finding action tubes. 2014.
- (2014) Finding Action Tubes.
- Gkioxari, G.¹ Malik, J.²

10
- 84919832465
- Towards end-to-end speech recognition with recurrent neural networks
- A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014.
- (2014) ICML
- Graves, A.¹ Jaitly, N.²

11
- 84898773262
- Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition
- S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
- (2013) ICCV
- Guadarrama, S.¹ Krishnamoorthy, N.² Malkarnenkar, G.³ Venugopalan, S.⁴ Mooney, R.⁵ Darrell, T.⁶ Saenko, K.⁷

12
- 0031573117
- Long short-term memory
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8), 1997.
- (1997) Neural Computation , vol.9 , Issue.8
- Hochreiter, S.¹ Schmidhuber, J.²

13
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- P. Hodosh, A. Young, M. Lai, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014.
- (2014) TACL
- Hodosh, P.¹ Young, A.² Lai, M.³ Hockenmaier, J.⁴

14
- 85072312550
- A multi-modal clustering method for web videos
- H. Huang, Y. Lu, F. Zhang, and S. Sun. A multi-modal clustering method for web videos. In ISCTCS, 2013.
- (2013) ISCTCS
- Huang, H.¹ Lu, Y.² Zhang, F.³ Sun, S.⁴

15
- 84913580146
- Caffe: Convolutional architecture for fast feature embedding
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. ACMMM, 2014.
- (2014) ACMMM
- Jia, Y.¹ Shelhamer, E.² Donahue, J.³ Karayev, S.⁴ Long, J.⁵ Girshick, R.⁶ Guadarrama, S.⁷ Darrell, T.⁸

16
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.
- (2015) CVPR
- Karpathy, A.¹ Fei-Fei, L.²

17
- 84941620184
- arXiv preprint arXiv:1412.6980
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (2014) Adam: A Method for Stochastic Optimization
- Kingma, D.¹ Ba, J.²

18
- 84944113729
- arXiv:1411.2539
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
- (2014) Unifying Visual-semantic Embeddings with Multimodal Neural Language Models
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

19
- 84893398951
- Generating natural-language video descriptions using text-mined knowledge
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013.
- (2013) AAAI
- Krishnamoorthy, N.¹ Malkarnenkar, G.² Mooney, R.J.³ Saenko, K.⁴ Guadarrama, S.⁵

20
- 84934873221
- Treetalk: Composition and compression of trees for image descriptions
- P. Kuznetsova, V. Ordonez, T. L. Berg, U. C. Hill, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. In TACL, 2014.
- (2014) TACL
- Kuznetsova, P.¹ Ordonez, V.² Berg, T.L.³ Hill, U.C.⁴ Choi, Y.⁵

21
- 84964930561
- Rouge: A package for automatic evaluation of summaries
- C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74-81, 2004.
- (2004) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop , pp. 74-81
- Lin, C.-Y.¹

22
- 84937834115
- Microsoft COCO: Common objects in context
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- (2014) ECCV
- Lin, T.-Y.¹ Maire, M.² Belongie, S.³ Hays, J.⁴ Perona, P.⁵ Ramanan, D.⁶ Dollár, P.⁷ Zitnick, C.L.⁸

23
- 84939821073
- arXiv:1412.6632
- J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632, 2014.
- (2014) Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
- Mao, J.¹ Xu, W.² Yang, Y.³ Wang, J.⁴ Yuille, A.L.⁵

24
- 84959228762
- Beyond short snippets: Deep networks for video classification
- J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. CVPR, 2015.
- (2015) CVPR
- Ng, J.Y.¹ Hausknecht, M.J.² Vijayanarasimhan, S.³ Vinyals, O.⁴ Monga, R.⁵ Toderici, G.⁶

25
- 84905274625
- TRECVID 2012-an overview of the goals, tasks, data, evaluation mechanisms and metrics
- P. Over, G. Awad, M. Michel, J. Fiscus, G. Sanders, B. Shaw, A. F. Smeaton, and G. Quénot. TRECVID 2012-an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012, 2012.
- (2012) Proceedings of TRECVID 2012
- Over, P.¹ Awad, G.² Michel, M.³ Fiscus, J.⁴ Sanders, G.⁵ Shaw, B.⁶ Smeaton, A.F.⁷ Quénot, G.⁸

26
- 85133336275
- Bleu: A method for automatic evaluation of machine translation
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In ACL, 2002.
- (2002) ACL
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.-J.⁴

27
- 84973887740
- The long-short story of movie description
- A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. GCPR, 2015.
- (2015) GCPR
- Rohrbach, A.¹ Rohrbach, M.² Schiele, B.³

28
- 84959211977
- A dataset for movie description
- A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015.
- (2015) CVPR
- Rohrbach, A.¹ Rohrbach, M.² Tandon, N.³ Schiele, B.⁴

29
- 84898775239
- Translating video content to natural language descriptions
- M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
- (2013) ICCV
- Rohrbach, M.¹ Qiu, W.² Titov, I.³ Thater, S.⁴ Pinkal, M.⁵ Schiele, B.⁶

30
- 84909978410
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ILSVRC, 2014.
- (2014) ILSVRC
- Russakovsky, O.¹ Deng, J.² Su, H.³ Krause, J.⁴ Satheesh, S.⁵ Ma, S.⁶ Huang, Z.⁷ Karpathy, A.⁸ Khosla, A.⁹ Bernstein, M.¹⁰ Berg, A.C.¹¹ Fei-Fei, L.¹²

31
- 84937862424
- Two-stream convolutional networks for action recognition in videos
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- (2014) NIPS
- Simonyan, K.¹ Zisserman, A.²

32
- 84925410541
- CoRR, abs/1409.1556
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- (2014) Very Deep Convolutional Networks for Large-scale Image Recognition
- Simonyan, K.¹ Zisserman, A.²

33
- 84969544782
- Unsupervised learning of video representations using LSTMs
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. ICML, 2015.
- (2015) ICML
- Srivastava, N.¹ Mansimov, E.² Salakhutdinov, R.³

34
- 84928547704
- Sequence to sequence learning with neural networks
- I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- (2014) NIPS
- Sutskever, I.¹ Vinyals, O.² Le, Q.V.³

35
- 84937522268
- Going deeper with convolutions
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
- (2015) CVPR
- Szegedy, C.¹ Liu, W.² Jia, Y.³ Sermanet, P.⁴ Reed, S.⁵ Anguelov, D.⁶ Erhan, D.⁷ Vanhoucke, V.⁸ Rabinovich, A.⁹

36
- 84959932469
- Integrating language and vision to generate natural language descriptions of videos in the wild
- J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney. Integrating language and vision to generate natural language descriptions of videos in the wild. In COLING, 2014.
- (2014) COLING
- Thomason, J.¹ Venugopalan, S.² Guadarrama, S.³ Saenko, K.⁴ Mooney, R.J.⁵

37
- 84959246420
- arXiv:1503.01070v1
- A. Torabi, C. Pal, H. Larochelle, and A. Courville. Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070v1, 2015.
- (2015) Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research
- Torabi, A.¹ Pal, C.² Larochelle, H.³ Courville, A.⁴

38
- 84956980995
- CIDEr: Consensus-based image description evaluation
- R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CVPR, 2015.
- (2015) CVPR
- Vedantam, R.¹ Zitnick, C.L.² Parikh, D.³

39
- 84959876769
- Translating videos to natural language using deep recurrent neural networks
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL, 2015.
- (2015) NAACL
- Venugopalan, S.¹ Xu, H.² Donahue, J.³ Rohrbach, M.⁴ Mooney, R.⁵ Saenko, K.⁶

40
- 84946747440
- Show and tell: A neural image caption generator
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CVPR, 2015.
- (2015) CVPR
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

41
- 84898805910
- Action recognition with improved trajectories
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551-3558. IEEE, 2013.
- (2013) ICCV , pp. 3551-3558
- Wang, H.¹ Schmid, C.²

42
- 77954177620
- Multimodal fusion for video search reranking
- S. Wei, Y. Zhao, Z. Zhu, and N. Liu. Multimodal fusion for video search reranking. TKDE, 2010.
- (2010) TKDE
- Wei, S.¹ Zhao, Y.² Zhu, Z.³ Liu, N.⁴

43
- 84965160010
- arXiv:1502.08029v4
- L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. arXiv:1502.08029v4, 2015.
- (2015) Describing Videos by Exploiting Temporal Structure
- Yao, L.¹ Torabi, A.² Cho, K.³ Ballas, N.⁴ Pal, C.⁵ Larochelle, H.⁶ Courville, A.⁷

44
- 84958234084
- arXiv:1410.4615
- W. Zaremba and I. Sutskever. Learning to execute. arXiv:1410.4615, 2014.
- (2014) Learning to Execute
- Zaremba, W.¹ Sutskever, I.²

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.