[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[2] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Weakly supervised action labeling in videos under ordering constraints. In ECCV, 2014.
[3] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[5] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In ECCV, 2008.
[6] M. Everingham, J. Sivic, and A. Zisserman. "Hello! My name is... Buffy" - Automatic naming of characters in TV video. In BMVC, pages 899-908, 2006.
[7] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences for images. In ECCV, 2010.
[8] S. Fidler, A. Sharma, and R. Urtasun. A sentence is worth a thousand pixels. In CVPR, 2013.
[9] A. Gupta and L. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[11] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853-899, 2013.
[12] N. Kalchbrenner and P. Blunsom. Recurrent continuous translation models. In EMNLP, pages 1700-1709, 2013.
[13] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. CoRR, abs/1411.2539, 2014.
[16] R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. arXiv preprint, 2015.
[17] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In CVPR, 2014.
[18] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[19] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual semantic search: Retrieving videos via complex textual queries. In CVPR, pages 2657-2664, 2014.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755, 2014.
[21] X. Lin and D. Parikh. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2015.
[22] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[23] J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy. What's cookin'? Interpreting cooking videos using text, speech and vision. In NAACL, 2015.
[24] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
[25] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[26] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311-318, 2002.
[28] V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking people in videos with "their" names using coreference resolution. In ECCV, pages 95-110, 2014.
[29] V. Ramanathan, P. Liang, and L. Fei-Fei. Video event understanding using natural language descriptions. In ICCV, 2013.
[30] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A dataset for movie description. In CVPR, 2015.
[31] P. Sankar, C. V. Jawahar, and A. Zisserman. Subtitle-free movie to script alignment. In BMVC, 2009.
[32] A. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Efficient structured prediction with latent variables for general graphical models. In ICML, 2012.
[33] J. Sivic, M. Everingham, and A. Zisserman. "Who are you?" - Learning person specific classifiers from video. In CVPR, pages 1145-1152, 2009.
[34] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2:207-218, 2014.
[35] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[37] M. Tapaswi, M. Bauml, and R. Stiefelhagen. Book2Movie: Aligning video scenes with book chapters. In CVPR, 2015.
[38] M. Tapaswi, M. Bauml, and R. Stiefelhagen. Aligning plot synopses to videos for story-based retrieval. IJMIR, 4:3-16, 2015.
[39] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. CoRR, abs/1312.6229, 2014.
[40] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555, 2014.
[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv:1502.03044, 2015.
[42] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.