SCOPUS Information Retrieval Platform

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

Volume 2016-December, 2016, Pages 4631-4640

MovieQA: Understanding stories in movies through question-answering

(6) Tapaswi, Makarand a; Zhu, Yukun c; Stiefelhagen, Rainer a; Torralba, Antonio b; Urtasun, Raquel c; Fidler, Sanja c

a KARLSRUHE INSTITUTE OF TECHNOLOGY (Germany)

b MASSACHUSETTS INSTITUTE OF TECHNOLOGY (United States)

c UNIVERSITY OF TORONTO (Canada)

Author keywords

[No Author keywords available]

Indexed keywords

COMPUTER VISION; SEMANTICS;

DATA SET; MULTIPLE SOURCE; QUESTION ANSWERING; VIDEO CLIPS;

PATTERN RECOGNITION;

EID: 84986296727 PISSN: 10636919 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/CVPR.2016.501 Document Type: Conference Paper

Times cited : (806)

References (50)

1
- 84973890960
- VQA: Visual Question Answering
- S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
- (2015) ICCV
- Antol, S.¹ Agrawal, A.² Lu, J.³ Mitchell, M.⁴ Batra, D.⁵ Zitnick, C.L.⁶ Parikh, D.⁷

2
- 84887366672
- Semisupervised Learning with Constraints for Person Identification in Multimedia Data
- M. Baeuml, M. Tapaswi, and R. Stiefelhagen. Semisupervised Learning with Constraints for Person Identification in Multimedia Data. In CVPR, 2013.
- (2013) CVPR
- Baeuml, M.¹ Tapaswi, M.² Stiefelhagen, R.³

3
- 84885996388
- Video-in-sentences out
- A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang. Video-In-sentences Out. In UAI, 2012.
- (2012) UAI
- Barbu, A.¹ Bridge, A.² Burchill, Z.³ Coroian, D.⁴ Dickinson, S.⁵ Fidler, S.⁶ Michaux, A.⁷ Mussman, S.⁸ Narayanaswamy, S.⁹ Salvi, D.¹⁰ Schmidt, L.¹¹ Shangguan, J.¹² Siskind, J.¹³ Waggoner, J.¹⁴ Wang, S.¹⁵ Wei, J.¹⁶ Yin, Y.¹⁷ Zhang, Z.¹⁸

4
- 84898792367
- Finding actors and actions in movies
- P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic. Finding Actors and Actions in Movies. ICCV, pages 2280-2287, 2013.
- (2013) ICCV , pp. 2280-2287
- Bojanowski, P.¹ Bach, F.² Laptev, I.³ Ponce, J.⁴ Schmid, C.⁵ Sivic, J.⁶

5
- 84859089502
- Collecting highly parallel data for paraphrase evaluation
- D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
- (2011) ACL
- Chen, D.L.¹ Dolan, W.B.²

6
- 84944115859
- arXiv:1411.5654
- X. Chen and C. L. Zitnick. Learning a Recurrent Visual Representation for Image Caption Generation. In arXiv:1411.5654, 2014.
- (2014) Learning A Recurrent Visual Representation for Image Caption Generation
- Chen, X.¹ Zitnick, C.L.²

7
- 70450145539
- Movie/script: Alignment and parsing of video and text transcription
- T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/Script: Alignment and Parsing of Video and Text Transcription. In ECCV, 2008.
- (2008) ECCV
- Cour, T.¹ Jordan, C.² Miltsakaki, E.³ Taskar, B.⁴

8
- 84887345951
- A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching
- P. Das, C. Xu, R. F. Doell, and J. J. Corso. A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. CVPR, 2013.
- (2013) CVPR
- Das, P.¹ Xu, C.² Doell, R.F.³ Corso, J.J.⁴

9
- 84887345951
- A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching
- P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In CVPR, 2013.
- (2013) CVPR
- Das, P.¹ Xu, C.² Doell, R.F.³ Corso, J.J.⁴

10
- 85009912425
- arXiv:1411.4389
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In arXiv:1411.4389, 2014.
- (2014) Long-term Recurrent Convolutional Networks for Visual Recognition and Description
- Donahue, J.¹ Hendricks, L.A.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

11
- 80051961229
- Every picture tells a story: Generating sentences for images
- A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every Picture Tells a Story: Generating Sentences for Images. In ECCV, 2010.
- (2010) ECCV
- Farhadi, A.¹ Hejrati, M.² Sadeghi, M.³ Young, P.⁴ Rashtchian, C.⁵ Hockenmaier, J.⁶ Forsyth, D.⁷

12
- 84959928474
- arXiv:1506.03340
- K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching Machines to Read and Comprehend. In arXiv:1506.03340, 2015.
- (2015) Teaching Machines to Read and Comprehend
- Hermann, K.M.¹ Kocisky, T.² Grefenstette, E.³ Espeholt, L.⁴ Kay, W.⁵ Suleyman, M.⁶ Blunsom, P.⁷

13
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. In CVPR, 2015.
- (2015) CVPR
- Karpathy, A.¹ Fei-Fei, L.²

14
- 84941620184
- arXiv:1412.6980
- D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
- (2014) Adam: A Method for Stochastic Optimization
- Kingma, D.¹ Ba, J.²

15
- 84952349298
- Unifying visual-semantic embeddings with multimodal neural language models
- R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. TACL, 2015.
- (2015) TACL
- Kiros, R.¹ Salakhutdinov, R.² Zemel, R.S.³

16
- 84965153327
- Skip-Thought Vectors
- R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-Thought Vectors. NIPS, 2015.
- (2015) NIPS
- Kiros, R.¹ Zhu, Y.² Salakhutdinov, R.³ Zemel, R.⁴ Torralba, A.⁵ Urtasun, R.⁶ Fidler, S.⁷

17
- 84911370987
- What are you talking about? Text-to-image coreference
- C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-Image Coreference. In CVPR, 2014.
- (2014) CVPR
- Kong, C.¹ Lin, D.² Bansal, M.³ Urtasun, R.⁴ Fidler, S.⁵

18
- 84893398951
- Generating natural-language video descriptions using text-mined knowledge
- July
- N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge. In AAAI, July 2013.
- (2013) AAAI
- Krishnamoorthy, N.¹ Malkarnenkar, G.² Mooney, R.J.³ Saenko, K.⁴ Guadarrama, S.⁵

19
- 80052901011
- Baby talk: Understanding and generating simple image descriptions
- G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg. Baby Talk: Understanding and Generating Simple Image Descriptions. In CVPR, 2011.
- (2011) CVPR
- Kulkarni, G.¹ Premraj, V.² Dhar, S.³ Li, S.⁴ Choi, Y.⁵ Berg, A.⁶ Berg, T.⁷

20
- 84877085938
- Learning dependency-based compositional semantics
- P. Liang, M. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Computational Linguistics, 2013.
- (2013) Computational Linguistics
- Liang, P.¹ Jordan, M.² Klein, D.³

21
- 84911442106
- Visual semantic search: Retrieving videos via complex textual queries
- D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual Semantic Search: Retrieving Videos via Complex Textual Queries. CVPR, 2014.
- (2014) CVPR
- Lin, D.¹ Fidler, S.² Kong, C.³ Urtasun, R.⁴

22
- 85009931853
- Microsoft COCO: Common Objects in Context
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
- (2014) ECCV
- Lin, T.-Y.¹ Maire, M.² Belongie, S.³ Hays, J.⁴ Perona, P.⁵ Ramanan, D.⁶ Dollár, P.⁷ Zitnick, C.L.⁸

23
- 84937822746
- A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input
- M. Malinowski and M. Fritz. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. In NIPS, 2014.
- (2014) NIPS
- Malinowski, M.¹ Fritz, M.²

24
- 84973896625
- Ask your neurons: A neural-based approach to answering questions about images
- M. Malinowski, M. Rohrbach, and M. Fritz. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. In ICCV, 2015.
- (2015) ICCV
- Malinowski, M.¹ Rohrbach, M.² Fritz, M.³

25
- 85083951332
- arXiv preprint arXiv:1301.3781
- T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- (2013) Efficient Estimation of Word Representations in Vector Space
- Mikolov, T.¹ Chen, K.² Corrado, G.³ Dean, J.⁴

26
- 85162522202
- Im2Text: Describing images using 1 million captioned photographs
- V. Ordonez, G. Kulkarni, and T. Berg. Im2Text: Describing Images Using 1 Million Captioned Photographs. In NIPS, 2011.
- (2011) NIPS
- Ordonez, V.¹ Kulkarni, G.² Berg, T.³

27
- 84943813241
- arXiv.org, jun
- H. Pirsiavash, C. Vondrick, and A. Torralba. Inferring the Why in Images. arXiv.org, jun 2014.
- (2014) Inferring the Why in Images
- Pirsiavash, H.¹ Vondrick, C.² Torralba, A.³

28
- 84943782750
- Linking People in Videos with "Their" Names Using Coreference Resolution
- V. Ramanathan, A. Joulin, P. Liang, and L. Fei-Fei. Linking People in Videos with "Their" Names Using Coreference Resolution. In ECCV, 2014.
- (2014) ECCV
- Ramanathan, V.¹ Joulin, A.² Liang, P.³ Fei-Fei, L.⁴

29
- 84898775557
- Video Event Understanding using Natural Language Descriptions
- V. Ramanathan, P. Liang, and L. Fei-Fei. Video Event Understanding using Natural Language Descriptions. In ICCV, 2013.
- (2013) ICCV
- Ramanathan, V.¹ Liang, P.² Fei-Fei, L.³

30
- 84986256401
- arXiv:1505.02074
- M. Ren, R. Kiros, and R. Zemel. Exploring Models and Data for Image Question Answering. arXiv:1505.02074, 2015.
- (2015) Exploring Models and Data for Image Question Answering
- Ren, M.¹ Kiros, R.² Zemel, R.³

31
- 84926345282
- Mctest: A challenge dataset for the open-domain machine comprehension of text
- M. Richardson, C. J. Burges, and E. Renshaw. Mctest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, 2013.
- (2013) EMNLP
- Richardson, M.¹ Burges, C.J.² Renshaw, E.³

32
- 84959211977
- A dataset for movie description
- A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele. A Dataset for Movie Description. In CVPR, 2015.
- (2015) CVPR
- Rohrbach, A.¹ Rohrbach, M.² Tandon, N.³ Schiele, B.⁴

33
- 84898775239
- Translating video content to natural language descriptions
- M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele. Translating Video Content to Natural Language Descriptions. In ICCV, 2013.
- (2013) ICCV
- Rohrbach, M.¹ Qiu, W.² Titov, I.³ Thater, S.⁴ Pinkal, M.⁵ Schiele, B.⁶

34
- 84898875082
- Subtitle-free Movie to Script Alignment
- P. Sankar, C. V. Jawahar, and A. Zisserman. Subtitle-free Movie to Script Alignment. In BMVC, 2009.
- (2009) BMVC
- Sankar, P.¹ Jawahar, C.V.² Zisserman, A.³

35
- 70450202706
- "Who are you?"-Learning person specific classifiers from video
- J. Sivic, M. Everingham, and A. Zisserman. "Who are you?"-Learning person specific classifiers from video. CVPR, pages 1145-1152, 2009.
- (2009) CVPR , pp. 1145-1152
- Sivic, J.¹ Everingham, M.² Zisserman, A.³

36
- 84964875108
- arXiv:1503.08895
- S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-To-End Memory Networks. In arXiv:1503.08895, 2015.
- (2015) End-To-End Memory Networks
- Sukhbaatar, S.¹ Szlam, A.² Weston, J.³ Fergus, R.⁴

37
- 85009879494
- arXiv:1409.4842
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
- (2014) Going Deeper with Convolutions
- Szegedy, C.¹ Liu, W.² Jia, Y.³ Sermanet, P.⁴ Reed, S.⁵ Anguelov, D.⁶ Erhan, D.⁷ Vanhoucke, V.⁸ Rabinovich, A.⁹

38
- 84959255361
- Book2Movie: Aligning video scenes with book chapters
- M. Tapaswi, M. Bauml, and R. Stiefelhagen. Book2Movie: Aligning Video scenes with Book chapters. In CVPR, 2015.
- (2015) CVPR
- Tapaswi, M.¹ Bauml, M.² Stiefelhagen, R.³

39
- 84977834021
- Aligning plot synopses to videos for story-based retrieval
- M. Tapaswi, M. Bäuml, and R. Stiefelhagen. Aligning Plot Synopses to Videos for Story-based Retrieval. IJMIR, 4:3-16, 2015.
- (2015) IJMIR , vol.4 , pp. 3-16
- Tapaswi, M.¹ Bäuml, M.² Stiefelhagen, R.³

40
- 84973926486
- Learning common sense through visual abstraction
- R. Vedantam, X. Lin, T. Batra, C. L. Zitnick, and D. Parikh. Learning Common Sense Through Visual Abstraction. In ICCV, 2015.
- (2015) ICCV
- Vedantam, R.¹ Lin, X.² Batra, T.³ Zitnick, C.L.⁴ Parikh, D.⁵

41
- 84944069490
- Translating videos to natural language using deep recurrent neural networks
- abs/1312.6229, cs.CV
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. CoRR abs/1312.6229, cs.CV, 2014.
- (2014) CoRR
- Venugopalan, S.¹ Xu, H.² Donahue, J.³ Rohrbach, M.⁴ Mooney, R.J.⁵ Saenko, K.⁶

42
- 84939821075
- arXiv:1411.4555
- O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. In arXiv:1411.4555, 2014.
- (2014) Show and Tell: A Neural Image Caption Generator
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

43
- 84944062514
- Machine comprehension with syntax, frames, and semantics
- H. Wang, M. Bansal, K. Gimpel, and D. McAllester. Machine Comprehension with Syntax, Frames, and Semantics. In ACL, 2015.
- (2015) ACL
- Wang, H.¹ Bansal, M.² Gimpel, K.³ McAllester, D.⁴

44
- 84930622674
- arXiv:1502.05698
- J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In arXiv:1502.05698, 2014.
- (2014) Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
- Weston, J.¹ Bordes, A.² Chopra, S.³ Mikolov, T.⁴

45
- 80053258778
- Corpus-guided sentence generation of natural images
- Y. Yang, C. L. Teo, H. Daumé, III, and Y. Aloimonos. Corpus-guided Sentence Generation of Natural Images. In EMNLP, pages 444-454, 2011.
- (2011) EMNLP , pp. 444-454
- Yang, Y.¹ Teo, C.L.² Daumé, H.³ Aloimonos, Y.⁴

46
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014.
- (2014) TACL
- Young, P.¹ Lai, A.² Hodosh, M.³ Hockenmaier, J.⁴

47
- 84959862697
- Visual madlibs: Fill in the blank image generation and question answering
- L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual Madlibs: Fill in the blank Image Generation and Question Answering. In ICCV, 2015.
- (2015) ICCV
- Yu, L.¹ Park, E.² Berg, A.C.³ Berg, T.L.⁴

48
- 84937964578
- Learning deep features for scene recognition using places database
- B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. In NIPS, 2014.
- (2014) NIPS
- Zhou, B.¹ Lapedriza, A.² Xiao, J.³ Torralba, A.⁴ Oliva, A.⁵

49
- 84973911532
- Aligning books and movies: Towards story-like visual explanations by watching movies and reading books
- Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. In ICCV, 2015.
- (2015) ICCV
- Zhu, Y.¹ Kiros, R.² Zemel, R.³ Salakhutdinov, R.⁴ Urtasun, R.⁵ Torralba, A.⁶ Fidler, S.⁷

50
- 84959182108
- Adopting abstract images for semantic scene understanding
- C. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract images for semantic scene understanding. PAMI, PP, 2014.
- (2014) PAMI
- Zitnick, C.¹ Vedantam, R.² Parikh, D.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.