SCOPUS Information Retrieval Platform

International Journal of Computer Vision

Volume 124, Issue 3, 2017, Pages 409-421

Uncovering the Temporal Context for Video Question Answering

(4) Zhu, Linchao (a); Xu, Zhongwen (a); Yang, Yi (a); Hauptmann, Alexander G. (b)

a UNIVERSITY OF TECHNOLOGY SYDNEY (Australia)

b CARNEGIE MELLON UNIVERSITY (United States)

Author keywords

Cross media; Video prediction; Video question answering; Video sequence modeling

Indexed keywords

CHANNEL CODING; VIDEO RECORDING;

CROSS-MEDIA; MULTIPLE CHOICE QUESTIONS; QUESTION ANSWERING; TEMPORAL DOMAIN; TEMPORAL STRUCTURES; VIDEO CONTENTS; VIDEO PREDICTION; VIDEO SEQUENCES;

RECURRENT NEURAL NETWORKS;

EID: 85023773582 PISSN: 09205691 EISSN: 15731405 Source Type: Journal
DOI: 10.1007/s11263-017-1033-7 Document Type: Article

Times cited : (213)

References (64)

1
- 84973890960
- Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In International conference on computer vision (ICCV).
- (2015) VQA: Visual question answering. In International conference on computer vision (ICCV)
- Antol, S.¹ Agrawal, A.² Lu, J.³ Mitchell, M.⁴ Batra, D.⁵ Lawrence Zitnick, C.⁶ Parikh, D.⁷

2
- 49949092526
- Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The semantic web (pp. 722–735). Springer.
- (2007) Dbpedia: A nucleus for a web of open data. In The semantic web (pp. 722–735). Springer
- Auer, S.¹ Bizer, C.² Kobilarov, G.³ Lehmann, J.⁴ Cyganiak, R.⁵ Ives, Z.⁶

3
- 85083953689
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In International conference on learning representations (ICLR).
- (2015) Neural machine translation by jointly learning to align and translate. In International conference on learning representations (ICLR)
- Bahdanau, D.¹ Cho, K.² Bengio, Y.³

4
- 85027982212
- Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2015). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).
- (2015) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP)
- Cho, K.¹ Van Merrienboer, B.² Gulcehre, C.³ Bahdanau, D.⁴ Bougares, F.⁵ Schwenk, H.⁶ Bengio, Y.⁷

5
- 84939821078
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
- Chung, J.¹ Gulcehre, C.² Cho, K.³ Bengio, Y.⁴

6
- 84888340666
- Torch7: A matlab-like environment for machine learning
- Collobert, R., Kavukcuoglu, K., & Farabet, C. (2011). Torch7: A matlab-like environment for machine learning. In Conference on neural information processing systems workshops (NIPS workshops).
- (2011) In Conference on neural information processing systems workshops (NIPS workshops)
- Collobert, R.¹ Kavukcuoglu, K.² Farabet, C.³

7
- 84959236502
- Long-term recurrent convolutional networks for visual recognition and description
- Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Donahue, J.¹ Anne Hendricks, L.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

8
- 84906928552
- Elliott, D., & Keller, F. (2014). Comparing automatic evaluation measures for image description. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).
- (2014) Comparing automatic evaluation measures for image description. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL)
- Elliott, D.¹ Keller, F.²

9
- 84898958665
- DeViSE: A deep visual-semantic embedding model
- Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In Conference on neural information processing systems (NIPS).
- (2013) In Conference on neural information processing systems (NIPS)
- Frome, A.¹ Corrado, G.S.² Shlens, J.³ Bengio, S.⁴ Dean, J.⁵ Mikolov, T.⁶

10
- 84959507605
- Recognizing an action using its name: A knowledge-based approach
- Gan, C., Yang, Y., Zhu, L., Zhao, D., & Zhuang, Y. (2016). Recognizing an action using its name: A knowledge-based approach. International Journal of Computer Vision (IJCV), 120, 61–77.
- (2016) International Journal of Computer Vision (IJCV) , vol.120 , pp. 61-77
- Gan, C.¹ Yang, Y.² Zhu, L.³ Zhao, D.⁴ Zhuang, Y.⁵

11
- 84965148420
- Are you talking to a machine?
- Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In Conference on neural information processing systems (NIPS).
- (2015) Dataset and methods for multilingual image question answering. In Conference on neural information processing systems (NIPS)
- Gao, H.¹ Mao, J.² Zhou, J.³ Huang, Z.⁴ Wang, L.⁵ Xu, W.⁶

12
- 84911400494
- Rich feature hierarchies for accurate object detection and semantic segmentation
- Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Conference on computer vision and pattern recognition (CVPR).
- (2014) In Conference on computer vision and pattern recognition (CVPR)
- Girshick, R.¹ Donahue, J.² Darrell, T.³ Malik, J.⁴

13
- 84894905366
- A multi-view embedding space for modeling internet images, tags, and their semantics
- Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision (IJCV), 106(2), 210–233.
- (2014) International Journal of Computer Vision (IJCV) , vol.106 , Issue.2 , pp. 210-233
- Gong, Y.¹ Ke, Q.² Isard, M.³ Lazebnik, S.⁴

14
- 0031573117
- Long short-term memory
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- (1997) Neural Computation , vol.9 , Issue.8 , pp. 1735-1780
- Hochreiter, S.¹ Schmidhuber, J.²

15
- 84883394520
- Framing image description as a ranking task: Data, models and evaluation metrics
- Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47, 853–899.
- (2013) Journal of Artificial Intelligence Research (JAIR) , vol.47 , pp. 853-899
- Hodosh, M.¹ Young, P.² Hockenmaier, J.³

16
- 84969584486
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).
- (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML)
- Ioffe, S.¹ Szegedy, C.²

17
- 84990032802
- Revisiting visual question answering baselines
- Jabri, A., Joulin, A., & van der Maaten, L. (2016). Revisiting visual question answering baselines. In European conference on computer vision (ECCV). Springer.
- (2016) In European conference on computer vision (ECCV). Springer
- Jabri, A.¹ Joulin, A.² van der Maaten, L.³

18
- 84946734827
- Deep visual-semantic alignments for generating image descriptions
- Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Karpathy, A.¹ Fei-Fei, L.²

19
- 84965153327
- Skip-thought vectors
- Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In Conference on neural information processing systems (NIPS).
- (2015) In Conference on neural information processing systems (NIPS)
- Kiros, R.¹ Zhu, Y.² Salakhutdinov, R.R.³ Zemel, R.⁴ Urtasun, R.⁵ Torralba, A.⁶ Fidler, S.⁷

20
- 85027965008
- Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).
- (2003) Accurate unlexicalized parsing. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL)
- Klein, D.¹ Manning, C.D.²

21
- 84876231242
- ImageNet classification with deep convolutional neural networks
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Conference on neural information processing systems (NIPS).
- (2012) In Conference on neural information processing systems (NIPS)
- Krizhevsky, A.¹ Sutskever, I.² Hinton, G.E.³

22
- 80052901011
- Baby talk: Understanding and generating image descriptions
- Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Baby talk: Understanding and generating image descriptions. In Conference on computer vision and pattern recognition (CVPR).
- (2011) In Conference on computer vision and pattern recognition (CVPR)
- Kulkarni, G.¹ Premraj, V.² Dhar, S.³ Li, S.⁴ Choi, Y.⁵ Berg, A.C.⁶ Berg, T.L.⁷

23
- 84970028761
- Lebret, R., Pinheiro, P. O., & Collobert, R. (2015). Phrase-based image captioning. In International conference on machine learning (ICML).
- (2015) Phrase-based image captioning. In International conference on machine learning (ICML)
- Lebret, R.¹ Pinheiro, P.O.² Collobert, R.³

24
- 85027963396
- Microsoft COCO: Common objects in context
- Lin, T.-Y., Maire, M., Belongie, S., Perona, P., Ramanan, D., Hays, J., et al. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV).
- (2014) In European conference on computer vision (ECCV)

25
- 84959227898
- Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks
- Lin, X., & Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Lin, X.¹ Parikh, D.²

26
- 84937822746
- Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In Conference on neural information processing systems (NIPS).
- (2014) A multi-world approach to question answering about real-world scenes based on uncertain input. In Conference on neural information processing systems (NIPS)
- Malinowski, M.¹ Fritz, M.²

27
- 84973896625
- Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In International conference on computer vision (ICCV).
- (2015) Ask your neurons: A neural-based approach to answering questions about images. In International conference on computer vision (ICCV)
- Malinowski, M.¹ Rohrbach, M.² Fritz, M.³

28
- 85011853174
- Generation and comprehension of unambiguous object descriptions
- Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., & Murphy, K. (2015). Generation and comprehension of unambiguous object descriptions. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Mao, J.¹ Huang, J.² Toshev, A.³ Camburu, O.⁴ Yuille, A.L.⁵ Murphy, K.⁶

29
- 84986312889
- MED. (2014). TRECVID MED 14. http://nist.gov/itl/iad/mig/med14.cfm.
- (2014) TRECVID MED 14. http://nist.gov/itl/iad/mig/med14.cfm

30
- 84898956512
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Conference on neural information processing systems (NIPS).
- (2013) Distributed representations of words and phrases and their compositionality. In Conference on neural information processing systems (NIPS)
- Mikolov, T.¹ Sutskever, I.² Chen, K.³ Corrado, G.S.⁴ Dean, J.⁵

31
- 84936796885
- Large scale retrieval and generation of image descriptions
- Ordonez, V., Han, X., Kuznetsova, P., Kulkarni, G., Mitchell, M., Yamaguchi, K., et al. (2015). Large scale retrieval and generation of image descriptions. International Journal of Computer Vision (IJCV), 119, 46–59.
- (2015) International Journal of Computer Vision (IJCV) , vol.119 , pp. 46-59
- Ordonez, V.¹ Han, X.² Kuznetsova, P.³ Kulkarni, G.⁴ Mitchell, M.⁵ Yamaguchi, K.⁶

32
- 84986290372
- Hierarchical recurrent neural encoder for video representation with application to captioning
- Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016). Hierarchical recurrent neural encoder for video representation with application to captioning. In Conference on computer vision and pattern recognition (CVPR).
- (2016) In Conference on computer vision and pattern recognition (CVPR)
- Pan, P.¹ Xu, Z.² Yang, Y.³ Wu, F.⁴ Zhuang, Y.⁵

33
- 0013363097
- Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).
- (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL)
- Papineni, K.¹ Roukos, S.² Ward, T.³ Zhu, W.J.⁴

34
- 84898785648
- Grounding action descriptions in videos
- Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., & Pinkal, M. (2013). Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL), 1, 25–36.
- (2013) Transactions of the Association for Computational Linguistics (TACL) , vol.1 , pp. 25-36
- Regneri, M.¹ Rohrbach, M.² Wetzel, D.³ Thater, S.⁴ Schiele, B.⁵ Pinkal, M.⁶

35
- 84965170394
- Exploring models and data for image question answering
- Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In Conference on neural information processing systems (NIPS).
- (2015) In Conference on neural information processing systems (NIPS)
- Ren, M.¹ Kiros, R.² Zemel, R.³

36
- 84959211977
- A dataset for movie description
- Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Rohrbach, A.¹ Rohrbach, M.² Tandon, N.³ Schiele, B.⁴

37
- 84898775239
- Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. In International conference on computer vision (ICCV).
- (2013) Translating video content to natural language descriptions. In International conference on computer vision (ICCV)
- Rohrbach, M.¹ Qiu, W.² Titov, I.³ Thater, S.⁴ Pinkal, M.⁵ Schiele, B.⁶

38
- 84947041871
- ImageNet large scale visual recognition challenge
- Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.
- (2015) International Journal of Computer Vision (IJCV) , vol.115 , Issue.3 , pp. 211-252
- Russakovsky, O.¹ Deng, J.² Su, H.³ Krause, J.⁴ Satheesh, S.⁵ Ma, S.⁶

39
- 84937862424
- Two-stream convolutional networks for action recognition in videos
- Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Conference on neural information processing systems (NIPS).
- (2014) In Conference on neural information processing systems (NIPS)
- Simonyan, K.¹ Zisserman, A.²

40
- 84969544782
- Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In International conference on machine learning (ICML).
- (2015) Unsupervised learning of video representations using LSTMs. In International conference on machine learning (ICML)
- Srivastava, N.¹ Mansimov, E.² Salakhutdinov, R.³

41
- 84928547704
- Sequence to sequence learning with neural networks
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Conference on neural information processing systems (NIPS).
- (2014) In Conference on neural information processing systems (NIPS)
- Sutskever, I.¹ Vinyals, O.² Le, Q.V.³

42
- 84937522268
- Going deeper with convolutions
- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)

43
- 84986296727
- Movieqa: Understanding stories in movies through question-answering
- Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). Movieqa: Understanding stories in movies through question-answering. In Conference on computer vision and pattern recognition (CVPR). arXiv preprint arXiv:1512.02902.
- (2016) In Conference on computer vision and pattern recognition (CVPR). arXiv preprint arXiv:1512.02902
- Tapaswi, M.¹ Zhu, Y.² Stiefelhagen, R.³ Torralba, A.⁴ Urtasun, R.⁵ Fidler, S.⁶

44
- 84943546021
- Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude.
- (2012) Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude
- Tieleman, T.¹ Hinton, G.²

45
- 84973865953
- Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In International conference on computer vision (ICCV).
- (2015) Learning spatiotemporal features with 3D convolutional networks. In International conference on computer vision (ICCV)
- Tran, D.¹ Bourdev, L.² Fergus, R.³ Torresani, L.⁴ Paluri, M.⁵

46
- 84901405262
- Joint video and text parsing for understanding events and answering queries
- Tu, K., Meng, M., Lee, M. W., Choe, T. E., & Zhu, S. C. (2014). Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2), 42–70.
- (2014) IEEE MultiMedia , vol.21 , Issue.2 , pp. 42-70
- Tu, K.¹ Meng, M.² Lee, M.W.³ Choe, T.E.⁴ Zhu, S.C.⁵

47
- 57249084011
- Visualizing data using t-SNE
- Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.
- (2008) Journal of Machine Learning Research (JMLR) , vol.9 , pp. 2579-2605
- Van der Maaten, L.¹ Hinton, G.²

48
- 84956980995
- Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Conference on computer vision and pattern recognition (CVPR).
- (2015) CIDEr: Consensus-based image description evaluation. In Conference on computer vision and pattern recognition (CVPR)
- Vedantam, R.¹ Lawrence Zitnick, C.² Parikh, D.³

49
- 84973882730
- Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence – video to text. In International conference on computer vision (ICCV).
- (2015) Sequence to sequence – video to text. In International conference on computer vision (ICCV)
- Venugopalan, S.¹ Rohrbach, M.² Donahue, J.³ Mooney, R.⁴ Darrell, T.⁵ Saenko, K.⁶

50
- 84946747440
- Show and tell: A neural image caption generator
- Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Vinyals, O.¹ Toshev, A.² Bengio, S.³ Erhan, D.⁴

51
- 85028014257
- Vondrick, C., Pirsiavash, H., & Torralba, A. (2015). Anticipating the future by watching unlabeled video. In Conference on computer vision and pattern recognition (CVPR).
- (2015) Anticipating the future by watching unlabeled video. In Conference on computer vision and pattern recognition (CVPR)
- Vondrick, C.¹ Pirsiavash, H.² Torralba, A.³

52
- 84876945537
- Dense trajectories and motion boundary descriptors for action recognition
- Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103(1), 60–79.
- (2013) International Journal of Computer Vision (IJCV) , vol.103 , Issue.1 , pp. 60-79
- Wang, H.¹ Kläser, A.² Schmid, C.³ Liu, C.L.⁴

53
- 84986320870
- Wu, Q., Wang, P., Shen, C., Dick, A., & van den Hengel, A. (2016). Ask me anything: Free-form visual question answering based on knowledge from external sources. In Conference on computer vision and pattern recognition (CVPR).
- (2016) Ask me anything: Free-form visual question answering based on knowledge from external sources. In Conference on computer vision and pattern recognition (CVPR)
- Wu, Q.¹ Wang, P.² Shen, C.³ Dick, A.⁴ van den Hengel, A.⁵

54
- 84970002232
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., & Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML).
- (2015) Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML)
- Xu, K.¹ Ba, J.² Kiros, R.³ Cho, K.⁴ Courville, A.⁵ Salakhutdinov, R.⁶ Bengio, Y.⁷

55
- 84959226659
- A discriminative CNN video representation for event detection
- Xu, Z., Yang, Y., & Hauptmann, A. G. (2015b). A discriminative CNN video representation for event detection. In Conference on computer vision and pattern recognition (CVPR).
- (2015) In Conference on computer vision and pattern recognition (CVPR)
- Xu, Z.¹ Yang, Y.² Hauptmann, A.G.³

56
- 84999792442
- Image classification by cross-media active learning with privileged information
- Yan, Y., Nie, F., Li, W., Gao, C., Yang, Y., & Xu, D. (2016). Image classification by cross-media active learning with privileged information. IEEE Transactions on Multimedia, 18(12), 2494–2502.
- (2016) IEEE Transactions on Multimedia , vol.18 , Issue.12 , pp. 2494-2502
- Yan, Y.¹ Nie, F.² Li, W.³ Gao, C.⁴ Yang, Y.⁵ Xu, D.⁶

57
- 72449143147
- Yang, Y., Xu, D., Nie, F., Luo, J., & Zhuang, Y. (2009). Ranking with local regression and global alignment for cross media retrieval. In Proceedings of the 17th ACM international conference on multimedia (pp. 175–184). ACM.
- (2009) Ranking with local regression and global alignment for cross media retrieval. In Proceedings of the 17th ACM international conference on multimedia (pp. 175–184). ACM
- Yang, Y.¹ Xu, D.² Nie, F.³ Luo, J.⁴ Zhuang, Y.⁵

58
- 84973884896
- Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. In International conference on computer vision (ICCV).
- (2015) Describing videos by exploiting temporal structure. In International conference on computer vision (ICCV)
- Yao, L.¹ Torabi, A.² Cho, K.³ Ballas, N.⁴ Pal, C.⁵ Larochelle, H.⁶ Courville, A.⁷

59
- 84906494296
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions
- Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2, 67–78.
- (2014) Transactions of the Association for Computational Linguistics (TACL) , vol.2 , pp. 67-78
- Young, P.¹ Lai, A.² Hodosh, M.³ Hockenmaier, J.⁴

60
- 84897743886
- Yu, H., & Siskind, J. M. (2013). Grounded language learning from video described with sentences. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).
- (2013) Grounded language learning from video described with sentences. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL)
- Yu, H.¹ Siskind, J.M.²

61
- 84973892583
- Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual Madlibs: Fill in the blank image generation and question answering. In International conference on computer vision (ICCV).
- (2015) Visual Madlibs: Fill in the blank image generation and question answering. In International conference on computer vision (ICCV)
- Yu, L.¹ Park, E.² Berg, A.C.³ Berg, T.L.⁴

62
- 84944053926
- Recurrent neural network regularization
- Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
- (2014) arXiv preprint arXiv:1409.2329
- Zaremba, W.¹ Sutskever, I.² Vinyals, O.³

63
- 84986275767
- Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7w: Grounded question answering in images. In Conference on computer vision and pattern recognition (CVPR).
- (2016) Visual7w: Grounded question answering in images. In Conference on computer vision and pattern recognition (CVPR)
- Zhu, Y.¹ Groth, O.² Bernstein, M.³ Fei-Fei, L.⁴

64
- 84973911532
- Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International conference on computer vision (ICCV).
- (2015) Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International conference on computer vision (ICCV)
- Zhu, Y.¹ Kiros, R.² Zemel, R.³ Salakhutdinov, R.⁴ Urtasun, R.⁵ Torralba, A.⁶ Fidler, S.⁷

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.