SCOPUS Information Retrieval Platform

MM 2015 - Proceedings of the 2015 ACM Multimedia Conference

Volume , Issue , 2015, Pages 461-470

Modeling spatial-temporal clues in a hybrid deep learning framework for video classification

(5) Wu, Zuxuan a; Wang, Xi a; Jiang, Yu Gang a; Ye, Hao a; Xue, Xiangyang a

a FUDAN UNIVERSITY (China)

Author keywords

CNN; Deep Learning; Fusion; LSTM; Video Classification

Indexed keywords

BENCHMARKING; FUSION REACTIONS; IMAGE CLASSIFICATION; MOTION ESTIMATION; NEURAL NETWORKS; SEMANTICS; VIDEO STREAMING;

COMPETITIVE PERFORMANCE; CONVOLUTIONAL NEURAL NETWORK; DEEP LEARNING; LONG SHORT TERM MEMORY; LSTM; SPATIAL INFORMATIONS; SPATIAL TEMPORALS; VIDEO CLASSIFICATION;

CLASSIFICATION (OF INFORMATION);

EID: 84962921420 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2733373.2806222 Document Type: Conference Paper

Times cited : (414)

References (54)

1
- 84866678025
- Three things everyone should know to improve object retrieval
- R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
- (2012) CVPR
- Arandjelovic, R.¹ Zisserman, A.²

2
- 14344252374
- Multiple kernel learning, conic duality, and the smo algorithm
- F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the smo algorithm. In ICML, 2004.
- (2004) ICML
- Bach, F.R.¹ Lanckriet, G.R.² Jordan, M.I.³

3
- 84872560515
- Practical recommendations for gradient-based training of deep architectures
- Springer
- Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 2012.
- (2012) Neural Networks: Tricks of the Trade
- Bengio, Y.¹

4
- 84959885402
- Gated feedback recurrent neural networks
- J. Chung, C. Gülcehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. CoRR, 2015.
- (2015) CoRR
- Chung, J.¹ Gülcehre, C.² Cho, K.³ Bengio, Y.⁴

5
- 84055222005
- Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition
- G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE TASLP, 2012.
- (2012) IEEE TASLP
- Dahl, G.E.¹ Yu, D.² Deng, L.³ Acero, A.⁴

6
- 84946802546
- Long-term recurrent convolutional networks for visual recognition and description
- J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, 2014.
- (2014) CoRR
- Donahue, J.¹ Hendricks, L.A.² Guadarrama, S.³ Rohrbach, M.⁴ Venugopalan, S.⁵ Saenko, K.⁶ Darrell, T.⁷

7
- 84911400494
- Rich feature hierarchies for accurate object detection and semantic segmentation
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
- (2014) CVPR
- Girshick, R.¹ Donahue, J.² Darrell, T.³ Malik, J.⁴

8
- 84890543083
- Speech recognition with deep recurrent neural networks
- A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
- (2013) ICASSP
- Graves, A.¹ Mohamed, A.² Hinton, G.E.³

9
- 27744588611
- Framewise phoneme classification with bidirectional lstm and other neural network architectures
- A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 2005.
- (2005) Neural Networks
- Graves, A.¹ Schmidhuber, J.²

10
- 0001466773
- A combined corner and edge detector
- C. Harris and M. J. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
- (1988) Alvey Vision Conference
- Harris, C.¹ Stephens, M.J.²

11
- 84986296521
- University of Amsterdam at THUMOS challenge 2014
- M. Jain, J. van Gemert, and C. G. M. Snoek. University of Amsterdam at THUMOS challenge 2014. In ECCV THUMOS Challenge Workshop, 2014.
- (2014) ECCV THUMOS Challenge Workshop
- Jain, M.¹ Van Gemert, J.² Snoek, C.G.M.³

12
- 84894901576
- Discovering joint audio-visual codewords for video event detection
- I.-H. Jhuo, G. Ye, S. Gao, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Discovering joint audio-visual codewords for video event detection. Machine Vision and Applications, 2014.
- (2014) Machine Vision and Applications
- Jhuo, I.-H.¹ Ye, G.² Gao, S.³ Liu, D.⁴ Jiang, Y.-G.⁵ Lee, D.T.⁶ Chang, S.-F.⁷

13
- 77956507967
- 3d convolutional neural networks for human action recognition
- S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. In ICML, 2010.
- (2010) ICML
- Ji, S.¹ Xu, W.² Yang, M.³ Yu, K.⁴

14
- 84913580146
- Caffe: Convolutional architecture for fast feature embedding
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
- (2014) ACM Multimedia
- Jia, Y.¹ Shelhamer, E.² Donahue, J.³ Karayev, S.⁴ Long, J.⁵ Girshick, R.⁶ Guadarrama, S.⁷ Darrell, T.⁸

15
- 72549099611
- Short-term audio-visual atoms for generic video concept classification
- W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-term audio-visual atoms for generic video concept classification. In ACM Multimedia, 2009.
- (2009) ACM Multimedia
- Jiang, W.¹ Cotton, C.² Chang, S.-F.³ Ellis, D.⁴ Loui, A.⁵

16
- 84986185450
- High-level event recognition in unconstrained videos
- Y.-G. Jiang, S. Bhattacharya, S.-F. Chang, and M. Shah. High-level event recognition in unconstrained videos. IJMIR, 2013.
- (2013) IJMIR
- Jiang, Y.-G.¹ Bhattacharya, S.² Chang, S.-F.³ Shah, M.⁴

17
- 84905052261
- Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
- (2014) THUMOS Challenge: Action Recognition with A Large Number of Classes
- Jiang, Y.-G.¹ Liu, J.² Roshan Zamir, A.³ Toderici, G.⁴ Laptev, I.⁵ Shah, M.⁶ Sukthankar, R.⁷

18
- 79959766559
- Consumer video understanding: A benchmark database and an evaluation of human and machine performance
- Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011.
- (2011) ICMR
- Jiang, Y.-G.¹ Ye, G.² Chang, S.-F.³ Ellis, D.⁴ Loui, A.C.⁵

19
- 84911364368
- Large-scale video classification with convolutional neural networks
- A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- (2014) CVPR
- Karpathy, A.¹ Toderici, G.² Shetty, S.³ Leung, T.⁴ Sukthankar, R.⁵ Fei-Fei, L.⁶

20
- 84898426452
- A spatio-temporal descriptor based on 3d-gradients
- A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
- (2008) BMVC
- Klaser, A.¹ Marszalek, M.² Schmid, C.³

21
- 79955848223
- Lp-norm multiple kernel learning
- M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 2011.
- (2011) The Journal of Machine Learning Research
- Kloft, M.¹ Brefeld, U.² Sonnenburg, S.³ Zien, A.⁴

22
- 84876231242
- Imagenet classification with deep convolutional neural networks
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
- (2012) NIPS
- Krizhevsky, A.¹ Sutskever, I.² Hinton, G.E.³

23
- 84962801975
- Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition
- Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, 2014.
- (2014) CoRR
- Lan, Z.¹ Lin, M.² Li, X.³ Hauptmann, A.G.⁴ Raj, B.⁵

24
- 84962874838
- On space-time interest points
- I. Laptev. On space-time interest points. IJCV, 2007.
- (2007) IJCV
- Laptev, I.¹

25
- 55149112799
- Expandable data-driven graphical modeling of human actions based on salient postures
- W. Li, Z. Zhang, and Z. Liu. Expandable data-driven graphical modeling of human actions based on salient postures. IEEE TCSVT, 2008.
- (2008) IEEE TCSVT
- Li, W.¹ Zhang, Z.² Liu, Z.³

26
- 84887331855
- Sample-specific late fusion for visual category recognition
- D. Liu, K.-T. Lai, G. Ye, M.-S. Chen, and S.-F. Chang. Sample-specific late fusion for visual category recognition. In CVPR, 2013.
- (2013) CVPR
- Liu, D.¹ Lai, K.-T.² Ye, G.³ Chen, M.-S.⁴ Chang, S.-F.⁵

27
- 3042535216
- Distinctive image features from scale-invariant keypoints
- D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
- (2004) IJCV
- Lowe, D.G.¹

28
- 84905653531
- Reduced analytic dependency modeling: Robust fusion for visual recognition
- A. J. Ma and P. C. Yuen. Reduced analytic dependency modeling: Robust fusion for visual recognition. IJCV, 2014.
- (2014) IJCV
- Ma, A.J.¹ Yuen, P.C.²

29
- 80053437179
- Multimodal deep learning
- J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, 2011.
- (2011) ICML
- Ngiam, J.¹ Khosla, A.² Kim, M.³ Nam, J.⁴ Lee, H.⁵ Ng, A.⁶

30
- 84898791167
- Action and event recognition with fisher vectors on a compact feature set
- D. Oneata, J. Verbeek, C. Schmid, et al. Action and event recognition with fisher vectors on a compact feature set. In ICCV, 2013.
- (2013) ICCV
- Oneata, D.¹ Verbeek, J.² Schmid, C.³

31
- 84962897886
- Video (language) modeling: A baseline for generative models of natural videos
- M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: A baseline for generative models of natural videos. CoRR, 2014.
- (2014) CoRR
- Ranzato, M.¹ Szlam, A.² Bruna, J.³ Mathieu, M.⁴ Collobert, R.⁵ Chopra, S.⁶

32
- 84906341074
- CNN features off-the-shelf: An astounding baseline for recognition
- A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. CoRR, 2014.
- (2014) CoRR
- Razavian, A.S.¹ Azizpour, H.² Sullivan, J.³ Carlsson, S.⁴

33
- 84937862424
- Two-stream convolutional networks for action recognition in videos
- K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
- (2014) NIPS
- Simonyan, K.¹ Zisserman, A.²

34
- 84933585162
- Very deep convolutional networks for large-scale image recognition
- K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
- (2014) CoRR
- Simonyan, K.¹ Zisserman, A.²

35
- 84893702065
- UCF101: A dataset of 101 human actions classes from videos in the wild
- K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, 2012.
- (2012) CoRR
- Soomro, K.¹ Zamir, A.R.² Shah, M.³

36
- 84962900096
- Unsupervised learning of video representations using LSTMs
- N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, 2015.
- (2015) CoRR
- Srivastava, N.¹ Mansimov, E.² Salakhutdinov, R.³

37
- 84877724347
- Multimodal learning with deep boltzmann machines
- N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
- (2012) NIPS
- Srivastava, N.¹ Salakhutdinov, R.²

38
- 84941122549
- Going deeper with convolutions
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. CoRR, 2014.
- (2014) CoRR
- Szegedy, C.¹ Liu, W.² Jia, Y.³ Sermanet, P.⁴ Reed, S.⁵ Anguelov, D.⁶ Erhan, D.⁷ Vanhoucke, V.⁸ Rabinovich, A.⁹

39
- 84866658784
- Learning latent temporal structure for complex event detection
- K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
- (2012) CVPR
- Tang, K.¹ Fei-Fei, L.² Koller, D.³

40
- 84962823150
- C3d: Generic features for video analysis
- D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: Generic features for video analysis. CoRR, 2014.
- (2014) CoRR
- Tran, D.¹ Bourdev, L.² Fergus, R.³ Torresani, L.⁴ Paluri, M.⁵

41
- 84856105124
- Conditional random fields for activity recognition
- D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional random fields for activity recognition. In AAMAS, 2007.
- (2007) AAMAS
- Vail, D.L.¹ Veloso, M.M.² Lafferty, J.D.³

42
- 77953196456
- Multiple kernels for object detection
- A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
- (2009) ICCV
- Vedaldi, A.¹ Gulshan, V.² Varma, M.³ Zisserman, A.⁴

43
- 84944069490
- Translating videos to natural language using deep recurrent neural networks
- S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. CoRR, 2014.
- (2014) CoRR
- Venugopalan, S.¹ Xu, H.² Donahue, J.³ Rohrbach, M.⁴ Mooney, R.J.⁵ Saenko, K.⁶

44
- 84898805910
- Action recognition with improved trajectories
- H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
- (2013) ICCV
- Wang, H.¹ Schmid, C.²

45
- 84898890371
- Evaluation of local spatio-temporal features for action recognition
- H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
- (2009) BMVC
- Wang, H.¹ Ullah, M.M.² Klaser, A.³ Laptev, I.⁴ Schmid, C.⁵

46
- 70450216856
- Max-margin hidden conditional random fields for human action recognition
- Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.
- (2009) CVPR
- Wang, Y.¹ Mori, G.²

47
- 84913586072
- Exploring inter-feature and inter-class relationships with deep neural networks for video classification
- Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM Multimedia, 2014.
- (2014) ACM Multimedia
- Wu, Z.¹ Jiang, Y.-G.² Wang, J.³ Pu, J.⁴ Xue, X.⁵

48
- 84898834622
- Feature weighting via optimal thresholding for video analysis
- Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
- (2013) ICCV
- Xu, Z.¹ Yang, Y.² Tsang, I.³ Sebe, N.⁴ Hauptmann, A.⁵

49
- 84962874851
- Efficient online learning for multitask feature selection
- H. Yang, M. R. Lyu, and I. King. Efficient online learning for multitask feature selection. ACM SIGKDD, 2013.
- (2013) ACM SIGKDD
- Yang, H.¹ Lyu, M.R.² King, I.³

50
- 84866712367
- Robust late fusion with rank minimization
- G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang. Robust late fusion with rank minimization. In CVPR, 2012.
- (2012) CVPR
- Ye, G.¹ Liu, D.² Jhuo, I.-H.³ Chang, S.-F.⁴

51
- 84962376858
- Evaluating two-stream cnn for video classification
- H. Ye, Z. Wu, R.-W. Zhao, X. Wang, Y.-G. Jiang, and X. Xue. Evaluating two-stream cnn for video classification. In ICMR, 2015.
- (2015) ICMR
- Ye, H.¹ Wu, Z.² Zhao, R.-W.³ Wang, X.⁴ Jiang, Y.-G.⁵ Xue, X.⁶

52
- 80054879214
- Knowledge based activity recognition with dynamic Bayesian network
- Z. Zeng and Q. Ji. Knowledge based activity recognition with dynamic Bayesian network. In ECCV, 2010.
- (2010) ECCV
- Zeng, Z.¹ Ji, Q.²

53
- 84962833704
- Exploiting image-trained cnn architectures for unconstrained video classification
- S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained cnn architectures for unconstrained video classification. CoRR, 2015.
- (2015) CoRR
- Zha, S.¹ Luisier, F.² Andrews, W.³ Srivastava, N.⁴ Salakhutdinov, R.⁵

54
- 33846580425
- Local features and kernels for classification of texture and object categories: A comprehensive study
- J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.
- (2007) IJCV
- Zhang, J.¹ Marszalek, M.² Lazebnik, S.³ Schmid, C.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.