SCOPUS Information Search Platform

Advances in Neural Information Processing Systems

Volume 2017-December, 2017, Pages 1732-1742

Train longer, generalize better: Closing the generalization gap in large batch training of neural networks

(3) Hoffer, Elad (a); Hubara, Itay (a); Soudry, Daniel (a)

(a) TECHNION ISRAEL INSTITUTE OF TECHNOLOGY (Israel)

Author keywords

[No Author keywords available]

Indexed keywords

STOCHASTIC MODELS; STOCHASTIC SYSTEMS;

ADDITIONAL EXPERIMENTS; COMMON PRACTICES; DIFFUSION BEHAVIOR; GENERALIZATION PERFORMANCE; LEARNING MODELS; NOVEL ALGORITHM; STATISTICAL MODELING; STOCHASTIC GRADIENT DESCENT;

DEEP LEARNING;

EID: 85046996830
PISSN: 10495258
EISSN: None
Source Type: Conference Proceeding
DOI: None
Document Type: Conference Paper

Times cited : (692)

References (46)

1
- 84971463350
- arXiv preprint
- Amodei, D., Anubhai, R., Battenberg, E., et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.
- (2015) Deep Speech 2: End-to-end Speech Recognition in English and Mandarin
- Amodei, D.¹ Anubhai, R.² Battenberg, E.³

2
- 84904136037
- Large-scale machine learning with stochastic gradient descent
- Springer
- Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177-186. Springer, 2010.
- (2010) Proceedings of COMPSTAT'2010 , pp. 177-186
- Bottou, L.¹

3
- 0040307478
- Anomalous diffusion in disordered media: Statistical mechanisms, models and physical applications
- Bouchaud, J. P. and Georges, A. Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications. Physics reports, 195:127-293, 1990.
- (1990) Physics Reports , vol.195 , pp. 127-293
- Bouchaud, J.P.¹ Georges, A.²

4
- 0023403821
- Anomalous diffusion in random media of any dimensionality
- Bouchaud, J. P. and Comtet, A. Anomalous diffusion in random media of any dimensionality. J. Physique, 48: 1445-1450, 1987.
- (1987) J. Physique , vol.48 , pp. 1445-1450
- Bouchaud, J.P.¹ Comtet, A.²

5
- 34147142335
- Statistics of critical points of Gaussian fields on large-dimensional spaces
- Bray, A. J. and Dean, D. S. Statistics of critical points of Gaussian fields on large-dimensional spaces. Physical Review Letters, 98(15):1-5, 2007.
- (2007) Physical Review Letters , vol.98 , Issue.15 , pp. 1-5
- Bray, A.J.¹ Dean, D.S.²

6
- 84954310140
- The loss surfaces of multilayer networks
- Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The Loss Surfaces of Multilayer Networks. AISTATS15, 38, 2015.
- (2015) AISTATS15 , vol.38
- Choromanska, A.¹ Henaff, M.² Mathieu, M.³ Arous, G.B.⁴ LeCun, Y.⁵

7
- 85014228960
- arXiv preprint
- Das, D., Avancha, S., Mudigere, D., et al. Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709, 2016.
- (2016) Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
- Das, D.¹ Avancha, S.² Mudigere, D.³

8
- 84945969537
- Rmsprop and equilibrated adaptive learning rates for non-convex optimization
- Dauphin, Y., de Vries, H., Chung, J., and Bengio, Y. Rmsprop and equilibrated adaptive learning rates for non-convex optimization. CoRR, abs/1502.04390, 2015.
- (2015) CoRR , abs/1502.04390
- Dauphin, Y.¹ De Vries, H.² Chung, J.³ Bengio, Y.⁴

9
- 84922386830
- Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
- Dauphin, Y., Pascanu, R., and Gulcehre, C. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pp. 1-9, 2014.
- (2014) NIPS , pp. 1-9
- Dauphin, Y.¹ Pascanu, R.² Gulcehre, C.³

10
- 84877760312
- Large scale distributed deep networks
- Dean, J., Corrado, G., Monga, R., et al. Large scale distributed deep networks. In NIPS, pp. 1223-1231, 2012.
- (2012) NIPS , pp. 1223-1231
- Dean, J.¹ Corrado, G.² Monga, R.³

11
- 72249100259
- ImageNet: A large-scale hierarchical image database
- Deng, J., Dong, W., Socher, R., et al. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- (2009) CVPR09
- Deng, J.¹ Dong, W.² Socher, R.³

12
- 85046996804
- arXiv preprint
- Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
- (2017) Sharp Minima can Generalize for Deep Nets
- Dinh, L.¹ Pascanu, R.² Bengio, S.³ Bengio, Y.⁴

13
- 80052250414
- Adaptive subgradient methods for online learning and stochastic optimization
- Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
- (2011) Journal of Machine Learning Research , vol.12 , Issue.JUL , pp. 2121-2159
- Duchi, J.¹ Hazan, E.² Singer, Y.³

14
- 0040716207
- Multidimensional random walks in random environments with subclassical limiting behavior
- Durrett, R. Multidimensional random walks in random environments with subclassical limiting behavior. Communications in Mathematical Physics, 104(1):87-102, 1986.
- (1986) Communications in Mathematical Physics , vol.104 , Issue.1 , pp. 87-102
- Durrett, R.¹

15
- 84998858755
- Escaping from saddle points-online stochastic gradient for tensor decomposition
- Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points-online stochastic gradient for tensor decomposition. In COLT, pp. 797-842, 2015.
- (2015) COLT , pp. 797-842
- Ge, R.¹ Huang, F.² Jin, C.³ Yuan, Y.⁴

16
- 0001219859
- Regularization theory and neural networks architectures
- Girosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural computation, 7(2):219-269, 1995.
- (1995) Neural Computation , vol.7 , Issue.2 , pp. 219-269
- Girosi, F.¹ Jones, M.² Poggio, T.³

17
- 85033703452
- arXiv preprint
- Goyal, P., Dollár, P., Girshick, R., et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- (2017) Accurate, Large Minibatch Sgd: Training Imagenet in 1 Hour
- Goyal, P.¹ Dollár, P.² Girshick, R.³

18
- 85015190946
- Train faster, generalize better: Stability of stochastic gradient descent
- Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. ICML, pp. 1-24, 2016.
- (2016) ICML , pp. 1-24
- Hardt, M.¹ Recht, B.² Singer, Y.³

19
- 84986274465
- Deep residual learning for image recognition
- He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
- (2016) Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 770-778
- He, K.¹ Zhang, X.² Ren, S.³ Sun, J.⁴

20
- 84964923476
- arXiv preprint
- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Ioffe, S.¹ Szegedy, C.²

21
- 85015249548
- On large-batch training for deep learning: Generalization gap and sharp minima
- Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
- (2017) ICLR
- Keskar, N.S.¹ Mudigere, D.² Nocedal, J.³ Smelyanskiy, M.⁴ Tang, P.T.P.⁵

22
- 84941620184
- arXiv preprint
- Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (2014) Adam: A Method for Stochastic Optimization
- Kingma, D.¹ Ba, J.²

23
- 77956002520
- Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.
- (2009) Learning Multiple Layers of Features from Tiny Images
- Krizhevsky, A.¹

24
- 84932095919
- arXiv preprint
- Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
- (2014) One Weird Trick for Parallelizing Convolutional Neural Networks
- Krizhevsky, A.¹

25
- 34249873596
- Efficient backprop in neural networks: Tricks of the trade
- (Orr, G. and Müller, K., eds.)
- LeCun, Y., Bottou, L., and Orr, G. Efficient backprop in neural networks: Tricks of the trade (Orr, G. and Müller, K., eds.). Lecture Notes in Computer Science, 1524, 1998a.
- (1998) Lecture Notes in Computer Science , vol.1524
- LeCun, Y.¹ Bottou, L.² Orr, G.³

26
- 0032203257
- Gradient-based learning applied to document recognition
- LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998b.
- (1998) Proceedings of the IEEE , vol.86 , Issue.11 , pp. 2278-2324
- LeCun, Y.¹ Bottou, L.² Bengio, Y.³ Haffner, P.⁴

27
- 85046995364
- PhD thesis, Intel
- Li, M. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Intel, 2017.
- (2017) Scaling Distributed Machine Learning with System and Algorithm Co-design
- Li, M.¹

28
- 84907022486
- Efficient mini-batch training for stochastic optimization
- ACM
- Li, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661-670. ACM, 2014.
- (2014) Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , pp. 661-670
- Li, M.¹ Zhang, T.² Chen, Y.³ Smola, A.J.⁴

29
- 84994358876
- arXiv preprint
- Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- (2015) Effective Approaches to Attention-based Neural Machine Translation
- Luong, M.-T.¹ Pham, H.² Manning, C.D.³

30
- 0040982291
- Random walk in a random environment and 1/f noise
- Marinari, E., Parisi, G., Ruelle, D., and Windey, P. Random Walk in a Random Environment and 1/f Noise. Physical Review Letters, 50(1):1223-1225, 1983.
- (1983) Physical Review Letters , vol.50 , Issue.1 , pp. 1223-1225
- Marinari, E.¹ Parisi, G.² Ruelle, D.³ Windey, P.⁴

31
- 84924051598
- Human-level control through deep reinforcement learning
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
- (2015) Nature , vol.518 , Issue.7540 , pp. 529-533
- Mnih, V.¹ Kavukcuoglu, K.² Silver, D.³

32
- 0004135065
- 2nd edition
- Montavon, G., Orr, G., and Müller, K.-R. Neural Networks: Tricks of the Trade. 2nd edition, 2012. ISBN 978-3-642-35288-1.
- (2012) Neural Networks: Tricks of the Trade
- Montavon, G.¹ Orr, G.² Müller, K.-R.³

33
- 85030990177
- An overview of gradient descent optimization algorithms
- Ruder, S. An overview of gradient descent optimization algorithms. CoRR, abs/1609.04747, 2016.
- (2016) CoRR
- Ruder, S.¹

34
- 84963949906
- Mastering the game of go with deep neural networks and tree search
- Silver, D., Huang, A., Maddison, C. J., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
- (2016) Nature , vol.529 , Issue.7587 , pp. 484-489
- Silver, D.¹ Huang, A.² Maddison, C.J.³

35
- 84925410541
- arXiv preprint
- Simonyan, K., et al. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- (2014) Very Deep Convolutional Networks for Large-scale Image Recognition
- Simonyan, K.¹

36
- 85047015186
- ArXiv e-prints October
- Soudry, D., Hoffer, E., and Srebro, N. The Implicit Bias of Gradient Descent on Separable Data. ArXiv e-prints, October 2017.
- (2017) The Implicit Bias of Gradient Descent on Separable Data
- Soudry, D.¹ Hoffer, E.² Srebro, N.³

37
- 85046996349
- arXiv preprint
- Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
- (2017) Exponentially Vanishing Sub-optimal Local Minima in Multilayer Neural Networks
- Soudry, D.¹ Hoffer, E.²

38
- 84904163933
- Dropout: A simple way to prevent neural networks from overfitting
- Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
- (2014) Journal of Machine Learning Research , vol.15 , Issue.1 , pp. 1929-1958
- Srivastava, N.¹ Hinton, G.E.² Krizhevsky, A.³ Sutskever, I.⁴ Salakhutdinov, R.⁵

39
- 84892623436
- On the importance of initialization and momentum in deep learning
- Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139-1147, 2013.
- (2013) International Conference on Machine Learning , pp. 1139-1147
- Sutskever, I.¹ Martens, J.² Dahl, G.³ Hinton, G.⁴

40
- 84986296808
- Rethinking the inception architecture for computer vision
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, 2016.
- (2016) Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 2818-2826
- Szegedy, C.¹ Vanhoucke, V.² Ioffe, S.³ Shlens, J.⁴ Wojna, Z.⁵

41
- 84897550107
- Regularization of neural networks using dropconnect
- JMLR.org
- Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using dropconnect. ICML'13, pp. III-1058-III-1066. JMLR.org, 2013.
- (2013) ICML'13 , pp. III-1058-III-1066
- Wan, L.¹ Zeiler, M.² Zhang, S.³ LeCun, Y.⁴ Fergus, R.⁵

42
- 85013200323
- Google's neural machine translation system: Bridging the gap between human and machine translation
- Wu, Y., Schuster, M., Chen, Z., et al. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
- (2016) CoRR
- Wu, Y.¹ Schuster, M.² Chen, Z.³

43
- 85047006686
- arXiv preprint
- You, Y., Gitman, I., and Ginsburg, B. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 2017.
- (2017) Scaling Sgd Batch Size to 32k for Imagenet Training
- You, Y.¹ Gitman, I.² Ginsburg, B.³

44
- 85047020267
- Wide residual networks
- Zagoruyko, K. Wide residual networks. In BMVC, 2016.
- (2016) BMVC
- Zagoruyko, K.¹

45
- 85041447831
- Understanding deep learning requires rethinking generalization
- Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
- (2017) ICLR
- Zhang, C.¹ Bengio, S.² Hardt, M.³ Recht, B.⁴ Vinyals, O.⁵

46
- 84965152276
- Deep learning with elastic averaging sgd
- Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging sgd. In NIPS, pp. 685-693, 2015.
- (2015) NIPS , pp. 685-693
- Zhang, S.¹ Choromanska, A.E.² LeCun, Y.³

* This information was extracted by KISTI through its analysis of Elsevier's SCOPUS DB.