Journal of Machine Learning Research, Volume 15, 2014, Pages 809-883

Policy evaluation with temporal differences: A survey and comparison

Author keywords

Policy evaluation; Reinforcement learning; Temporal differences; Value function estimation

Indexed keywords

REINFORCEMENT LEARNING; UNCERTAINTY ANALYSIS

EID: 84899800132     PISSN: 1532-4435     EISSN: 1533-7928     Source Type: Journal
DOI: None     Document Type: Review
Times cited: 274

References (89)
  • 1. J. S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(September):220-227, 1975.
  • 2. S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, Feb. 1998.
  • 3. A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89-129, 2008.
  • 4. F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, 2011.
  • 6. P. Balakrishna, R. Ganesan, and L. Sherry. Accuracy of reinforcement learning algorithms for predicting aircraft taxi-out times: A case-study of Tampa Bay departures. Transportation Research Part C: Emerging Technologies, 18(6):950-962, 2010.
  • 8. D. P. Bertsekas and H. Yu. Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics, 227(1):27-50, 2009.
  • 9. J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233-246, 2002. doi:10.1023/A:1017936530646
  • 10. S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
  • 11. E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.
  • 12. D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207-239, 2006.
  • 14. R. H. Crites and A. G. Barto. Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262, 1998.
  • 23. A.-M. Farahmand and C. Szepesvári. Model selection in reinforcement learning. Machine Learning, 85(3):299-332, 2011.
  • 37. P. W. Glynn and D. L. Iglehart. Importance sampling for stochastic simulations. Management Science, 35(11):1367-1392, 1989.
  • 47. R. M. Kretchmar and C. W. Anderson. Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning. In International Conference on Neural Networks, 1997.
  • 50. L. Li. A worst-case comparison between temporal difference and residual gradient with linear function approximation. In Proceedings of the 25th International Conference on Machine Learning, 2008.
  • 54. S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169-2231, 2007.
  • 56. I. Menache, S. Mannor, and N. Shimkin. Basis function adaptation in temporal difference reinforcement learning. Annals of Operations Research, 134(1):215-238, 2005. doi:10.1007/s10479-005-5732-z
  • 58. A. Nedic and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems, 13(1-2):79-110, 2003.
  • 67
  • 71. A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210-229, 1959.
  • 72. B. Scherrer. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the 27th International Conference on Machine Learning, 2010.
  • 74. R. Schoknecht. Optimality of reinforcement learning algorithms with linear function approximation. In Advances in Neural Information Processing Systems 15, 2002.
  • 78. R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
  • 81. R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, 2008.
  • 84. G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215-219, 1994.
  • 85. J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
  • 89. P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468-3497, 2009.


* This information was extracted by KISTI through analysis of Elsevier's SCOPUS database.