Volume 15, 2001, Pages 319-350

Infinite-horizon policy-gradient estimation

Author keywords

[No Author keywords available]

Indexed keywords

GRADIENT-BASED APPROACHES; POLICY PARAMETERS; VALUE-FUNCTION METHODS

EID: 0013535965     PISSN: 1076-9757     EISSN: None     Source Type: Journal
DOI: 10.1613/jair.806     Document Type: Article
Times cited: 676

References (49)
  • 1
    • Aberdeen, D., & Baxter, J. (2001). Policy-gradient learning of controllers with internal state. Tech. rep., Australian National University.
  • 4
    • Bartlett, P. L., & Baxter, J. (1999). Hebbian synaptic modifications in spiking neurons that learn. Tech. rep., Research School of Information Sciences and Engineering, Australian National University. http://csl.anu.edu.au/~bartlett/papers/BartlettBaxter-Nov99.ps.gz
  • 5
    • Bartlett, P. L., & Baxter, J. (2001). Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and System Sciences, 62. Invited paper: Special Issue on COLT 2000.
  • 8
    • Baxter, J., Tridgell, A., & Weaver, L. (2000). Learning to play chess using temporal-differences. Machine Learning, 40(3), 243-263.
  • 12
    • Cao, X.-R., & Wan, Y.-W. (1998). Algorithms for sensitivity analysis of Markov chains through potentials and perturbation realization. IEEE Transactions on Control Systems Technology, 6, 482-492.
  • 15
    • Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33, 75-84.
  • 16
    • Glynn, P. W., & L'Ecuyer, P. (1995). Likelihood ratio gradient estimation for regenerative stochastic recursions. Advances in Applied Probability, 27(4), 1019-1053.
  • 18
    • Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In Tesauro, G., Touretzky, D., & Leen, T. (Eds.), Advances in Neural Information Processing Systems, Vol. 7. MIT Press, Cambridge, MA.
  • 19
    • Kimura, H., & Kobayashi, S. (1998a). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. In Fifteenth International Conference on Machine Learning, pp. 278-286.
  • 20
    • Kimura, H., & Kobayashi, S. (1998b). Reinforcement learning for continuous action using stochastic gradient ascent. In Intelligent Autonomous Systems (IAS-5), pp. 288-295.
  • 25
    • Marbach, P., & Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes. Tech. rep., MIT.
  • 26
    • Meuleau, N., Peshkin, L., Kaelbling, L. P., & Kim, K.-E. (2000). Off-policy policy search. Tech. rep., MIT Artificial Intelligence Laboratory.
  • 30
    • Reiman, M. I., & Weiss, A. (1989). Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37.
  • 32
    • Rubinstein, R. Y. (1991). How to optimize complex stochastic systems from a single sample path by the score function method. Annals of Operations Research, 27, 175-211.
  • 33
    • Rubinstein, R. Y. (1992). Decomposable score function estimators for sensitivity analysis and optimization of queueing networks. Annals of Operations Research, 39, 195-229.
  • 36
    • Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210-229.
  • 40
    • Smallwood, R. D., & Sondik, E. J. (1973). The optimal control of partially observable Markov decision processes over a finite horizon. Operations Research, 21, 1071-1098.
  • 41
    • Sondik, E. J. (1978). The optimal control of partially observable Markov decision processes over the infinite horizon: Discounted costs. Operations Research, 26.
  • 44
    • Tao, N., Baxter, J., & Weaver, L. (2001). A multi-agent, policy-gradient approach to network routing. Tech. rep., Australian National University.
  • 45
    • Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8, 257-278.
  • 46
    • Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215-219.
  • 47
    • Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674-690.
  • 48
    • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229-256.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.