Bibliographic details
Defense year: 2022
Main author: Lovatto, Ângelo Gregório
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Dissertation
Access type: Open access
Language: English
Defense institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/45/45134/tde-28062022-123656/
Abstract:
Stochastic Value Gradient (SVG) methods underlie many recent achievements of model-based Reinforcement Learning (RL) agents in continuous state-action spaces. Such methods use data collected by exploration in the environment to produce a model of its dynamics, which is then used to approximate the gradient of the objective function w.r.t. the agent's parameters. Despite the practical significance of these methods, many algorithm design choices still lack rigorous theoretical or empirical justification. Instead, most works rely heavily on benchmark-centric evaluation methods, which confound the contributions of several components of an RL agent's design to the final performance. In this work, we propose a fine-grained analysis of core algorithmic components of SVGs, including: the gradient estimator formula, model learning and value function approximation. We implement a configurable benchmark environment based on the Linear Quadratic Gaussian (LQG) regulator, allowing us to compute the ground-truth SVG and compare it with learning approaches. We conduct our analysis on a range of LQG environments, evaluating the impact of each algorithmic component in prediction and control tasks. Our results show that a widely used gradient estimator induces a favorable bias-variance trade-off, using a biased expectation that yields better gradient estimates in smaller sample regimes than the unbiased expression for the gradient. On model learning, we show that overfitting to on-policy data may occur, leading to accurate state predictions but inaccurate gradients, highlighting the importance of exploration even in stochastic environments. We also show that value function approximation can be more unstable than model learning, even in simple linear environments. Finally, we evaluate performance when using the model for direct gradient estimation vs. for value function approximation, concluding that the former is more effective for both prediction and control.
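The abstract describes computing the ground-truth SVG in an LQG benchmark and comparing it against sample-based gradient estimates. The sketch below is a minimal, hypothetical illustration of that kind of comparison, not the dissertation's implementation: it uses the true (known) dynamics rather than a learned model, assumed dimensions, horizon, and cost matrices, and PyTorch autodiff both to obtain the exact gradient (by differentiating the closed-form expected cost) and to form a pathwise Monte Carlo SVG-style estimate.

```python
# Minimal sketch (illustrative assumptions only): compare a Monte Carlo
# stochastic value gradient (SVG) estimate against the exact policy gradient
# in a small finite-horizon LQG problem with a linear policy a = K s.
import torch

torch.manual_seed(0)

n, m, H = 3, 2, 10                             # state dim, action dim, horizon (assumed)
A = torch.eye(n) + 0.05 * torch.randn(n, n)    # dynamics: s' = A s + B a + w
B = 0.1 * torch.randn(n, m)
Q = torch.eye(n)                               # state cost  s^T Q s
R = 0.1 * torch.eye(m)                         # action cost a^T R a
sigma_w = 0.1                                  # process-noise std deviation
sigma_0 = 1.0                                  # initial-state std deviation

K = (0.1 * torch.randn(m, n)).requires_grad_() # linear policy parameters


def exact_expected_cost(K):
    """Closed-form expected cost of the linear policy over the horizon.

    With closed-loop matrix M = A + B K, the state covariance obeys
    S_{t+1} = M S_t M^T + sigma_w^2 I, and the expected per-step cost is
    tr((Q + K^T R K) S_t). Autodiff through this recursion yields the
    ground-truth gradient w.r.t. K.
    """
    M = A + B @ K
    C = Q + K.T @ R @ K
    S = sigma_0 ** 2 * torch.eye(n)
    cost = torch.zeros(())
    for _ in range(H):
        cost = cost + torch.trace(C @ S)
        S = M @ S @ M.T + sigma_w ** 2 * torch.eye(n)
    return cost


def svg_monte_carlo_cost(K, n_samples):
    """Pathwise (reparameterized) Monte Carlo estimate of the same expected cost.

    Noise is sampled outside the computation graph, so backpropagating through
    the rollout gives an SVG-style gradient estimate w.r.t. K.
    """
    s = sigma_0 * torch.randn(n_samples, n)
    cost = torch.zeros(())
    for _ in range(H):
        a = s @ K.T
        cost = cost + ((s @ Q) * s).sum(-1).mean() + ((a @ R) * a).sum(-1).mean()
        s = s @ A.T + a @ B.T + sigma_w * torch.randn(n_samples, n)
    return cost


# Ground-truth gradient from the analytic expected cost.
g_true, = torch.autograd.grad(exact_expected_cost(K), K)

# SVG estimates from increasingly large batches of sampled trajectories.
for n_samples in (10, 100, 1000):
    g_hat, = torch.autograd.grad(svg_monte_carlo_cost(K, n_samples), K)
    rel_err = (g_hat - g_true).norm() / g_true.norm()
    print(f"{n_samples:5d} trajectories: relative gradient error = {rel_err:.3f}")
```

Because this sketch differentiates through the exact dynamics, the pathwise estimate is unbiased and its error shrinks as more trajectories are sampled; the dissertation's analysis concerns the additional bias and variance introduced when the dynamics model and value function are themselves learned from data.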