A data-driven systematic, consistent and non-exhaustive approach to Model Selection

Bibliographic details
Year of defense: 2022
Main author: Marcondes, Diego Ribeiro
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Biblioteca Digital de Teses e Dissertações da USP
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/45/45132/tde-09082022-154351/
Abstract: Modern science consists of conceiving a set of hypotheses to explain observable phenomena, confronting them with reality, and keeping as possible explanations only those hypotheses which have not yet been falsified. Such a set of hypotheses is called a model, hence an important step of the scientific method is to select a model. Under a Statistical Learning framework, this consists of selecting a model among candidates based on quantitative evidence, and then learning hypotheses on it by minimizing an empirical risk function. The need to select a model, rather than taking the union of the candidates as the set of possible hypotheses, arises from the liability to overfitting, which gives rise to a complexity-bias trade-off. If we choose a highly complex model, it may contain hypotheses which explain the underlying process very well, but it may also contain hypotheses which explain only the empirical data very well, and it is not clear how to separate the two, so we overfit the data. If we choose a simpler model, the hypotheses which best fit the empirical data may be the same ones that best explain the process, yet they may not explain it very well, since better hypotheses may lie outside the model, so learning on this model incurs a bias. Therefore, properly choosing the model is an important part of solving a learning problem, and it is performed via Model Selection. This thesis proposes a data-driven, systematic, consistent and non-exhaustive approach to Model Selection. The main feature of the approach is the collection of candidate models, which we call a Learning Space; when seen as a set partially ordered by inclusion, it may have a rich structure which enhances the quality of learning. The approach is data-driven since the only components chosen a priori are the Learning Space and the risk function, so everything else is determined by the data. It is systematic since it consists of a formal two-step procedure: select a model from the Learning Space and then learn hypotheses on it. From a statistical point of view, there is a target model among the candidates, namely the one with the lowest bias and complexity, and the approach is consistent since, as the sample size increases, the selected model converges to the target with probability one, and the estimation errors related to learning hypotheses on it converge in probability to zero. We establish U-curve properties of the Learning Spaces which imply the existence of U-curve algorithms that can optimally estimate the target model without an exhaustive search, and which can also be efficiently implemented to obtain suboptimal solutions. The main implication of the approach is the existence of instances in which a lack of data may be mitigated by high computational power, a property which may underlie the demand of modern learning methods for high-performance computing. We illustrate the approach on simulated and real data to learn on the important Partition Lattice Learning Space, to forecast binary sequences under a Markov Chain framework, to learn multilayer W-operators, and to filter binary images via the learning of interval Boolean functions.
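
To make the two-step procedure in the abstract concrete, the sketch below (not from the thesis; the polynomial Learning Space, the held-out risk estimate, and names such as `u_curve_select` are assumptions made for this example) uses the simplest kind of Learning Space, a chain of nested polynomial models. For each candidate model it learns the empirical-risk-minimizing hypothesis on training data, estimates the model's risk on validation data, and stops the search at the first increase in estimated risk, the kind of early stopping that a U-curve property would license.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a cubic signal plus noise. The "Learning Space" here is a
# chain of nested models: polynomials of degree 0, 1, 2, ...
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(0.0, 0.1, x.size)

# Split into training data (to learn hypotheses by empirical risk
# minimization) and validation data (to estimate each model's risk).
idx = rng.permutation(x.size)
tr, va = idx[:150], idx[150:]

def fit_and_risk(degree):
    """Learn the ERM hypothesis in the degree-d polynomial model and
    return it with its estimated risk (validation mean squared error)."""
    coef = np.polyfit(x[tr], y[tr], degree)
    pred = np.polyval(coef, x[va])
    return coef, float(np.mean((y[va] - pred) ** 2))

def u_curve_select(max_degree=10):
    """Walk up the chain of nested models. Under a (hypothetical) U-curve
    assumption the estimated risk first decreases and then increases, so
    the search can stop at the first uptick instead of being exhaustive."""
    best_deg, best_coef, best_risk = None, None, np.inf
    for d in range(max_degree + 1):
        coef, risk = fit_and_risk(d)
        if risk < best_risk:
            best_deg, best_coef, best_risk = d, coef, risk
        elif risk > best_risk:
            break  # risk started to rise: we are past the valley
    return best_deg, best_coef, best_risk

deg, coef, risk = u_curve_select()
print(f"selected model: degree {deg}, estimated risk {risk:.4f}")
```

On a chain the U-curve search degenerates to walking a line; the Learning Spaces studied in the thesis, such as the Partition Lattice, are general posets ordered by inclusion, so U-curve algorithms there navigate a lattice of models rather than a single sequence of degrees.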