Stream Ensemble: an ML model selection algorithm for stream data

Bibliographic details
Year of defense: 2024
Main author: Silva, Anderson Chaves da
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Laboratório Nacional de Computação Científica
Coordenação de Pós-Graduação e Aperfeiçoamento (COPGA)
Brazil
LNCC
Programa de pós-graduação em Modelagem Computacional
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://tede.lncc.br/handle/tede/408
Abstract: Predictive queries over spatiotemporal (ST) stream data are queries that apply predictive models to time-series data associated with specific geographic locations, with values that are continuously collected and processed. This continuous data flow often leads to dynamic, shifting data distributions that may vary significantly across space and time, exhibiting multiple distinct patterns that challenge predictive modeling. Assigning the task of handling such variations to a single machine learning model specialized in one data distribution often fails, since such a model may not capture the diverse patterns across different spatial and temporal regions. Traditional ensemble methods, which rely on the complementary use of multiple base models, often suffer from high execution costs and suboptimal performance on ST data because it is difficult to combine the contribution of each model accurately. Relying instead on a single globally trained model is also frequently challenging, for several reasons: the potential lack of sufficient data, the greater complexity and difficulty of training it compared with local models, and the inefficiency of training a new generalist model when effective specialist models already exist. To address this challenge, we propose a better-suited approach that considers each available model’s training data and generalization error, as well as the target data distributions, to optimize predictive accuracy, selecting the most adequate model for each set of time series. Based on these principles, we propose StreamEnsemble, a method that implements this approach. Our experimental evaluation shows that StreamEnsemble significantly outperforms traditional ensemble methods and single-model approaches in both accuracy and time, reducing prediction error on stream data by more than a factor of 10.
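The selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the StreamEnsemble algorithm itself, whose details are not given in this record. Every name here (Candidate, distribution_distance, select_model), the mean-and-spread distance, and the scoring rule are assumptions chosen only for illustration: each candidate model carries a sample of its training data and a known generalization error, and for each incoming window the candidate with the lowest distance-inflated error is chosen.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """A pre-trained model plus summary facts about it (hypothetical structure)."""
    name: str
    predict: Callable[[Sequence[float]], float]
    train_sample: List[float]      # sample of the data the model was trained on
    generalization_error: float    # e.g. a validation error measured at training time


def distribution_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Crude distance between two samples: gap in mean plus gap in spread.

    A stand-in only; any distribution distance could be substituted here.
    """
    return abs(mean(a) - mean(b)) + abs(pstdev(a) - pstdev(b))


def select_model(candidates: List[Candidate], window: Sequence[float]) -> Candidate:
    """Pick the candidate with the lowest estimated error on this window.

    Assumed heuristic: inflate each model's known error by how far the
    window lies from that model's training data.
    """
    def score(c: Candidate) -> float:
        distance = distribution_distance(c.train_sample, window)
        return c.generalization_error * (1.0 + distance)

    return min(candidates, key=score)


# Two toy specialists: one trained on low values, one on high values.
low = Candidate("low_regime", lambda w: mean(w), [1.0, 2.0, 3.0], 0.5)
high = Candidate("high_regime", lambda w: w[-1], [20.0, 21.0, 22.0], 0.5)

chosen_for_low = select_model([low, high], [1.5, 2.5, 2.0])
chosen_for_high = select_model([low, high], [19.0, 21.0, 23.0])
print(chosen_for_low.name, chosen_for_high.name)
```

In this toy setup each window is routed to the specialist whose training data it most resembles, so only one model runs per window instead of every base model as in a conventional ensemble.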