Stream Ensemble: an ML model selection algorithm for stream data
Year of defense: | 2024 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | eng |
Defense institution: | Laboratório Nacional de Computação Científica, Coordenação de Pós-Graduação e Aperfeiçoamento (COPGA) Brasil LNCC Programa de pós-graduação em Modelagem Computacional |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | https://tede.lncc.br/handle/tede/408 |
Abstract: | Predictive queries over spatiotemporal (ST) stream data apply predictive models to time series associated with specific geographic locations, whose values are continuously collected and processed. This continuous data flow often produces shifting data distributions that vary significantly across space and time and exhibit multiple distinct patterns, which challenges predictive modeling. A single machine learning model specialized in one data distribution often fails to handle such variations, since it may not capture the diverse patterns across different spatial and temporal regions. Traditional ensemble methods, which rely on the complementary use of multiple base models, often suffer from high execution costs and suboptimal performance on ST data because it is difficult to combine the contribution of each model accurately. Relying instead on a single globally trained model is also frequently challenging, for several reasons: sufficient data may not be available, such a model is more complex and harder to train than local models, and training a new generalist model is inefficient when effective specialist models already exist. To address this challenge, we propose a better-suited approach that selects the most adequate model for each set of time series by considering each available model’s training data and generalization error together with the target data distribution, in order to optimize predictive accuracy. Based on these principles, we introduce StreamEnsemble, a method that implements the proposed approach. Our experimental evaluation shows that StreamEnsemble significantly outperforms traditional ensemble methods and single-model approaches in both accuracy and time, reducing prediction error on stream data by more than a factor of 10. |
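
The selection idea described in the abstract can be illustrated with a minimal sketch. This is not the StreamEnsemble algorithm itself, whose details the abstract does not give. It assumes each candidate model carries a sample of its training data and a known generalization error, and it ranks candidates by that error plus a quantile-based distance between the training sample and the incoming window. The names `Candidate`, `distribution_distance`, `select_model`, and the `weight` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """A pre-trained model plus the metadata used to rank it (hypothetical structure)."""
    name: str
    predict: Callable[[Sequence[float]], float]  # maps a window to a one-step forecast
    training_sample: Sequence[float]             # values representative of its training data
    generalization_error: float                  # e.g. a validation error, assumed known


def quantiles(values: Sequence[float], k: int = 20) -> List[float]:
    """Return k evenly spaced empirical quantiles (linear interpolation).

    Assumes `values` is non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    out = []
    for i in range(k):
        pos = (i + 0.5) / k * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        out.append(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))
    return out


def distribution_distance(a: Sequence[float], b: Sequence[float], k: int = 20) -> float:
    """Mean absolute gap between matching quantiles of two samples.

    An assumed stand-in for whatever distribution comparison the thesis uses.
    """
    qa, qb = quantiles(a, k), quantiles(b, k)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / k


def select_model(window: Sequence[float], candidates: Sequence[Candidate],
                 weight: float = 1.0) -> Candidate:
    """Pick the candidate minimizing error + weight * distance to the window."""
    return min(
        candidates,
        key=lambda c: c.generalization_error
        + weight * distribution_distance(window, c.training_sample),
    )


# Two toy specialists trained on very different value ranges.
low = Candidate("low_regime", lambda w: sum(w) / len(w), [0.0, 1.0, 2.0, 3.0], 0.5)
high = Candidate("high_regime", lambda w: w[-1], [100.0, 101.0, 102.0, 103.0], 0.4)

chosen = select_model([1.0, 2.0, 1.5, 2.5], [low, high])
print(chosen.name)  # low_regime
```

In this toy setup the `high_regime` model has the lower generalization error, yet the window is far from its training data, so the specialist whose training distribution matches the window is chosen. In a streaming setting the same selection would presumably be repeated per window and per group of time series.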