Static Analysis for Detection of Defects in Machine Learning Pipelines

Bibliographic details
Main author: Silva, Pedro Miguel Alcântara da
Publication date: 2024
Document type: Dissertation
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10400.5/97300
Abstract: Master's thesis, Informatics Engineering, 2024, Universidade de Lisboa, Faculdade de Ciências
id RCAP_2fff03a0082867dd9304c2757f65d1a2
oai_identifier_str oai:repositorio.ulisboa.pt:10400.5/97300
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Static Analysis for Detection of Defects in Machine Learning Pipelines
Static Verification
Domain-Specific Language
Machine Learning
Pipeline
Formal Specification
Master's theses - 2024
Department of Informatics
Master's thesis, Informatics Engineering, 2024, Universidade de Lisboa, Faculdade de Ciências
Machine Learning is becoming ubiquitous, with its techniques finding use in every part of society. We are now witnessing an explosion in ML-based tools, such as the popular ChatGPT, made possible by advances in hardware that enable large-scale data processing. Most importantly, the rise of Machine Learning is related to the release of multiple frameworks and libraries that abstract its complexities, thus increasing its accessibility. These tools are used to implement the pipelines that automate the workflow needed to create an ML model, from data preprocessing to model learning and evaluation. However, these pipelines can contain domain-specific defects that are not trivial to find by inspecting the code. These defects are caused by flawed methodologies related to the semantics of pipeline components, data, or other concepts specific to data science. An example of such a defect is the incorrect handling of time-series data when building datasets, such as shuffling time-series instances before the train/test split. Semantic defects are difficult to detect and prevent, and they reach production silently, causing training-serving skew. Unfortunately, unlike in typical software development, pipeline testing is not feasible, forcing us to explore alternatives. With a focus on supervised machine learning, this work identified relevant semantic defects by drawing on the community of ML developers and data scientists, and on the academic and grey literature. To tackle the defects, we developed a domain-specific language (DSL) capable of describing pipeline structure and the properties of its components and data sources. We also created a static analyser to automate defect detection in pipelines specified using the DSL. The verification process relies on the formal specification of pipeline components. To evaluate the solution, we modelled pipelines containing the relevant defects we identified. The solution successfully detected all the defects present in the pipelines.
Fonseca, Alcides Miguel Cachulo Aguiar
Lopes, Maria Antónia Bacelar da Costa, 1968-
Repositório da Universidade de Lisboa
Silva, Pedro Miguel Alcântara da
2025-01-17T12:48:25Z
2024
2024
2024-01-01T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10400.5/97300
TID:203875524
eng
info:eu-repo/semantics/openAccess
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
2025-03-17T16:32:05Z
oai:repositorio.ulisboa.pt:10400.5/97300
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
info@rcaap.pt
opendoar:https://opendoar.ac.uk/repository/7160
2025-05-29T04:18:29.618465
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
false
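The time-series defect named in the abstract (shuffling instances before the train/test split) can be illustrated with a minimal sketch. This is not the thesis's DSL or its static analyser, neither of which is shown in this record; the `Step` type, the step kinds, and `find_shuffle_before_split` are hypothetical names chosen only to show how a check over a pipeline description, rather than over running code, might flag the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of a pipeline description (hypothetical, not the thesis DSL)."""
    name: str  # free-form label for the step
    kind: str  # assumed vocabulary: "load", "shuffle", "split", "train"


def find_shuffle_before_split(steps, is_time_series):
    """Return one warning per shuffle step that precedes the train/test split.

    The check inspects only the declared step order and the declared
    data property; nothing in the pipeline is executed.
    """
    if not is_time_series:
        return []  # shuffling before the split is acceptable for i.i.d. data
    warnings = []
    for step in steps:
        if step.kind == "split":
            break  # steps after the split are outside this check
        if step.kind == "shuffle":
            warnings.append(
                f"'{step.name}' shuffles time-series data before the train/test split"
            )
    return warnings


# A defective pipeline description and a corrected one.
defective = [
    Step("load_prices", "load"),
    Step("shuffle_rows", "shuffle"),
    Step("holdout", "split"),
    Step("fit_model", "train"),
]
clean = [
    Step("load_prices", "load"),
    Step("holdout", "split"),
    Step("fit_model", "train"),
]

print(find_shuffle_before_split(defective, is_time_series=True))
print(find_shuffle_before_split(clean, is_time_series=True))
```

The sketch hard-codes a single rule; the abstract describes a more general approach in which verification relies on formal specifications of the pipeline components.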
dc.title.none.fl_str_mv Static Analysis for Detection of Defects in Machine Learning Pipelines
title Static Analysis for Detection of Defects in Machine Learning Pipelines
spellingShingle Static Analysis for Detection of Defects in Machine Learning Pipelines
Silva, Pedro Miguel Alcântara da
Static Verification
Domain-Specific Language
Machine Learning
Pipeline
Formal Specification
Master's theses - 2024
Department of Informatics
title_short Static Analysis for Detection of Defects in Machine Learning Pipelines
title_full Static Analysis for Detection of Defects in Machine Learning Pipelines
title_fullStr Static Analysis for Detection of Defects in Machine Learning Pipelines
title_full_unstemmed Static Analysis for Detection of Defects in Machine Learning Pipelines
title_sort Static Analysis for Detection of Defects in Machine Learning Pipelines
author Silva, Pedro Miguel Alcântara da
author_facet Silva, Pedro Miguel Alcântara da
author_role author
dc.contributor.none.fl_str_mv Fonseca, Alcides Miguel Cachulo Aguiar
Lopes, Maria Antónia Bacelar da Costa, 1968-
Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Silva, Pedro Miguel Alcântara da
dc.subject.por.fl_str_mv Static Verification
Domain-Specific Language
Machine Learning
Pipeline
Formal Specification
Master's theses - 2024
Department of Informatics
topic Static Verification
Domain-Specific Language
Machine Learning
Pipeline
Formal Specification
Master's theses - 2024
Department of Informatics
description Master's thesis, Informatics Engineering, 2024, Universidade de Lisboa, Faculdade de Ciências
publishDate 2024
dc.date.none.fl_str_mv 2024
2024
2024-01-01T00:00:00Z
2025-01-17T12:48:25Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.5/97300
TID:203875524
url http://hdl.handle.net/10400.5/97300
identifier_str_mv TID:203875524
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833602013376544768