Data engineering pipeline as a service for MLOps initiatives

Bibliographic Details
Main Author: Fazenda, Miguel Filipe Rodrigues Almeida de Matos
Publication Date: 2023
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10773/41929
Summary: Companies today increasingly rely on data to ground their decisions in crucial information. To make the best use of that data, it must be processed and analyzed so that information can be extracted from it. However, information extraction can be a long and complex process because the data may be enormous in volume; handling such volumes is what the concept of Big Data refers to. To stay ahead of the competition, companies need systems designed according to Data Engineering principles in order to handle these large volumes of data. Data Engineering is a discipline that focuses on the construction of systems that can ingest, process, and store large amounts of data. The objective of this dissertation is the construction of a system, more precisely a pipeline, that can handle large volumes of data related to electronic products and apply ML models on top of it to predict the next value of the intended product. The predicted values should then be stored and served to users. The system is subject to constraints on architecture and tooling: it must be based on microservices, cloud-agnostic, containerized, orchestrated, based on the SMACK stack, and built with free and open-source tools. This system serves as an alternative for MLOps startups, which combine Data Engineering, DevOps, and ML to process data. The system is developed with the GitOps operational framework, which applies DevOps best practices, such as versioning, compliance, collaboration, and CI/CD, to infrastructure automation. The system passed through multiple iterations, each representing a stage of development: an initial iteration in Docker Compose to serve as a Proof of Concept, a middle one to adapt the pipeline to the Kubernetes environment and test it, and a final one on AWS through EKS, representing a real-life production scenario. The Kubernetes versions include monitoring to facilitate testing and observation of the system. In general, this document covers the tools chosen, the multiple versions of the pipeline and the objectives of each, and the results obtained and their meaning.
id RCAP_b70f185e6478adb7b01db9b2673b0d1c
oai_identifier_str oai:ria.ua.pt:10773/41929
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Data engineering pipeline as a service for MLOps initiatives
title Data engineering pipeline as a service for MLOps initiatives
author Fazenda, Miguel Filipe Rodrigues Almeida de Matos
author_facet Fazenda, Miguel Filipe Rodrigues Almeida de Matos
author_role author
dc.contributor.author.fl_str_mv Fazenda, Miguel Filipe Rodrigues Almeida de Matos
dc.subject.por.fl_str_mv AWS
Big data
Data engineering
DevOps
Docker
GitOps
MLOps
Pipeline
Kubernetes
publishDate 2023
dc.date.none.fl_str_mv 2023-12-04T00:00:00Z
2023-12-04
2024-05-23T12:53:35Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/41929
url http://hdl.handle.net/10773/41929
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597031249084416