Data engineering pipeline as a service for MLOps initiatives

Fazenda, Miguel Filipe Rodrigues Almeida de Matos

Bibliographic Details
Main Author:	Fazenda, Miguel Filipe Rodrigues Almeida de Matos
Publication Date:	2023
Format:	Master thesis
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/10773/41929
Summary:	Companies today increasingly need data to ground their decisions. To make the best use of it, data needs to be processed and analyzed so that information can be extracted from it. However, extracting information from data can be a long and complex process, because the data may come in enormous volumes. Handling such large volumes of data is what the concept of Big Data refers to. To stay ahead of the competition, companies need systems designed according to Data Engineering principles in order to handle these large volumes of data. Data Engineering is a discipline that focuses on the construction of systems that can ingest, process, and store large amounts of data. The objective of this dissertation is the construction of a system, more precisely a pipeline, that can handle large volumes of data related to electronic products and apply ML models on top of it to predict the next value of the intended product. The predicted values should then be stored and served to users. The system is subject to constraints on architecture and tooling: it must be based on microservices, cloud-agnostic, containerized, orchestrated, based on the SMACK stack, and built with free and open-source tools. This system serves as an alternative for MLOps startups, which combine Data Engineering, DevOps, and ML to process data. The system is developed with the GitOps operational framework, which applies DevOps best practices, such as versioning, compliance, collaboration, and CI/CD, to infrastructure automation. The system passed through multiple iterations, each representing a stage of the development: an initial iteration in Docker Compose to serve as a Proof of Concept, a middle one to adapt the pipeline to the Kubernetes environment and test it, and a final one on AWS through EKS, representing a real-life production scenario. The Kubernetes versions include monitoring to facilitate testing and observation of the system. In general, this document covers the tools chosen, the multiple versions of the pipeline and the objectives of each, and the results obtained and their meaning.

Item metadata

id	RCAP_b70f185e6478adb7b01db9b2673b0d1c
oai_identifier_str	oai:ria.ua.pt:10773/41929
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
spelling	Data engineering pipeline as a service for MLOps initiatives; AWS; Big data; Data engineering; DevOps; Docker; GitOps; MLOps; Pipeline; Kubernetes; 2024-05-23T12:53:35Z; 2023-12-04T00:00:00Z; 2023-12-04; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10773/41929; eng; Fazenda, Miguel Filipe Rodrigues Almeida de Matos; info:eu-repo/semantics/openAccess; reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP; 2024-05-27T01:46:58Z; oai:ria.ua.pt:10773/41929; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; info@rcaap.pt; opendoar:https://opendoar.ac.uk/repository/7160; 2025-05-28T17:53:00.983966; Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; false
dc.title.none.fl_str_mv	Data engineering pipeline as a service for MLOps initiatives
title	Data engineering pipeline as a service for MLOps initiatives
spellingShingle	Data engineering pipeline as a service for MLOps initiatives Fazenda, Miguel Filipe Rodrigues Almeida de Matos AWS Big data Data engineering DevOps Docker GitOps MLOps Pipeline Kubernetes
title_short	Data engineering pipeline as a service for MLOps initiatives
title_full	Data engineering pipeline as a service for MLOps initiatives
title_fullStr	Data engineering pipeline as a service for MLOps initiatives
title_full_unstemmed	Data engineering pipeline as a service for MLOps initiatives
title_sort	Data engineering pipeline as a service for MLOps initiatives
author	Fazenda, Miguel Filipe Rodrigues Almeida de Matos
author_facet	Fazenda, Miguel Filipe Rodrigues Almeida de Matos
author_role	author
dc.contributor.author.fl_str_mv	Fazenda, Miguel Filipe Rodrigues Almeida de Matos
dc.subject.por.fl_str_mv	AWS; Big data; Data engineering; DevOps; Docker; GitOps; MLOps; Pipeline; Kubernetes
topic	AWS; Big data; Data engineering; DevOps; Docker; GitOps; MLOps; Pipeline; Kubernetes
description	Companies today increasingly need data to ground their decisions. To make the best use of it, data needs to be processed and analyzed so that information can be extracted from it. However, extracting information from data can be a long and complex process, because the data may come in enormous volumes. Handling such large volumes of data is what the concept of Big Data refers to. To stay ahead of the competition, companies need systems designed according to Data Engineering principles in order to handle these large volumes of data. Data Engineering is a discipline that focuses on the construction of systems that can ingest, process, and store large amounts of data. The objective of this dissertation is the construction of a system, more precisely a pipeline, that can handle large volumes of data related to electronic products and apply ML models on top of it to predict the next value of the intended product. The predicted values should then be stored and served to users. The system is subject to constraints on architecture and tooling: it must be based on microservices, cloud-agnostic, containerized, orchestrated, based on the SMACK stack, and built with free and open-source tools. This system serves as an alternative for MLOps startups, which combine Data Engineering, DevOps, and ML to process data. The system is developed with the GitOps operational framework, which applies DevOps best practices, such as versioning, compliance, collaboration, and CI/CD, to infrastructure automation. The system passed through multiple iterations, each representing a stage of the development: an initial iteration in Docker Compose to serve as a Proof of Concept, a middle one to adapt the pipeline to the Kubernetes environment and test it, and a final one on AWS through EKS, representing a real-life production scenario. The Kubernetes versions include monitoring to facilitate testing and observation of the system. In general, this document covers the tools chosen, the multiple versions of the pipeline and the objectives of each, and the results obtained and their meaning.
publishDate	2023
dc.date.none.fl_str_mv	2023-12-04T00:00:00Z 2023-12-04 2024-05-23T12:53:35Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10773/41929
url	http://hdl.handle.net/10773/41929
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833597031249084416

Similar Items