Hive on spark and MapReduce : a methodology for parameter tuning
Main Author: | Forster, Rodrigo Richard |
---|---|
Publication Date: | 2018 |
Format: | Master thesis |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/10362/52854 |
Summary: | Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management |
id |
RCAP_ca2dad9351d49acf1ad2b7b94630f2a4 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/52854 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
spelling |
As the era of “big data” has arrived, more and more companies are using distributed file systems such as the Hadoop Distributed File System (HDFS) to manage and process their data streams. This software library offers a way to store large files across multiple machines. Large data sets are processed using its inherent programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost of up to 10 times for certain applications while maintaining automatic fault tolerance. Apache Hive was introduced to leverage the data warehouse capabilities of Hadoop. It is a concept for Big Data analytics that works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve the execution time of queries. Applied to a real-world batch-processing query, the methodology made the query 5 times faster. Moreover, it gives insight into the underlying reasons for this large improvement by using Apache Spark monitoring tools.
The result can be helpful for the many practitioners and researchers who would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster. |
dc.title.none.fl_str_mv |
Hive on spark and MapReduce : a methodology for parameter tuning |
author |
Forster, Rodrigo Richard |
author_role |
author |
dc.contributor.none.fl_str_mv |
Santos, Vitor Manuel Pereira Duarte dos; RUN |
dc.contributor.author.fl_str_mv |
Forster, Rodrigo Richard |
dc.subject.por.fl_str_mv |
Tuning Hive on Spark MapReduce Apache Spark Big Data HDFS Hadoop Data Warehouse |
description |
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018-11-26T14:59:01Z 2018-10-29 2018-10-29T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/52854 TID:202028755 |
url |
http://hdl.handle.net/10362/52854 |
identifier_str_mv |
TID:202028755 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833596443640725504 |