Hive on spark and MapReduce : a methodology for parameter tuning

Bibliographic Details
Main Author: Forster, Rodrigo Richard
Publication Date: 2018
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10362/52854
Summary: Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
id RCAP_ca2dad9351d49acf1ad2b7b94630f2a4
oai_identifier_str oai:run.unl.pt:10362/52854
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Hive on spark and MapReduce : a methodology for parameter tuningTuningHive on SparkMapReduceApache SparkBig DataHDFSHadoopData WarehouseProject Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies ManagementAs the era of “big data” has arrived, more and more companies start using distributed file systems to manage and process their data streams like the Hadoop distributed file system framework (HDFS). This software library offers a way to store large files across multiple machines. Large data sets are processed by using its inherent programming model MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost up to 10 times for certain applications, while maintaining its automatic fault tolerance. To leverage the Data Warehouse capabilities of Hadoop Apache Hive was introduced. It is a concept for Big Data analytics that works on top of Hadoop and provides data analysis tools and most importantly translates queries to MapReduce and Spark jobs. Therefore, it exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine. This results in very long execution times. Therefore, this project work gives researches and companies a tuning methodology that significantly can improve the execution time of queries. As a result, this tuning methodology could optimize a real-world batch-processing query by 5 times. Moreover, it gives insides in the underlying reasons of this big improvement by using Apache Spark Monitoring tools. The result can be helpful for many practitioners and researchers that would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.Santos, Vitor Manuel Pereira Duarte dosRUNForster, Rodrigo Richard2018-11-26T14:59:01Z2018-10-292018-10-29T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/10362/52854TID:202028755enginfo:eu-repo/semantics/openAccessreponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiainstacron:RCAAP2024-05-22T17:35:44Zoai:run.unl.pt:10362/52854Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireinfo@rcaap.ptopendoar:https://opendoar.ac.uk/repository/71602025-05-28T17:06:53.036272Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiafalse
dc.title.none.fl_str_mv Hive on spark and MapReduce : a methodology for parameter tuning
title Hive on spark and MapReduce : a methodology for parameter tuning
spellingShingle Hive on spark and MapReduce : a methodology for parameter tuning
Forster, Rodrigo Richard
Tuning
Hive on Spark
MapReduce
Apache Spark
Big Data
HDFS
Hadoop
Data Warehouse
title_short Hive on spark and MapReduce : a methodology for parameter tuning
title_full Hive on spark and MapReduce : a methodology for parameter tuning
title_fullStr Hive on spark and MapReduce : a methodology for parameter tuning
title_full_unstemmed Hive on spark and MapReduce : a methodology for parameter tuning
title_sort Hive on spark and MapReduce : a methodology for parameter tuning
author Forster, Rodrigo Richard
author_facet Forster, Rodrigo Richard
author_role author
dc.contributor.none.fl_str_mv Santos, Vitor Manuel Pereira Duarte dos
RUN
dc.contributor.author.fl_str_mv Forster, Rodrigo Richard
dc.subject.por.fl_str_mv Tuning
Hive on Spark
MapReduce
Apache Spark
Big Data
HDFS
Hadoop
Data Warehouse
topic Tuning
Hive on Spark
MapReduce
Apache Spark
Big Data
HDFS
Hadoop
Data Warehouse
description Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
publishDate 2018
dc.date.none.fl_str_mv 2018-11-26T14:59:01Z
2018-10-29
2018-10-29T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/52854
TID:202028755
url http://hdl.handle.net/10362/52854
identifier_str_mv TID:202028755
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833596443640725504