Efficient processing of multiway spatial join queries in distributed systems

Oliveira, Thiago Borges de

Bibliographic details
Year of defense:	2017
Main author:	Oliveira, Thiago Borges de
Advisor:	Costa, Fábio Moreira
Examination committee:	Costa, Fábio Moreira; Foulds, Leslie Richard; Rodrigues, Vagner José do Sacramento; Braghetto, Kelly Rosa; Meneses, Cláudio Nogueira de
Document type:	Thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Goiás
Graduate program:	Programa de Pós-graduação em Ciência da Computação em Rede UFG/UFMS (INF)
Department:	Instituto de Informática - INF (RG)
Country:	Brazil
Keywords in Portuguese:	Multi-junção espacial distribuída; Otimizador baseado em custos; Escalonamento de tarefas; Histogramas
Keywords in English:	Distributed multiway spatial join; Cost-based optimizer; Job scheduling; Histograms
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
Access link:	http://repositorio.bc.ufg.br/tede/handle/tede/8033
Abstract:	Multiway spatial join is an important type of query in spatial data processing, and its efficient execution is a requirement for moving spatial data analysis to scalable platforms, as has already happened with relational and unstructured data. In this thesis, we provide a comprehensive set of models and methods to efficiently execute multiway spatial join queries in distributed systems. We introduce a cost-based optimizer that selects a good execution plan for processing such queries in distributed systems, taking into account: the partitioning of data based on the spatial attributes of the datasets; the intra-operator level of parallelism, which enables high scalability; and the economy of cluster resources, achieved by appropriately scheduling the queries before execution. We propose a cost model based on relevant metadata about the spatial datasets and the data distribution, which identifies the pattern of costs incurred when processing a query in this environment. We formalize the distributed multiway spatial join plan scheduling problem as a bi-objective linear integer model, with the minimization of both the makespan and the communication cost as objectives. We propose three methods to compute schedules based on this model, which significantly reduce the resource consumption required to process a query. Although they target multiway spatial join query scheduling, these methods can be applied to other kinds of problems in distributed systems, notably problems that require both the alignment of data partitions and the assignment of jobs to machines. Additionally, we propose a method to control the usage of resources and increase system throughput in the presence of constraints on the network or processing capacity. The proposed cost-based optimizer selected good execution plans for all queries in our experiments, which used public datasets with a significant range of sizes and complex spatial objects. We also present an execution engine that executes the queries with near-linear scalability with respect to execution time.
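
The trade-off between makespan and communication cost mentioned in the abstract can be illustrated with a toy example. The sketch below is not the thesis's model or any of its three methods: it is a brute-force search over job-to-machine assignments that scores each one by a weighted sum of the two objectives. All names (`schedule_cost`, `best_schedule`, `partition_home`, `transfer_cost`, `weight`) and the unit transfer cost are assumptions made for illustration only.

```python
from itertools import product


def schedule_cost(assignment, job_cost, partition_home, transfer_cost=1):
    """Return (makespan, communication cost) for one job-to-machine assignment.

    assignment[j]     -- machine chosen for job j
    job_cost[j]       -- processing cost of job j (hypothetical units)
    partition_home[j] -- machines that store the input partitions of job j
    """
    load = {}
    comm = 0
    for job, machine in enumerate(assignment):
        load[machine] = load.get(machine, 0) + job_cost[job]
        # Each input partition stored on another machine must be transferred.
        for home in partition_home[job]:
            if home != machine:
                comm += transfer_cost
    # Makespan is the load of the busiest machine.
    return max(load.values()), comm


def best_schedule(n_machines, job_cost, partition_home,
                  transfer_cost=1, weight=0.5):
    """Exhaustively search all assignments (feasible only for tiny inputs).

    The two objectives are combined into one score:
    weight * makespan + (1 - weight) * communication cost.
    Returns (score, assignment, makespan, communication cost).
    """
    best = None
    for assignment in product(range(n_machines), repeat=len(job_cost)):
        makespan, comm = schedule_cost(
            assignment, job_cost, partition_home, transfer_cost)
        score = weight * makespan + (1 - weight) * comm
        if best is None or score < best[0]:
            best = (score, assignment, makespan, comm)
    return best


if __name__ == "__main__":
    # Three jobs, two machines; each job reads two partitions.
    print(best_schedule(2, [4, 3, 2], [[0, 1], [1, 1], [0, 0]]))
```

Changing `weight` shifts the result between schedules that balance load and schedules that keep jobs close to their data. Exhaustive search grows exponentially with the number of jobs, so realistic instances would need an integer programming solver or a heuristic instead.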
