Enriching data analytics with incremental data cleaning and attribute domain management

Detalhes bibliográficos
Ano de defesa: 2021
Autor(a) principal: Oliveira, Paulo Henrique de
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Tese
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/
Resumo: In the present Big Data era, many businesses have become more data-driven, seeking to improve their decision-making processes based on solid Data Analytics practices. Several steps constitute the Data Analytics pipeline and all of them involve specific approaches and technologies, which are constantly evolving. In order to accommodate new needs and trends, there is always room for improvements in the steps of Data Analytics. In this context, this PhD research has focused on improving two of those steps: (i) data cleaning and (ii) data analysis. Regarding the first step, we addressed the problem of performing data cleaning incrementally, considering dynamic scenarios with incoming data batches, as well as holistically, that is, jointly taking into account multiple error detection criteria. As a result, we have developed an incremental data cleaning framework which significantly outperforms competitors, enabling higher efficiency while compromising little on repair quality, as well as addresses the problem in an innovative way, hence filling a gap in the literature. Regarding the second improved step, we addressed the problem of handling queries over an Attribute Domain, which consists of the set of stored values within a domain of attributes, usually across multiple relations. As a result, we have proposed three contributions: (a) the Domain Index, an access method for efficiently performing queries over Attribute Domains, which we refer to as Domain Queries; (b) a comprehensive case study of Domain Indexes applied to the medical domain, focusing on content-based Domain Queries for supporting physicians in decision-making; and (c) an approach for including support to Attribute Domains as first-class citizens in a Relational Database Management System (RDBMS). Together, those contributions target a distinct category of queries which, until the execution of this PhD research, had not been addressed in the literature elsewhere. Experimental results highlight the superior performance enabled by the Domain Index compared to existing techniques of modern RDBMSs, which not only are inefficient in several scenarios, but also are not always applicable. Ultimately, those contributions enrich data analyses down the road. Hence, this PhD research advances the state of the art in the field of Data Analytics, as well as opens several directions for future work.