Algoritmo paralelo e eficiente para o problema de pareamento de dados

Detalhes bibliográficos
Ano de defesa: 2008
Autor(a) principal: Walter dos Santos Filho
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Universidade Federal de Minas Gerais
UFMG
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: http://hdl.handle.net/1843/RVMR-7L3Q3V
Resumo: In a world where the information is becoming more important each day, the availability of reliable and consistent databases is essential for decision-making, trend analysis, fraud detection, data mining, customer support, and business intelligence, among other data-intensive applications. In order to sustain data quality standards, it is frequentlynecessary to discard replicas and consolidate the information.In this work we introduce a tool named FERAPARDA (from the Portuguese acronym for \tool for record linkage"). It allows the combination of information from several sources through probabilistic record linkage. The linkage process is based on building and comparing pairs of records in a per attribute basis, that is, matching names, addresses and other attributes that are not unique identiers, and nding replicas probabilistically. Large databases containing thousands and even millions of records are quite common, and they usually present several problems such as missing and inconsistent data, input errors or even replicated information. These problems and the database size result in a need for comparing a large number of pairs of records (presenting a quadratic complexity in the worst case), making the process laborious and time-consuming for the execution in a single machine. Generally, the linkage process is calibrated iteratively,as a consequence of database characteristics, such as very frequent names or challenging pseudo-replicas, such as records from twins.There are several tools that perform probabilistic linkage of records. However, few eorts discuss the process parallelization, what is even more importante for real datasets. In order to reduce the execution time, we discuss parallelization strategies of the record linkage algorithm. We present and d1iscuss each step in the linkage process and how it was parallelized. We were succesful in the sense that our solution scaleswell in computing clusters. This work also discusses various parallelization issues applied to the problem and how the reference locality may be exploited towards maximizing performance withoutrequiring a large amount of resources, in particular memory. We show that the usage of a communication cache is key for the scalability of the algorithm and how one of the linkage steps, blocking, is fundamental in this work. We believe that FERAPARDA is capable of performing the linkage of various databases, from commercial to health records, enhancing the quality of the data and the services that are based on that information.