Automatic selection of training examples for a record deduplication method based on genetic programming
Year of defense: | 2010 |
---|---|
Lead author: | |
Advisor: | |
Defense committee: | |
Document type: | Master's thesis |
Access type: | Open access |
Language: | por |
Defending institution: |
Universidade Federal de Minas Gerais
UFMG |
Graduate program: |
Not provided by the institution
|
Department: |
Not provided by the institution
|
Country: |
Not provided by the institution
|
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/BUBD-9JWQAQ |
Abstract: | The increasing volume of information available in digital media is becoming a challenge for administrators of large data repositories, such as digital libraries and the databases of large corporations. Today, one can say that an organization's capacity to provide useful services to its users is proportional to the quality of the data it uses. Thus, companies and government institutions are investing heavily in developing efficient methods to identify and remove duplicates in large data repositories. Because record deduplication demands a lot of time and processing power, the proposed methods should obtain good results as efficiently as possible. Recently, machine learning techniques have been used to deal with the record deduplication problem. However, these techniques require examples, usually generated manually, for the training phase in which duplication patterns are learned from existing data, which may restrict their use because of the cost of creating the training set. This MSc thesis proposes an approach that uses a deterministic technique to automatically suggest training examples for a record deduplication method based on genetic programming (GP). Experiments using synthetic data show that reduced training sets can be used to generate deduplication functions faster without significantly reducing the quality of the solutions generated, even in data repositories that are highly difficult to deduplicate. In addition, a factorial design was performed to measure how difficult data repositories are to deduplicate, identifying the characteristics that may affect the use of our approach to selecting training examples for the GP-based record deduplication method. |