Evolutionary approaches to problems related to data integration (original title: Abordagens evolucionárias para problemas relacionados a integração de dados)
Year of defense: | 2009 |
---|---|
Lead author: | |
Advisor: | |
Examination committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/SLSS-7XGGSW |
Abstract: | Data integration aims to combine data from different sources (data repositories such as databases and digital libraries) by adopting a global data model and by detecting and resolving schema and data conflicts, so that a homogeneous, unified view can be provided. Two specific problems related to data integration, schema matching and replica identification, present a large solution space. Exploring this space intensively and exhaustively with traditional approaches is computationally expensive and technically prohibitive. Moreover, solutions to these problems usually must satisfy multiple, sometimes conflicting, objectives simultaneously. This thesis aims to show that evolutionary techniques can be successfully applied to such problems, leading to novel approaches and methods that address all of the aforementioned requirements while providing efficient and highly accurate solutions. In this thesis, we first propose a genetic programming approach to record deduplication. This approach combines several pieces of evidence extracted from the actual data in the repositories to suggest a deduplication function that identifies whether or not two entries in a repository are replicas. As shown by our experiments, our approach outperforms existing state-of-the-art methods found in the literature. Moreover, the suggested function is computationally less demanding, since it uses fewer pieces of evidence. Finally, it is also important to note that our approach is capable of automatically adapting to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter. Building on this approach, we also devised a novel evolutionary approach that automatically finds complex schema matches. Our aim was to develop a method to find semantic relationships between schema elements in a restricted scenario in which only the data instances are available. To the best of our knowledge, this is the first approach capable of discovering complex schema matches using only the data instances; it does so by exploiting record deduplication and information retrieval techniques to find schema matches during the evolutionary process. To demonstrate the effectiveness of our approach, we conducted an experimental evaluation using real-world and synthetic datasets. Our results show that our approach finds complex matches with high accuracy, despite using only the data instances. |
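
The abstract describes a deduplication function that combines pieces of evidence and compares the result against a fixed replica identification boundary. The Python sketch below illustrates one way such a function could look; it is not the thesis's implementation. It assumes that a piece of evidence is an (attribute, similarity function) pair, and every name in it (`jaccard`, `exact`, `EVIDENCE`, `candidate_function`, `BOUNDARY`), the arithmetic expression, the boundary value, and the sample records are hypothetical. In a genetic programming setting, an expression like `candidate_function` would be evolved rather than written by hand.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact(a: str, b: str) -> float:
    """1.0 if the values match after trimming and lowercasing, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Assumption: each piece of evidence pairs an attribute with a similarity function.
EVIDENCE: Dict[str, Tuple[str, Callable[[str, str], float]]] = {
    "title_jaccard": ("title", jaccard),
    "author_jaccard": ("author", jaccard),
    "year_exact": ("year", exact),
}


def evidence_values(r1: Record, r2: Record) -> Dict[str, float]:
    """Evaluate every piece of evidence on a pair of records."""
    return {name: sim(r1[attr], r2[attr]) for name, (attr, sim) in EVIDENCE.items()}


def candidate_function(e: Dict[str, float]) -> float:
    """Hypothetical hand-written stand-in for an evolved expression."""
    return 2.0 * e["title_jaccard"] + e["author_jaccard"] + e["year_exact"]


# Hypothetical fixed replica identification boundary.
BOUNDARY = 3.0


def is_replica(r1: Record, r2: Record,
               func: Callable[[Dict[str, float]], float] = candidate_function,
               boundary: float = BOUNDARY) -> bool:
    """Flag a pair as replicas when the combined evidence reaches the boundary."""
    return func(evidence_values(r1, r2)) >= boundary


# Hypothetical sample records, for illustration only.
a = {"title": "Evolutionary approaches to data integration", "author": "A Silva", "year": "2009"}
b = {"title": "evolutionary approaches to data integration", "author": "Silva A", "year": "2009"}
c = {"title": "Query optimization in relational databases", "author": "J Smith", "year": "2001"}

print(is_replica(a, b))
print(is_replica(a, c))
```

Because the boundary is held fixed, a search procedure only has to adjust the expression itself, which is consistent with the abstract's remark that the user does not need to choose and tune this parameter.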