Um método de agrupamento hierárquico para resolução de ambiguidade entre nomes de autores em citações bibliográficas

Bibliographic details
Year of defense: 2008
Main author: Ricardo Goncalves Cota
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/SLBS-7NAEMD
Abstract: In this dissertation, we propose a heuristic-based hierarchical clustering (HHC) method for the name disambiguation problem in collections of bibliographic citations. The method successively fuses clusters of citations of compatible authors, based on several heuristics and on similarity measures over the components of the citations (e.g., co-authors' names, title of the work, name of the publication venue). In each phase, the information of the fused clusters is aggregated, providing more evidence for the next round of fusion. Experiments with a dataset taken from the DBLP Computer Science Bibliography collection show gains of up to 12% over a previous method that uses the same pattern matching function but does not use hierarchical clustering. Experiments also show gains of up to 21% over a supervised baseline based on SVM, and of 15.5% over an unsupervised baseline based on K-Means. Both baselines use the same evidence considered by our method, as well as privileged information about the correct number of clusters: both require that the correct number of final clusters be known a priori, which is infeasible for large collections. We also present a new tool that uses the HHC method to deal with the specific content of a digital library (DL). Finally, we present a case study in which the tool was used to disambiguate the authors' names in citations extracted from the Brazilian Digital Library of Computing (BDBComp). The quality of the groups generated in this study suggests that the tool can be used in digital libraries to help maintain the consistency of their citations. For example, all appearances of an author name can be displayed in a single format, no matter how they appear in the original metadata.
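The successive fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not specify the heuristics, so the name-compatibility test (same last name and first initial), the two rounds (shared co-author, then Jaccard similarity over title and venue words), the threshold of 0.3, and the citation field names are all assumptions chosen for illustration. What the sketch does try to show is the aggregation step: a fused cluster carries the union of its members' evidence into later comparisons.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    names: set        # author name variants seen in this cluster
    coauthors: set    # union of co-author names (aggregated evidence)
    words: set        # union of title and venue words (aggregated evidence)
    citations: list = field(default_factory=list)


def names_compatible(a: str, b: str) -> bool:
    """Hypothetical compatibility test: same last name, same first initial."""
    pa, pb = a.lower().split(), b.lower().split()
    return pa[-1] == pb[-1] and pa[0][0] == pb[0][0]


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def compatible(c1: Cluster, c2: Cluster) -> bool:
    """Two clusters are compatible if all their name variants are."""
    return all(names_compatible(a, b) for a in c1.names for b in c2.names)


def fuse(c1: Cluster, c2: Cluster) -> Cluster:
    """Merge two clusters, aggregating their evidence for later rounds."""
    return Cluster(c1.names | c2.names, c1.coauthors | c2.coauthors,
                   c1.words | c2.words, c1.citations + c2.citations)


def fusion_round(clusters, should_fuse):
    """Fuse compatible pairs that satisfy `should_fuse` until none remain."""
    clusters = list(clusters)
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if compatible(a, b) and should_fuse(a, b):
                    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    clusters = rest + [fuse(a, b)]
                    changed = True
                    break
            if changed:
                break  # restart the scan over the updated cluster list
    return clusters


def hhc_sketch(citations, threshold=0.3):
    """Start with one cluster per citation, then run two assumed rounds."""
    clusters = [
        Cluster({c["author"]}, set(c["coauthors"]),
                set((c["title"] + " " + c["venue"]).lower().split()), [c])
        for c in citations
    ]
    # Round 1 (assumed heuristic): fuse clusters that share a co-author.
    clusters = fusion_round(clusters, lambda a, b: bool(a.coauthors & b.coauthors))
    # Round 2 (assumed heuristic): fuse clusters with similar title/venue words.
    clusters = fusion_round(
        clusters, lambda a, b: jaccard(a.words, b.words) >= threshold)
    return clusters


# Made-up citations, for illustration only.
example_citations = [
    {"author": "R. Cota", "coauthors": ["A. Silva"],
     "title": "name disambiguation in citations", "venue": "JCDL"},
    {"author": "Ricardo Cota", "coauthors": ["A. Silva", "M. Lima"],
     "title": "clustering digital libraries", "venue": "SBBD"},
    {"author": "R. Cota", "coauthors": ["M. Lima"],
     "title": "protein folding dynamics", "venue": "Bioinformatics"},
    {"author": "R. Cota", "coauthors": ["X. Wu"],
     "title": "quantum optics", "venue": "Physics Letters"},
]
```

With these made-up records, `hhc_sketch(example_citations)` ends with two clusters. The first and third citations share no co-author directly, but they are fused once the second citation's co-authors have been aggregated into the first cluster. Note also that, unlike the K-Means and SVM baselines mentioned in the abstract, nothing here requires the final number of clusters as input.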