Proposta de uma base de citações da literatura científica por meio da extração automática de dados do SciELO

Bibliographic details
Year of defense: 2013
Main author: Max Cirino de Mattos
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/ECIC-9CPH3G
Abstract: Several authors emphasize the importance of creating a citation index - such as the Science Citation Index (SCI) - as an instrument for the production of national science policies and therefore for the promotion of local scientific development in less developed countries. The automatic retrieval of article metadata and cited references available in eXtensible Markup Language (XML) files to create this kind of index - using the Scientific Electronic Library Online (SciELO) as a primary source - represents an important initial step toward creating a Web of Science for Latin America and the Caribbean. The methodology is based on the automatic generation of such citations, and this research analyzes the results found in its initial stages - identification of journals; obtaining the annual statistical data (source data) for each journal; identification of the areas of knowledge for each journal; and creation of the database module "Registration Data" - and in its three final stages: identification and storage of the XML files available in SciELO; interpretation of these files to extract metadata and information about each cited reference; and storage of all information from each XML file in the database module "Citation Index". The initial test of the prototype was performed with the journal "Perspectives in Information Science" (PIS), covering 24 issues, 300 articles, 7,714 citations, 579 abstracts, 587 titles, 2,358 keywords, 686 authors of articles and 10,394 authors identified in citations. The validation of the prototype was performed with the Public Health Collection, resulting in 14 journals, 14 publishers, 1,335 issues, 23,780 articles, 491,739 citations, 37,124 abstracts, 44,696 titles, 149,874 keywords, 73,859 authors of articles and 1,240,734 authors identified in citations. No disambiguation procedures were applied to the names of authors or sources.
The differences between the values provided by the SciELO source data for each journal and the numbers collected from the interpretation of the XML files are explained, and some solutions are proposed. The high success rate in identifying metadata and citations from XML files demonstrated the effectiveness of the prototype. Among the problems identified, a notable one was the difference between the source data for the same ISSN in different collections. More details about how SciELO calculates the number of issues, articles and citations need to be investigated in order to analyze the differences found. The citation index generated for PIS is intended to be made available on the journal's website. Another research study is underway that seeks to obtain all the XML files from the listed SciELO collections in order to construct a citation index for Latin America, the Caribbean and other SciELO collections.
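
The interpretation stage described in the abstract (reading an XML file, extracting article metadata and the information about each cited reference) can be illustrated with a minimal sketch. It is not the prototype built in the thesis: the element names (`front`, `contrib`, `kwd`, `ref`, `source`, `year`), the sample record, and the function names `full_name` and `extract_record` are all assumptions chosen for illustration, and real SciELO XML files follow a richer schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article record with placeholder values.
# Real SciELO XML files use a richer schema; these tags are assumptions.
SAMPLE_XML = """
<article>
  <front>
    <journal-title>Example Journal</journal-title>
    <article-title>An example article</article-title>
    <contrib><surname>Silva</surname><given-names>Ana</given-names></contrib>
    <contrib><surname>Souza</surname><given-names>Joao</given-names></contrib>
    <kwd>citation index</kwd>
    <kwd>bibliometrics</kwd>
  </front>
  <back>
    <ref id="r1">
      <name><surname>Author</surname><given-names>A</given-names></name>
      <name><surname>Author</surname><given-names>B</given-names></name>
      <source>Placeholder Journal One</source>
      <year>2001</year>
    </ref>
    <ref id="r2">
      <name><surname>Author</surname><given-names>C</given-names></name>
      <source>Placeholder Journal Two</source>
      <year>2009</year>
    </ref>
  </back>
</article>
"""


def full_name(node):
    """Format a name element as 'Surname, Given'; no disambiguation is attempted."""
    surname = node.findtext("surname", default="").strip()
    given = node.findtext("given-names", default="").strip()
    return f"{surname}, {given}".strip(", ")


def extract_record(xml_text):
    """Return (article metadata, list of cited references) from one XML file."""
    root = ET.fromstring(xml_text.strip())
    front = root.find("front")
    article = {
        "journal": front.findtext("journal-title", default=""),
        "title": front.findtext("article-title", default=""),
        "authors": [full_name(c) for c in front.iter("contrib")],
        "keywords": [k.text for k in front.iter("kwd")],
    }
    citations = []
    for ref in root.iter("ref"):
        citations.append({
            "id": ref.get("id"),
            "authors": [full_name(n) for n in ref.iter("name")],
            "source": ref.findtext("source", default=""),
            "year": ref.findtext("year", default=""),
        })
    return article, citations


article, citations = extract_record(SAMPLE_XML)

# Tallies of the same kind the abstract reports (authors, keywords, citations).
summary = {
    "article_authors": len(article["authors"]),
    "keywords": len(article["keywords"]),
    "citations": len(citations),
    "cited_authors": sum(len(c["authors"]) for c in citations),
}
print(summary)
```

Running `extract_record` over every stored file and summing the tallies would give per-journal totals that could then be compared with the source data published by SciELO, which is the kind of comparison the abstract discusses. A real pipeline would also need to handle missing elements and malformed files.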