The behavior of Information Science terms through topic modeling

Bibliographic details
Year of defense: 2020
Main author: Marcos de Souza
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Doctoral thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais
Brazil
ECI - ESCOLA DE CIENCIA DA INFORMAÇÃO
Programa de Pós-Graduação em Gestão e Organização do Conhecimento
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/34292
https://orcid.org/0000-0002-9829-7249
Abstract: The growth of academic research, science, and technology has led to a large volume of scientific information, produced in a variety of formats and types of scientific communication documents. Given the volume, variety, and complexity of this information, it has become increasingly necessary to use technologies and methods for creating information records, and to produce information about information. Topic modeling, which consists of statistical and probabilistic methods and technological resources, uses learning-algorithm models that make it possible to identify patterns, organize collections, summarize content, extract the most frequent topics, and identify relationships between subjects and changes over time in document corpora. On this basis, the research question is: how have the themes of Brazilian scientific production in Information Science presented themselves in the second decade of the 21st century, compared with the areas and disciplines that researchers have already established in the literature as the core of the field? The general objective was to verify how close or distant the themes extracted from corpora of scientific documents are to the areas and disciplines of Information Science established in the literature. The specific objectives were to identify, analyze, and discuss the diachronic behavior of the terms extracted from the corpora and their relationships; to analyze and discuss the trained models used for topic extraction; to select the significant results; and to validate them with the Brazilian Information Science community.
This research is justified because comparing studies, even those that use different methodologies and time intervals to compose their document sets, makes it possible to present new results through scientific mapping and to explore different scenarios and perspectives for the science under study. The empirical research followed these steps: data collection and formation of the corpora; preparation and pre-processing, covering data cleaning, manipulation, combination, and normalization; data transformation, involving mathematical operations and applied statistics; modeling and processing, in which the treated data were fed to the Latent Semantic Indexing and Latent Dirichlet Allocation models; presentation of the results through textual synthesis, interactive graphics, and statistics; validation of the results with researchers in the field; and documentation relating the empirical results to the theoretical framework. The main results include: partially different behavior between the scientific mapping of the core disciplines of Information Science found in the literature and the empirical results of this research; the diachronic behavior and emergence of terms such as fake news, big data, and machine learning in Information Science research; proximity and distance between disciplines such as Information Systems and Electronic Scientific Communication; and better topic modeling results with the Latent Dirichlet Allocation model, considering the balance between the weights of the results and the larger number of bigrams and trigrams, which contribute to a better interpretation of the data by the indexer, as validated by the scientific community.