Uma proposta de metodologia para escolha automática de descritores utilizando sintagmas nominais
Year of defense: | 2005 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | por |
Defense institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/RRSA-6GGGUF |
Abstract: | Because manual indexing has proved impracticable in some document-processing contexts, researchers seek alternatives for representing document subjects automatically. The most common approaches try to determine a document's subject by analyzing word frequencies. In search of a better indexing process, one that analyzes words and expressions within their linguistic contexts, three assumptions are made: (1) using noun phrases as descriptors is better than using keywords; (2) extracting noun phrases from digitized textual documents is possible and viable with the available software tools; and (3) it is possible to establish an automated, functional process for choosing good descriptors for documents using noun phrases. The aim of this research was to develop a methodology for indexing digitized documents by extracting their noun phrases and analyzing characteristics such as: (1) the frequency of occurrence of the noun phrase in the text of the document; (2) its frequency of occurrence in the whole set of documents; (3) the structure of the noun phrase; (4) the level of the noun phrase; and (5) the occurrence of the noun phrase in a thesaurus of the subject field. To reach this goal, two corpora were analyzed: (a) a corpus of 15 documents from which the noun phrases were extracted manually, to test the automatic extraction; and (b) a corpus of 60 documents from the field of information science. The proposed methodology was first applied to part of the corpus for validation and calibration, and then applied again, with some changes, to the whole corpus. The results showed that the descriptors assigned to the documents were highly adequate, leading to the conclusion that the methodology is unequivocally successful under the conditions studied. |
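
The abstract lists the characteristics used to choose descriptors but not how they are combined. The sketch below is only an illustration of the general idea, not the method of the thesis: it assumes the noun phrases have already been extracted, and it combines three of the five characteristics (frequency in the document, frequency in the document set, and presence in a thesaurus). The function name `rank_descriptors`, the smoothed inverse-document-frequency formula, and the `thesaurus_bonus` weight are all hypothetical choices; the structure and level of the noun phrase are left out because the abstract does not say how they are scored.

```python
import math
from collections import Counter


def rank_descriptors(doc_phrases, corpus_phrases, thesaurus,
                     top_k=3, thesaurus_bonus=1.5):
    """Rank the noun phrases of one document as candidate descriptors.

    doc_phrases    -- noun phrases already extracted from the document
    corpus_phrases -- one list of noun phrases per document in the corpus
    thesaurus      -- set of terms from a subject-field thesaurus
    The scoring formula and weights are illustrative assumptions.
    """
    # (1) Frequency of each noun phrase inside the document.
    tf = Counter(doc_phrases)

    # (2) Number of corpus documents that contain each noun phrase.
    df = Counter()
    for phrases in corpus_phrases:
        df.update(set(phrases))
    n_docs = len(corpus_phrases)

    scores = {}
    for phrase, count in tf.items():
        # Smoothed inverse document frequency: phrases that occur in
        # many documents discriminate less and get a lower weight.
        idf = math.log((1 + n_docs) / (1 + df[phrase])) + 1.0
        score = count * idf
        # (5) Hypothetical bonus for phrases found in the thesaurus.
        if phrase in thesaurus:
            score *= thesaurus_bonus
        scores[phrase] = score

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


# Toy data, invented for illustration only.
doc = ["automatic indexing", "noun phrase", "noun phrase", "document"]
corpus = [doc,
          ["document", "information retrieval"],
          ["document", "thesaurus"]]
terms = {"automatic indexing", "thesaurus"}

print(rank_descriptors(doc, corpus, terms, top_k=2))
# prints ['noun phrase', 'automatic indexing']
```

In this toy run, "document" ranks last because it appears in every document of the corpus, while "automatic indexing" is lifted by its presence in the thesaurus. A fuller version would add terms for the structure and level of each noun phrase and tune the weights on a calibration subset, as the abstract describes.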