Bibliographic details
Year of defense: 2012
Main author: Lopes, Lucelene
Advisor: Vieira, Renata
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Institution of defense: Pontifícia Universidade Católica do Rio Grande do Sul
Graduate program: Programa de Pós-Graduação em Ciência da Computação
Department: Faculdade de Informática
Country: BR
Keywords in Portuguese:
CNPq knowledge area:
Access link: http://tede2.pucrs.br/tede2/handle/tede/5175
Abstract:
This thesis describes a process for extracting concepts from texts in Portuguese. The proposed process starts from linguistically annotated corpora of specific domains and generates a list of concepts for each corpus. It proposes a linguistically oriented extraction procedure based on noun phrase detection, together with a set of heuristics that improve the overall quality of concept candidate extraction. Precision and recall of the extracted term lists improve from approximately 10% to more than 60%. A new index (tf-dcf), based on contrastive corpora, is proposed to sort the concept candidate terms according to their relevance to their respective domain. The precision achieved by this new index is superior to that achieved by indices proposed in similar works. Cut-off points are proposed to identify which of the extracted concept candidate terms, sorted by relevance, will be considered concepts. A hybrid approach to choosing cut-off points delivers reasonable F-measure values and brings quality to the concept identification process. Additionally, four applications are proposed to facilitate the comprehension, handling, and visualization of extracted terms and concepts. These applications make the contributions of this thesis available to a broader community of researchers and users in the Natural Language Processing area. The proposed process is described in detail, and experiments empirically evaluate each of its steps. Besides the scientific contribution of the process itself, this thesis also delivers extracted concept lists for five different domain corpora and the prototype of a software tool (EχATOLP) implementing all steps of the proposed process.
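
As a rough illustration of what evaluating a cut-off point on a relevance-sorted term list can look like, the sketch below computes precision, recall, and F-measure when the top k terms are kept as concepts. It is not the hybrid approach proposed in the thesis, whose details the abstract does not give; the function names, the example terms, and the reference list are all hypothetical.

```python
def evaluate_cutoff(ranked_terms, reference, k):
    """Return (precision, recall, F-measure) when the top-k ranked terms are kept."""
    selected = set(ranked_terms[:k])
    hits = len(selected & reference)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def best_cutoff(ranked_terms, reference):
    """Return the k in 1..len(ranked_terms) that maximizes F-measure (first k on ties)."""
    return max(
        range(1, len(ranked_terms) + 1),
        key=lambda k: evaluate_cutoff(ranked_terms, reference, k)[2],
    )


# Hypothetical candidate terms, already sorted by some relevance index.
ranked = ["term_a", "term_b", "term_c", "term_d", "term_e"]
# Hypothetical reference list of true concepts for the domain.
gold = {"term_a", "term_c", "term_f"}

k = best_cutoff(ranked, gold)
precision, recall, f_measure = evaluate_cutoff(ranked, gold, k)
print(k, round(precision, 3), round(recall, 3), round(f_measure, 3))
```

A small k favors precision and a large k favors recall, so choosing the cut-off by F-measure is one simple way to balance the two when a reference list is available.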