Metodologia computacional para identificação de sintagmas nominais da língua portuguesa
Year of defense: | 2010 |
---|---|
Main author: | |
Advisor: | |
Examination committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defense institution: | Universidade Federal do Espírito Santo; BR; Mestrado em Informática; Centro Tecnológico; UFES; Programa de Pós-Graduação em Informática |
Graduate program: | Not reported by the institution |
Department: | Not reported by the institution |
Country: | Not reported by the institution |
Keywords in Portuguese: | |
Access link: | http://repositorio.ufes.br/handle/10/4217 |
Abstract: | In Portuguese, syntagmas are units of meaning that carry a syntactic function within a sentence [Nicola, 2008]. Generally speaking, the sentences that make up any utterance express content through their elements and through the combinations of those elements that the language allows. These elements form sets and subsets that work as syntactic units within the larger unit, the sentence. These units are the syntagmas, which can be divided into noun phrases and verb phrases. Of the two, noun phrases are of greater interest because they carry the greater semantic value. Noun phrases are used in Natural Language Processing (NLP) tasks such as coreference (anaphora) resolution, automatic ontology construction, parsers applied to medical texts to generate summaries and build vocabularies, and as an initial step in syntactic analysis. In Information Retrieval, noun phrases can serve as atomic terms in document indexing and search systems, yielding better results. This dissertation proposes a computational methodology for identifying noun phrases in digital documents written in natural language. It describes the methodology adopted to identify and extract noun phrases through the development of SISNOP (Portuguese Noun Phrase Identifying System; the acronym derives from the Portuguese name). SISNOP is a system composed of a set of modules and programs that can interpret any kind of natural-language text, using morphological and syntactic analysis, in order to retrieve noun phrases. In addition, the system obtains syntactic information, such as gender, number and degree, for the words in the extracted noun phrases. SISNOP was tested on, among other corpora, CETENFolha, with 24 million words, and CETEMPúblico, with about 180 million words of European Portuguese and widely used in studies in this field. The system recognized 98.12% and 94.59% of the sentences, identifying up to 24 million noun phrases. The SISNOP modules, EM (Morphological Tagger), ISN (Noun Phrase Identifier) and IGNG (Gender, Number and Degree), were tested individually on a smaller data set than the one above, because the results were analyzed manually. The noun phrase identifier module achieved 82.45% precision and 69.20% recall. |
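
The precision and recall figures reported for the noun phrase identifier can be read with the help of a small example. The following is a minimal sketch, assuming exact span matching against a manually annotated reference; the abstract does not state which matching criterion the dissertation used. The names `Span` and `precision_recall` and the toy data are hypothetical and do not come from SISNOP.

```python
from typing import Set, Tuple

# Hypothetical span encoding: (sentence index, first token, end token exclusive).
Span = Tuple[int, int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over noun phrase spans.

    Assumption: a predicted noun phrase counts as correct only if its
    boundaries match a reference noun phrase exactly.
    """
    true_positives = len(predicted & gold)
    # Precision: share of predicted noun phrases that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference noun phrases that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data for illustration only, not taken from the dissertation.
gold = {(0, 0, 2), (0, 3, 5), (1, 0, 1), (1, 2, 6)}
predicted = {(0, 0, 2), (0, 3, 4), (1, 0, 1)}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2%} recall={recall:.2%}")
```

Under this reading, a precision higher than recall, as in the reported 82.45% and 69.20%, means that most of the noun phrases the module returns are correct, while a larger share of the reference noun phrases go undetected.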