Expansão do MorphoBr através da modelagem computacional de processos de formação de palavras em português [Expanding MorphoBr through the computational modeling of word-formation processes in Portuguese]

Bibliographic details
Defense year: 2019
Main author: Silva, Hélio Leonam Barroso
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: por
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://www.repositorio.ufc.br/handle/riufc/47136
Abstract: In this work, we computationally modeled four word-formation processes (WFPs) of Portuguese using finite-state morphology in order to generate new lexical entries automatically, contributing to the efforts of the research group Computação e Linguagem Natural to develop Natural Language Processing (NLP) resources for Portuguese. Among the numerous challenges an NLP system faces when processing natural-language text, many concern the lexicon, which interacts directly and indirectly with all other levels of the system. Having well-structured and comprehensive lexical resources at hand decisively influences the efficiency of an NLP system. The best resource we are aware of is MorphoBr (ALENCAR; RADEMAKER; CUCONATO, 2018), built by combining, revising and expanding freely available analogous resources for Portuguese, derived mostly from Label-Lex (ELEUTÉRIO et al., 1995) and Unitex-PB (MUNIZ, 2004). That expansion was carried out by automatically generating, for instance, diminutive forms of adjectives and nouns and missing inflected forms of verbs. We propose generating lexical entries automatically by using existing entries as base forms for suffixation-based word-formation processes, and feeding the results back into MorphoBr's data set. The four WFPs selected correspond to the suffixes -vel, -idade, -izar and -mente. Because the morphosyntactic classes of the base and of the product alone are not enough to ascertain a word's well-formedness, we take into account the restrictions on each selected WFP documented in Alves (2004), Basilio (1980, 1987, 1990, 2017), Cavalcante (1996), Maroneze (2005, 2011), Rocha (2008) and Villalva & Silvestre (2014). In relative terms, the number of lemmas increased by 8.5% with -vel, 9.5% with -idade, 9.8% with -izar and 12.9% with -mente.
In absolute terms, we generated 45,564 adjective forms, 16,962 adverbs, 24,978 noun forms and 833,560 verb forms. As a first step towards modeling the WFPs related to the suffixes -ção and -mento, we analyzed the competition between these two suffixes for first-conjugation verb bases.
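
To make the idea of deriving new entries from existing base forms concrete, the following is a minimal sketch of the four suffixation processes named in the abstract. It is not the thesis implementation: the work uses finite-state morphology, whereas this sketch substitutes plain Python string rewriting. The function names and the rules themselves are illustrative assumptions, and the sketch deliberately ignores the restrictions on each process that the abstract says are needed to guarantee well-formed output.

```python
# Toy suffixation rules for Portuguese (illustrative assumptions only).
# Plain string rewriting stands in for the finite-state transducers
# used in the thesis; no well-formedness restrictions are checked.

# Acute and circumflex accents are dropped when stress shifts to the suffix.
# The cedilla and the tilde are intentionally left untouched.
DROP_STRESS_ACCENT = str.maketrans("áéíóúâêô", "aeiouaeo")


def derive_vel(verb):
    """Verb infinitive -> adjective in -vel (hypothetical rule)."""
    if verb.endswith("ar"):
        return verb[:-2] + "ável"
    if verb.endswith(("er", "ir")):
        return verb[:-2] + "ível"
    return None


def derive_idade(adjective):
    """Adjective in -vel -> noun in -bilidade (hypothetical rule)."""
    if adjective.endswith("vel"):
        return adjective[:-3].translate(DROP_STRESS_ACCENT) + "bilidade"
    return None


def derive_mente(adjective):
    """Adjective -> adverb in -mente, built on the feminine form."""
    base = adjective[:-1] + "a" if adjective.endswith("o") else adjective
    return base.translate(DROP_STRESS_ACCENT) + "mente"


def derive_izar(base):
    """Adjective or noun -> verb in -izar (hypothetical rule)."""
    stem = base[:-1] if base.endswith(("o", "a", "e")) else base
    return stem.translate(DROP_STRESS_ACCENT) + "izar"


if __name__ == "__main__":
    for label, word in [
        ("-vel", derive_vel("amar")),
        ("-idade", derive_idade("amável")),
        ("-mente", derive_mente("rápido")),
        ("-izar", derive_izar("legal")),
    ]:
        print(label, word)
```

A rule set this naive overgenerates: for example, it would attach -vel to any infinitive regardless of the verb's properties. That gap is what the documented restrictions on each process are meant to close.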