A rule-based morphosyntactic disambiguator for Nheengatu

Bibliographic details
Year of defense: 2023
Main author: Gurgel, Juliana Lopes
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: por
Degree-granting institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.ufc.br/handle/riufc/75294
Abstract: In this study, we describe the implementation of a disambiguation module for Nheengatagger (ALENCAR, 2020), a part-of-speech tagger for Nheengatu. This indigenous language, also known as the Amazonian Lingua Franca, has an estimated 14,000 speakers and is currently at risk of extinction (EBERHARD; SIMONS; FENNIG, 2023). One risk factor for the extinction of minority languages is the lack of tools for processing them computationally. Nheengatagger is one of the few initiatives in automatic text processing for Nheengatu. Although it correctly labels most words in texts that follow the orthography adopted by Navarro (2016), the tagger still needs to resolve ambiguities, that is, to assign the correct label to words that can belong to more than one part of speech. This work therefore aims to implement a rule-based disambiguation module. We divided our methodology into two main stages: the compilation of Nheengatu texts from the works of Navarro (2016), Navarro and Ávila (2017), Casasnovas (2006) and Trevisan (2017), and the implementation of the module. The compilation resulted in a corpus of 4176 sentences. The implementation comprised four stages: (i) the identification of ambiguities; (ii) the analysis of the contexts in which the parts of speech occur in Nheengatu; (iii) the implementation of the algorithm; and (iv) the evaluation. In stage (i), we identified 55 types of ambiguity, with a total of 1047 occurrences in the development corpus (NAVARRO, 2016). In stage (iv), the disambiguator achieved accuracies of 52% and 74% in the two preliminary tests, carried out with a set of 50 sentences, a result below the state of the art for this type of tool, which is 95%. Based on the results of the preliminary tests, we decided to evaluate the performance of the tool using contexts extracted from sentences with and without ambiguities. In the three tests carried out after adjustments to the tool, we obtained accuracies of 80.9%, 60% and 57.5%, respectively, a result still below the state of the art. On the other hand, the disambiguator significantly increased Nheengatagger's hit rate: after the integration of the module, the POS tagger achieved rates of 88.9%, 95.4% and 96.2% in the three tests, respectively.
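
To make the idea of rule-based disambiguation concrete, the following is a minimal sketch in Python. It is not the module described in the abstract: the tag names (DET, N, V, ADV), the two rules, the placeholder tokens and the functions `disambiguate` and `accuracy` are all assumptions chosen for illustration. It only shows the general technique of narrowing a word's candidate tags from the tags of its neighbours and then scoring the result against a gold annotation.

```python
from typing import Callable, List, Set, Tuple

# A token is a word form plus the set of candidate tags a tagger proposed.
Token = Tuple[str, Set[str]]
# A rule sees the previous token's tags, the current candidates and the
# next token's tags, and returns a (possibly narrowed) candidate set.
Rule = Callable[[Set[str], Set[str], Set[str]], Set[str]]


def noun_after_determiner(prev: Set[str], cands: Set[str], nxt: Set[str]) -> Set[str]:
    """Hypothetical rule: after an unambiguous DET, prefer N."""
    if prev == {"DET"} and "N" in cands:
        return {"N"}
    return cands


def verb_after_noun(prev: Set[str], cands: Set[str], nxt: Set[str]) -> Set[str]:
    """Hypothetical rule: after an unambiguous N, prefer V."""
    if prev == {"N"} and "V" in cands:
        return {"V"}
    return cands


def disambiguate(sentence: List[Token], rules: List[Rule]) -> List[Token]:
    """Apply the rules left to right, reusing already resolved left context."""
    resolved: List[Token] = []
    for i, (word, cands) in enumerate(sentence):
        prev = resolved[i - 1][1] if i > 0 else set()
        nxt = sentence[i + 1][1] if i + 1 < len(sentence) else set()
        current = set(cands)
        for rule in rules:
            if len(current) <= 1:  # nothing left to disambiguate
                break
            narrowed = rule(prev, current, nxt)
            # Accept only non-empty results that stay within the candidates.
            if narrowed and narrowed <= current:
                current = set(narrowed)
        resolved.append((word, current))
    return resolved


def accuracy(resolved: List[Token], gold: List[str]) -> float:
    """Share of tokens resolved to exactly the single gold tag."""
    correct = sum(1 for (_, tags), g in zip(resolved, gold) if tags == {g})
    return correct / len(gold)


# Placeholder tokens (not real Nheengatu words) with assumed candidate tags.
SENTENCE: List[Token] = [("w1", {"DET"}), ("w2", {"N", "V"}), ("w3", {"V", "ADV"})]
RULES: List[Rule] = [noun_after_determiner, verb_after_noun]

if __name__ == "__main__":
    result = disambiguate(SENTENCE, RULES)
    print(result)
    print(accuracy(result, ["DET", "N", "V"]))
```

In this sketch a token that no rule can narrow keeps all of its candidate tags and is counted as an error by `accuracy`, which is one simple way of making unresolved ambiguities visible in an evaluation.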