Detalhes bibliográficos
Ano de defesa: |
2024 |
Autor(a) principal: |
Garcia, Eduardo Augusto Santos
 |
Orientador(a): |
Silva, Nádia Félix Felipe da
 |
Banca de defesa: |
Silva, Nádia Félix Felipe da,
Lima, Eliomar Araújo de,
Soares, Anderson da Silva,
Placca, José Avelino |
Tipo de documento: |
Dissertação
|
Tipo de acesso: |
Acesso aberto |
Idioma: |
por |
Instituição de defesa: |
Universidade Federal de Goiás
|
Programa de Pós-Graduação: |
Programa de Pós-graduação em Ciência da Computação (INF)
|
Departamento: |
Instituto de Informática - INF (RMG)
|
País: |
Brasil
|
Palavras-chave em Português: |
|
Palavras-chave em Inglês: |
|
Área do conhecimento CNPq: |
|
Link de acesso: |
http://repositorio.bc.ufg.br/tede/handle/tede/13781
|
Resumo: |
This research investigates the application of Natural Language Processing (NLP) within the legal domain for the Portuguese language, emphasizing the importance of domain adaptation for pre-trained language models, such as RoBERTa, using specialized legal corpora. We compiled and pre-processed a Portuguese legal corpus, named LegalPT, addressing the challenges of high near-duplicate document rates in legal corpora and conducting a comparison with generic web-scraped corpora. Experiments with these corpora revealed that pre-training on a combined dataset of legal and general data resulted in a more effective model for legal tasks. Our model, called RoBERTaLexPT, outperformed larger models trained solely on generic corpora, such as BERTimbau and Albertina-PT-*, and other legal models from similar works. For evaluating the performance of these models, we propose in this Master’s dissertation a legal benchmark composed of several datasets, including LeNER-Br, RRI, FGV, UlyssesNER-Br, CEIAEntidades, and CEIA-Frases. This study contributes to the improvement of NLP solutions in the Brazilian legal context by openly providing enhanced models, a specialized corpus, and a rigorous benchmark suite. |