Segmentation of Legal Documents Using Weak Supervision (Segmentação de Documentos Jurídicos usando Supervisão Fraca)

Bibliographic details
Year of defense: 2023
Lead author: Marlon Daltro Tosta
Advisor: Eraldo Luis Rezende Fernandes
Defense committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Fundação Universidade Federal de Mato Grosso do Sul
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Brazil
Keywords in Portuguese:
Access link: https://repositorio.ufms.br/handle/123456789/5852
Abstract: Millions of cases are currently being processed in the Brazilian judicial system. Court decisions known as acórdãos are collective decisions made by Brazilian courts and are highly relevant to ensuring a unified understanding among judges and across courts. It is therefore necessary to develop and implement effective technological solutions that help judges, appellate judges, and other professionals involved in the judicial process cope with the growing volume of court cases in Brazil. These solutions should speed up decision making and reduce workload, ensuring the efficiency of the judicial system and the satisfaction of the citizens who depend on it. The judgments of Brazilian courts are publicly available. However, these documents are not published in a structured format, which makes their automatic processing challenging. This work collected over 960,000 acórdãos in PDF format from five Brazilian courts and used available tools to extract textual content and layout characteristics from 624,161 of them. An automatic annotation method was used to segment the documents into the five mandatory segments of an acórdão. A total of 500 documents were manually annotated and used as validation and test sets for machine learning models trained on the weakly annotated data. Both classic and deep learning-based models were evaluated, with deep learning models outperforming traditional algorithms. Models that used both textual content and layout information achieved even better results. Models trained and tested on the same court tend to perform comparably to, or even better than, the automatic annotation method, while the performance of models trained on one court and tested on another depends on the correlation between the courts.
Models trained on judgments from four courts and validated on a fifth achieved even better performance, with an average F1 above 90% in the best models. General segmentation models tended to improve as the variety of layouts in the training data increased, suggesting that expanding the variety of courts in the training data can lead to satisfactory practical performance. This work also makes several resources available for future research. All collected documents in PDF format, as well as the corresponding TSV and JSON files with automatic annotations, are freely available. The automatic segmentation scripts are also available, as are the scripts used for model training and evaluation. Finally, the manually reviewed annotations of 500 documents (100 from each court) are also available.