Busca por similaridade de boletins de ocorrência via embeddings: um estudo de caso (Similarity search over police reports via embeddings: a case study)

Bibliographic details
Year of defense: 2023
Main author: Araújo, José Alan Firmiano
Advisor: Not provided by the institution
Examining committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: Portuguese (por)
Degree-granting institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Access link: http://repositorio.ufc.br/handle/riufc/78192
Abstract: Many crimes occur every day, and the first step of an investigation is the police report. In cities with high crime rates, it is challenging for the police to analyze every report in detail. However, reports may resemble one another when they describe the same modus operandi. Given a police report, the main objective of this work is to find the reports most similar to it, including duplicates. A similar report may be one with overlapping words or one that shares a similar modus operandi. One possible solution is to represent each police report as a numeric feature vector and compare the vectors using a similarity function. Different methods can be used to represent the narrative, including embedding vectors and count-based approaches such as TF-IDF. This research explores pre-trained embedding representations at both the word and sentence levels, such as Universal Sentence Encoder, Word2Vec, RoBERTa, and Doc2Vec, among others. By comparing different embedding models, we determine which representation most effectively captures semantic and lexical similarities between police reports. Furthermore, we compare the effectiveness of available pre-trained embedding models with that of models trained specifically on a corpus of police reports. Another contribution of this work is the development of embedding models trained specifically for the police report domain.
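The vector-plus-similarity idea described in the abstract can be illustrated with a minimal sketch of the count-based variant, using TF-IDF weights and cosine similarity in plain Python. This is not the dissertation's implementation: the function names (tokenize, tfidf_vectors, cosine, most_similar), the smoothed IDF formula, the whitespace tokenizer, and the three example narratives are all assumptions made up for illustration. An embedding-based variant would replace tfidf_vectors with the output of a model such as Word2Vec or Universal Sentence Encoder and keep the same comparison step.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would
    # likely also strip punctuation and stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_index, docs):
    """Index of the document most similar to docs[query_index], excluding itself."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scores = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return max(scores)[1]


# Invented example narratives, not data from the dissertation.
reports = [
    "thief on motorcycle stole phone from pedestrian",
    "two men on a motorcycle stole a phone at gunpoint",
    "car window broken and radio taken from parked vehicle",
]

best = most_similar(0, reports)
print(best, reports[best])
```

In this toy corpus the first two narratives share several terms (motorcycle, stole, phone), so the second is returned as the nearest neighbour of the first. A purely lexical measure like this one cannot match reports that describe the same modus operandi in different words, which is the gap the embedding representations named in the abstract are meant to address.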