Natural language processing for sensitive data recognition and privacy in digital documents

Bibliographic details
Year of defense: 2024
Main author: Vieira, Samuel Antunes
Advisor: Rieder, Rafael
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade de Passo Fundo
Graduate program: Programa de Pós-Graduação em Computação Aplicada
Department: Instituto de Tecnologia – ITEC
Country: Brazil
Keywords in Portuguese:
CNPq knowledge area:
Access link: http://tede.upf.br:8080/jspui/handle/tede/2765
Abstract: Keeping confidential information in personal documents secure has always been critical to guaranteeing the privacy of people and companies. With the growing digitalization of documents and the adoption of data protection laws and regulations, this task has become even more relevant. In this context, security applications can censor critical text in digital documents. Because protecting data through censorship can require intensive manual work to locate the sensitive data and is subject to human error, automation is an option for handling the entire process. With that in mind, this work presents DOCDOM, a proof-of-concept software that integrates multiple tools for recognizing sensitive data and preserving privacy in digital documents. The approach uses optical character recognition to obtain text data from documents, applies a natural language processing model focused on named entity recognition to identify confidential data, and censors that data using library resources for digital document processing. Preliminary results showed that DOCDOM achieved reasonable evaluation metrics on two test data sets of 1000 files each (area under the precision-recall curve, AUC-PR, of 0.9266 and 0.6681). A detailed analysis identified noise problems in some files during text classification tasks, which still need to be handled through noise distinction and filtering strategies. Despite this, the proposed solution presented acceptable initial results for a proof of concept, with good precision and accuracy for files with a simple structure and sensitive non-numeric content.
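The recognize-then-censor flow described in the abstract can be illustrated with a minimal sketch. This is not DOCDOM's implementation: it assumes the OCR step has already produced plain text, and it substitutes two illustrative regular expressions for the named entity recognition model. All names here (Entity, PATTERNS, find_entities, censor, process) are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Entity:
    start: int   # index of the first character of the sensitive span
    end: int     # index one past the last character
    label: str   # entity type, e.g. "PERSON"


# Assumption: simple regexes stand in for a trained NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PERSON": re.compile(r"\b(?:Mr|Ms|Dr)\. [A-Z][a-z]+(?: [A-Z][a-z]+)*"),
}


def find_entities(text):
    """Return sensitive spans found in the text, ordered by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), label))
    return sorted(entities, key=lambda e: e.start)


def censor(text, entities, mask="█"):
    """Replace every non-space character inside each span with a mask."""
    chars = list(text)
    for entity in entities:
        for i in range(entity.start, entity.end):
            if not chars[i].isspace():
                chars[i] = mask
    return "".join(chars)


def process(ocr_text):
    """Recognize sensitive spans in OCR output, then censor them."""
    return censor(ocr_text, find_entities(ocr_text))


if __name__ == "__main__":
    sample = "Contact Dr. Ana Souza at ana.souza@example.com today."
    print(process(sample))
```

Masking character by character keeps the output the same length as the input, which mirrors covering text in place on a page; a real pipeline would map each span back to coordinates in the document before drawing over it.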