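Health information processing to meet privacy needs: textual de-identification of clinical documents in Brazilian Portuguese (original title: Tratamento da informação de saúde para atendimento à necessidade de privacidade: desidentificação textual de documentos clínicos na língua portuguesa do Brasil)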

Bibliographic details
Year of defense: 2022
Main author: Guilherme Francis de Noronha
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais
Brazil
ECI - ESCOLA DE CIENCIA DA INFORMAÇÃO
Programa de Pós-Graduação em Gestão e Organização do Conhecimento
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/49136
Abstract:
Introduction: privacy protection has become increasingly relevant. Initiatives such as the General Data Protection Regulation (GDPR) have emerged worldwide in an attempt to protect individual privacy and prevent the misuse of personal data. Data protection is essential in the digital context, where data leaks cannot be reversed. In health care, the adoption of electronic health records has led to the digitization of the sensitive data of millions of people. One way to protect these data is de-identification, which safeguards individual privacy. Beyond protecting data, de-identification also allows clinical documents to be shared, enabling knowledge acquisition through research and data analysis.
Problem: clinical documents contain numerous text fields that may hold sensitive data requiring protection. Manual de-identification in health care is costly because of the volume of data created every day across many health facilities. An alternative is automatic de-identification using machine learning and natural language processing techniques. However, these algorithms should be trained on the local language in which they will be validated. A preliminary search did not identify any de-identification studies for Brazilian Portuguese with available data. This revealed an opportunity to advance the study of de-identification for Brazilian Portuguese by developing research on privacy protection in clinical documents.
Methodology: to address the problem, this thesis built a methodology for the automatic de-identification of data in clinical documents using natural language processing and machine learning algorithms. To this end, a partnership was established with the Hospital das Clínicas de Minas Gerais to obtain the clinical documents. These documents were preprocessed and used to develop a de-identification algorithm adapted to Brazilian Portuguese.
Results: the de-identification algorithm obtained a macro F-score of 97.94% and a micro F-score of 39.83%. Only 37.09% of the data was correctly de-identified. The results were therefore insufficient for generalization. The contribution of this thesis is nevertheless the proposed methodology for de-identifying clinical documents. This methodology can be applied to any field beyond health that has privacy protection needs. In addition, the source code developed for the methodology and the trained model are publicly available and can be used by anyone.
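
Editorial note on the metrics: the macro and micro F-scores reported above differ because they aggregate per-category results in different ways. The following is a minimal sketch, not the thesis's evaluation code. The category names (NAME, DATE, PHONE) and all counts are hypothetical and chosen only for illustration, since the abstract does not list the label set. It shows one way a macro F-score can sit well above a micro F-score: a frequent category is recognized poorly while rarer categories are recognized well.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Unweighted mean of the per-category F1 scores."""
    return sum(f1(*c) for c in per_class.values()) / len(per_class)


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """F1 computed once over the counts pooled across all categories."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


# Hypothetical counts: one frequent, poorly recognized category
# and two rare, well recognized ones.
counts: Dict[str, Counts] = {
    "NAME": (10, 90, 90),
    "DATE": (9, 1, 1),
    "PHONE": (9, 1, 1),
}

print(f"macro F1 = {macro_f1(counts):.4f}")
print(f"micro F1 = {micro_f1(counts):.4f}")
```

With these made-up counts, the macro score gives each category equal weight, while the micro score is dominated by the frequent category. Whether this pattern explains the thesis's figures cannot be determined from the abstract alone.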