Detecção de linguagem tóxica aplicada a textos em português

Detalhes bibliográficos
Ano de defesa: 2023
Autor(a) principal: Trajano, Douglas de Oliveira lattes
Orientador(a): Bordini, Rafael Heitor lattes
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: por
Instituição de defesa: Pontifícia Universidade Católica do Rio Grande do Sul
Programa de Pós-Graduação: Programa de Pós-Graduação em Ciência da Computação
Departamento: Escola Politécnica
País: Brasil
Palavras-chave em Português:
Palavras-chave em Inglês:
Área do conhecimento CNPq:
Link de acesso: https://tede2.pucrs.br/tede2/handle/tede/10782
Resumo: The advent of social media has transformed the way in which individuals and communities interact and communicate. However, the messages on social media may contain expressions of opinion, and support messages, but they can also hate speech. The proliferation of hate speech in the digital sphere has become an increasingly pressing issue, with polarized opinions and a sense of anonymity and impunity among users often serving as contributing factors. The haters, users who spread hate speech, can be found in a variety of topics, including political discussions, entertainment, gaming, and corporate environments. The Natural Language Processing (NLP) area can contribute with tools to ensure healthy communication and protect users’ rights online. NLP applications are efficient, standardized, and automated, eliminating the need for manual moderation of such content. In this study, we used advanced machine learning and deep learning techniques to develop a toxic language detection system in Portuguese messages. The dataset used for training the models consists of 6,354 (with the possibility of extending to 13,538) comments manually annotated by experts. This dataset, made available as part of the work, has annotations for 5 NLP tasks, using a hierarchical annotation scheme with different levels of granularity. The results of the experiments demonstrate the usefulness of this dataset for the development of NLP systems aimed at detecting toxic language in texts in Portuguese.