Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes

Bibliographic Details
Main Author: Paula Cristina Teixeira Fortuna
Publication Date: 2017
Format: Master thesis
Language: por
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: https://hdl.handle.net/10216/106028
Summary: People increasingly use social networks to communicate their opinions and to share information and experiences. On social networks, people can feel deindividuated and may engage more frequently in aggressive communication. In this context, it is important that governments and social network platforms have tools to detect hate speech, because it is harmful to its targets. In this work we investigate the problem of detecting hate speech online. Our first goal is to provide a complete overview of the topic. Describing the state of the art in hate speech research is not simple, however, because the topic is studied in different fields, such as text mining, the social sciences, and law. Our literature review focuses on the perspective of computer science and engineering, which distinguishes it from the other works we found. We adopted an exhaustive and methodical procedure, which we call a Systematic Literature Review. We concluded that the majority of studies tackle the problem as a machine learning classification task, using either general text mining features (e.g., n-grams, word2vec) or hate-speech-specific features (e.g., othering discourse). In most studies new datasets are collected, but these remain private, which makes it more difficult to compare results across studies. We also concluded that the field is still at an early stage, with several open research opportunities. As we found no research on the topic in Portuguese, the second goal of this work was to annotate a dataset for this language. For the annotation, we built a classification scheme with a hierarchical structure. This is an innovative way of approaching automatic hate speech classification; its main advantage is that it better captures nuances in hate speech concepts. We collected a dataset of 5,668 messages from 1,156 distinct users, annotated not only for hate speech but also for 83 additional subtypes of hate. Finally, we also try to show that the hierarchical class structure improves the performance of classification models, since it is better suited to considering the different subtypes of hate speech and the intersections between those classes.
id RCAP_e4e7419e216be05cee0e2c6ba1c2e24b
oai_identifier_str oai:repositorio-aberto.up.pt:10216/106028
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
author Paula Cristina Teixeira Fortuna
author_facet Paula Cristina Teixeira Fortuna
author_role author
dc.contributor.author.fl_str_mv Paula Cristina Teixeira Fortuna
dc.subject.por.fl_str_mv Engenharia electrotécnica, electrónica e informática
Electrical engineering, Electronic engineering, Information engineering
publishDate 2017
dc.date.none.fl_str_mv 2017-07-07
2017-07-07T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/10216/106028
TID:201801990
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt