Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes
| Main Author: | Paula Cristina Teixeira Fortuna |
|---|---|
| Publication Date: | 2017 |
| Format: | Master thesis |
| Language: | por |
| Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Download full: | https://hdl.handle.net/10216/106028 |
| Summary: | People increasingly use social networks to communicate their opinions and to share information and experiences. On social networks people feel deindividualized and may engage more frequently in aggressive communication. In this context, it is important that governments and social network platforms have tools to detect hate speech, because it is harmful to its targets. In this work we investigate the problem of detecting hate speech online. Our first goal is to provide a complete overview of the topic. Describing the state of the art in hate speech research is not simple, however, because the topic is studied in different areas, such as text mining, social sciences, and law. Our literature review focuses on the perspective of computer science and engineering, which distinguishes it from the other works we found. We adopted an exhaustive and methodical procedure, which we call a Systematic Literature Review. We concluded that the majority of studies tackle the problem as a machine learning classification task, using either general text mining features (e.g., n-grams, word2vec) or hate-speech-specific features (e.g., othering discourse). In the majority of studies new datasets are collected, but these remain private, which makes it more difficult to compare results across studies. We also concluded that the field is still at an early stage, with several open research opportunities. As we found no research on the topic in Portuguese, the second goal of this work was to annotate a dataset for this language. For the dataset annotation, we built a classification scheme with a hierarchical structure. This is an innovative way of approaching the automatic classification of hate speech. Its main advantage is that it better captures nuances in hate speech concepts. We collected a dataset of 5,668 messages from 1,156 distinct users, annotated not only for hate speech but also for 83 further subtypes of hate. Finally, we also try to show that the hierarchical class structure improves the performance of classification models, since it is better suited to considering the different subtypes of hate speech and the intersections between those classes. |
| id | RCAP_e4e7419e216be05cee0e2c6ba1c2e24b |
|---|---|
| oai_identifier_str | oai:repositorio-aberto.up.pt:10216/106028 |
| network_acronym_str | RCAP |
| network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str | https://opendoar.ac.uk/repository/7160 |
| dc.title.none.fl_str_mv | Automatic detection of hate speech in text: an overview of the topic and dataset annotation with hierarchical classes |
| author | Paula Cristina Teixeira Fortuna |
| author_role | author |
| dc.subject.por.fl_str_mv | Engenharia electrotécnica, electrónica e informática; Electrical engineering, Electronic engineering, Information engineering |
| publishDate | 2017 |
| dc.date.none.fl_str_mv | 2017-07-07 2017-07-07T00:00:00Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
| format | masterThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | https://hdl.handle.net/10216/106028 TID:201801990 |
| url | https://hdl.handle.net/10216/106028 |
| identifier_str_mv | TID:201801990 |
| dc.language.iso.fl_str_mv | por |
| language | por |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
| instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str | RCAAP |
| institution | RCAAP |
| reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv | info@rcaap.pt |