Bibliographic details
Year of defense: 2018
Main author: Bispo, Thiago Dias
Advisor: Macedo, Hendrik Teixeira
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Institution of defense: Not provided by the institution
Graduate program: Pós-Graduação em Ciência da Computação (Graduate Program in Computer Science)
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Keywords in English:
CNPq knowledge area:
Access link: http://ri.ufs.br/jspui/handle/riufs/10659
Abstract:
One of the consequences of the popularization of Internet access is the spread of insults and discriminatory messages, so-called hate speech. These are comments that aim to discriminate against a person or group of people because they belong to a certain group, usually a minority, or share some characteristic with other people. Fighting hate speech is a growing demand in real and virtual life, as it profoundly affects the dignity of its victims. Detecting hate speech is a difficult task because, in addition to natural language being inherently ambiguous, it requires a certain level of understanding of linguistic structure. In many discourses, discrimination does not happen explicitly or through typical expressions: world knowledge is needed to recognize it. In addition, it is sometimes necessary to understand the context of a sentence to perceive its hateful content. Sarcasm is another huge challenge (even for humans), since recognizing the intent behind it requires knowledge of the community and potentially of the user responsible for the comment. Several approaches have been proposed for the hate speech recognition task. Many authors consider the use of N-grams, of which those based on characters are more effective than those based on words. Lexical features, combined or not with N-grams, were also evaluated, such as the presence or absence of negative words, classes or expressions indicative of insult, punctuation marks, letter repetitions, the presence of emoji, etc. Linguistic features, such as POS tags and the relationships between terms in the dependency tree produced by syntactic analysis, were ineffective when used alone. Recently, the most successful approach has used a neural network to create a distributed representation of the sentences in a hate speech corpus, indicating that training word embeddings is a promising path for hate speech detection.
Language drastically affects Natural Language Processing (NLP) tasks, since most, if not all, words differ from one language to another, as do syntax, morphology, and linguistic construction. As a result, work done on English is not directly applicable to Portuguese-language corpora. In addition, Portuguese corpora for hate speech are rare, forcing researchers in the area to do all the construction work themselves. In this dissertation we studied the use of a deep cross-lingual Long Short-Term Memory (LSTM) model, trained with a hate speech dataset translated from English in two different ways, and preprocessed and vectorized with several strategies that were represented in 24 scenarios. The main approaches adopted included the training of embeddings through word index vectors (the state-of-the-art technique), TF-IDF vectors, and N-gram vectors, with or without the GloVe vocabulary, tested with the dataset constructed and labeled in this work and with another dataset available in Portuguese. The inverse process was also tried: we translated our corpus into English and compared the performance with that of its original version. With the embeddings resulting from the training process in each scenario, we used a Gradient Boosting Decision Tree (GBDT) as a means of improving classification. In fact, the results obtained with the LSTM were improved in many scenarios. We achieved an accuracy of up to 70% in the experiments using the model built with the English corpus and our dataset translated into English. In other scenarios, traditional and otherwise successful techniques, such as TF-IDF vectors combined with an LSTM, did not prove sufficient. Two important contributions of this work are: (i) the proposal of an alternative research approach to the problem, based on the translation of corpora, and (ii) the provision of a Portuguese hate speech dataset to the community.
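The abstract names three ways of turning text into vectors: character N-grams, word index vectors, and TF-IDF vectors. The following is a minimal illustrative sketch of those three representations using only the Python standard library. It is not the dissertation's implementation: the function names, the padding and unknown-token ids, the smoothed IDF variant, and the toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_index_vector(tokens, vocab, max_len=6):
    """Map tokens to integer ids (0 = padding, 1 = unknown), padded to max_len."""
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


def tfidf_vectors(docs):
    """Compute TF-IDF weights (with a smoothed IDF) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            tok: (cnt / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return vectors


# Toy, made-up comments (hypothetical; not taken from the dissertation's dataset).
sentences = ["you are great", "you are awful", "awful awful people"]
docs = [s.lower().split() for s in sentences]

# Ids start at 2 because 0 and 1 are reserved for padding and unknown tokens.
vocab = {tok: i + 2 for i, tok in enumerate(sorted({t for d in docs for t in d}))}

print(char_ngrams("Hate", 3))
print(word_index_vector(docs[0], vocab))
print(tfidf_vectors(docs)[0])
```

In a setup like the one described, word index vectors would typically feed an embedding layer in front of the LSTM, while the N-gram and TF-IDF vectors are alternative inputs. That wiring is outside the scope of this sketch.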