Identificação de posts maliciosos na dark web utilizando Aprendizado de Máquina Supervisionado
Year of defense: | 2024 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defense institution: | Universidade Federal de Uberlândia, Brazil, Programa de Pós-graduação em Ciência da Computação |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | https://repositorio.ufu.br/handle/123456789/41232 https://doi.org/10.14393/ufu.di.2023.8127 |
Abstract: | In the face of the constant growth and sophistication of cyber attacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become a necessity in today’s world, enabling security teams to identify potential threats and adopt effective mitigation measures. The field of Cyber Threat Intelligence (CTI) plays a fundamental role by providing security analysts with evidence-based knowledge about cyber threats. Information extraction from CTI can occur through various techniques and involve different data sources; however, machine learning has proven to be a promising approach in this area. Regarding data sources, social networks and online discussion forums have been commonly explored. In this dissertation, text mining, Natural Language Processing (NLP), and machine learning techniques are applied to data collected from Dark Web forums with the aim of identifying malicious posts. The training dataset was labeled considering the occurrence of Indicators of Compromise (IoCs), contextual keywords, and manual analysis. Different classification algorithms were tested using various text representations to find the best model. The results revealed that the model using the LightGBM algorithm and TF-IDF (Term Frequency-Inverse Document Frequency) with Unigram representation achieved the best metrics of accuracy, precision, recall, and F1-score. Additionally, new unlabeled posts were submitted to the classifier, showing promising results when analyzed using Topic Modeling with Latent Dirichlet Allocation (LDA). |
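
The abstract names TF-IDF with unigrams as the best-performing text representation. As a rough illustration only, the sketch below computes unigram TF-IDF weights in plain Python using the textbook form tf × log(N / df). The abstract does not say which TF-IDF variant, tokenizer, or library the dissertation used, so the function name `tfidf_unigrams`, the whitespace tokenizer, and the toy posts are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_unigrams(docs):
    """Return one {term: weight} dict per document (unigram TF-IDF).

    Assumption: textbook weighting tf * log(N / df) with lowercase
    whitespace tokenization; not necessarily the dissertation's variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy posts, not taken from the dissertation's dataset.
posts = [
    "selling fresh exploit kit",
    "selling used laptop",
    "exploit kit with new payload",
]
vectors = tfidf_unigrams(posts)
```

In a pipeline like the one the abstract describes, vectors of this kind would be passed to a classifier such as LightGBM. Terms that appear in fewer posts (here, `fresh`) receive a higher weight than terms shared across posts (here, `selling`).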