A Hybrid Machine Learning System for Vulnerability Detection in Web Applications

Bibliographic Details
Main Author: Oliveira, Miguel César de Albuquerque
Publication Date: 2023
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10451/63629
Summary: Master's thesis, Data Science, 2023, Universidade de Lisboa, Faculdade de Ciências
id RCAP_41bd93c6b8100dd7113f7cc4ce591309
oai_identifier_str oai:repositorio.ulisboa.pt:10451/63629
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
abstract Security in web applications is often compromised by poorly written code that attackers exploit. Source code vulnerability detection tools have been developed using static analysis and machine learning techniques. The best-performing tools aim for very low false-negative rates along with an acceptable false-positive rate. Static analysis requires manual programming to identify vulnerabilities, depends on human expertise, and is usually limited to a specific programming language. Classical supervised machine learning approaches, on the other hand, may be limited in their ability to identify zero-day vulnerabilities, or prone to overfitting because of the limited datasets available. This dissertation aims to develop a hybrid machine learning (ML) system for vulnerability detection in web applications. The system uses a combination of static analysis and Natural Language Processing (NLP) techniques to identify functions related to vulnerabilities, which are then used to build representative datasets. The datasets serve as input for unsupervised machine learning and other behaviour-based anomaly detection algorithms, which flag the code snippets under analysis as suspicious. For the flagged snippets, the system aims to confirm which are vulnerable and to identify the type of vulnerability via supervised machine learning techniques. The dissertation explores a novel approach to vulnerability detection by combining unsupervised anomaly detection models with supervised machine learning and NLP techniques.
Previous research in vulnerability detection has primarily focused on either unsupervised or supervised methods, neglecting the potential benefits of a hybrid approach. The goal of this research is to investigate the efficacy of hybrid architectures in identifying software vulnerabilities and to determine the optimal machine learning models and datasets for this purpose. The proposed hybrid model consists of several layers. The first uses a One-Class Support Vector Machine (OCSVM) model to detect anomalies; the second employs a Random Forest model to confirm the presence of vulnerabilities among those anomalies. The type of vulnerability is then classified by a Logistic Regression model that relies on the Doc2Vec model for feature extraction. The research includes experimentation with various machine learning models and datasets, evaluating features that range from simple binary features to more complex Doc2Vec embeddings. The thesis demonstrates OCSVM’s suitability for semi-unsupervised anomaly detection, yielding promising results across various datasets. Additionally, the study assesses Random Forests’ effectiveness in classifying vulnerable source code snippets based on OCSVM-detected anomalies and validates the use of NLP techniques for feature extraction from source code snippets. Overall, the proposed hybrid model achieved an accuracy of 65%. Although this result seems low, the research offers a promising hybrid approach to vulnerability detection, leveraging the strengths of unsupervised and supervised machine learning models. The findings suggest opportunities for further enhancements and optimizations, paving the way for more effective software vulnerability detection systems.
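The layered architecture described in the abstract (OCSVM, then Random Forest, then Logistic Regression) can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes scikit-learn and NumPy, replaces the Doc2Vec embeddings with synthetic random vectors, and uses hypothetical vulnerability labels ("sqli", "xss"), an invented "anomalous-unconfirmed" label, and arbitrary hyperparameters chosen only to make the example run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for snippet feature vectors (NOT the thesis datasets;
# in the thesis such vectors would come from binary features or Doc2Vec).
benign = rng.normal(0.0, 1.0, size=(200, 8))
vuln_a = rng.normal(4.0, 1.0, size=(60, 8))    # hypothetical class "sqli"
vuln_b = rng.normal(-4.0, 1.0, size=(60, 8))   # hypothetical class "xss"

# Layer 1: One-Class SVM fitted on benign snippets only, so anything
# unlike them is reported as an anomaly (predict() returns -1).
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(benign)

# Layer 2: Random Forest that confirms whether a snippet is vulnerable (1).
X2 = np.vstack([benign, vuln_a, vuln_b])
y2 = np.array([0] * len(benign) + [1] * (len(vuln_a) + len(vuln_b)))
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X2, y2)

# Layer 3: Logistic Regression that names the vulnerability type.
X3 = np.vstack([vuln_a, vuln_b])
y3 = np.array(["sqli"] * len(vuln_a) + ["xss"] * len(vuln_b))
typer = LogisticRegression(max_iter=1000).fit(X3, y3)


def analyse(snippets):
    """Run feature vectors through the three layers; return one label each."""
    snippets = np.asarray(snippets, dtype=float)
    labels = np.full(len(snippets), "clean", dtype=object)
    # Only snippets flagged by layer 1 reach layer 2.
    idx = np.flatnonzero(ocsvm.predict(snippets) == -1)
    if idx.size:
        confirmed = forest.predict(snippets[idx]) == 1
        labels[idx[~confirmed]] = "anomalous-unconfirmed"
        # Only confirmed vulnerabilities reach layer 3.
        if confirmed.any():
            labels[idx[confirmed]] = typer.predict(snippets[idx[confirmed]])
    return labels.tolist()


print(analyse(np.vstack([benign[:3], vuln_a[:3], vuln_b[:3]])))
```

Each layer only sees what the previous layer passed on, so a false negative in the anomaly detector cannot be recovered later; this is one reason the abstract stresses low false-negative rates.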
dc.title.none.fl_str_mv A Hybrid Machine Learning System for Vulnerability Detection in Web Applications
title A Hybrid Machine Learning System for Vulnerability Detection in Web Applications
author Oliveira, Miguel César de Albuquerque
author_facet Oliveira, Miguel César de Albuquerque
author_role author
dc.contributor.none.fl_str_mv Medeiros, Ibéria Vitória de Sousa, 1971-
Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Oliveira, Miguel César de Albuquerque
dc.subject.por.fl_str_mv web vulnerability detection
machine learning
anomaly detection
natural language processing
software security
Master's theses - 2024
Departamento de Informática
description Master's thesis, Data Science, 2023, Universidade de Lisboa, Faculdade de Ciências
publishDate 2023
dc.date.none.fl_str_mv 2023
2024-03-21T10:29:12Z
2024
2024-01-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/63629
url http://hdl.handle.net/10451/63629
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833601766250250240