Detecting outliers and annotating their types with indexing structures

Detalhes bibliográficos
Ano de defesa: 2022
Autor(a) principal: Silva, Guilherme Domingos Faria
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Dissertação
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-15092022-141353/
Resumo: The constant increase in the amount of data available on the internet is accentuated with the popularization of technologies such as 5G and Internet of Things. In datasets of large volume there is usually a strong presence of outliers that are not detected or that are just discarded. The outlier detection literature demonstrates that the investigation of these singular instances can provide new insights into the behavior of systems and people. This inspection allows diseases to be identified early, financial market trends to be better interpreted and cybersecurity attacks to be prevented. However, outlier detection techniques carry limitations, being: (1) dependent on the availability of the instances features, which can generate privacy issues; (2) poorly scalable and; (3) capable of providing only a binary separation that allows detecting outliers, but not classifying them so that they are better understood. Starting from an unlabeled dataset for which only the distances between the instances are available, how to detect outliers and categorize them by type efficiently? In the vast literature on outlier detection, there is no work, as far as we know, that deals with the problem of annotating outliers. Outliers can be classified into three large groups: (a) global outliers, instances that are severely different from others in the dataset, such as errors during insertion of information into a database; (b) local outliers, instances that, despite being similar to the others in the dataset as a whole, have minimal variations that make them different in a smaller scope, for instance, a football player who makes many mistakes while playing in a strong team and; (c) collective outliers, small groups of instances that are, simultaneously, quite different from the rest, such as a denial-of-service cyberattack, with few machines having similar harmful behavior. In this project we introduce C-ALLOUT: a new method for detecting outliers that is also able to categorize them by type. C-ALLOUT is able to maintain itself in terms of equality, or even superiority, when compared to state-of-the-art algorithms, still contributing with the annotation of outliers, a task that competitors are not able to perform. C-ALLOUT is based on Slim-tree, an indexing structure that makes it scalable, with O(nlogn) complexity of time and space. Our proposed method deals with both scenarios: having the features available or limited to distances only. Finally, C-ALLOUT works without depending on any interaction with the user, being parameter-free by default, the ideal for unsupervised tasks like outlier analysis.