Robust outlier labeling rules for light-tailed and heavy-tailed data

Bibliographic details
Year of defense: 2019
Main author: Silva, Kelly Cristina Ramos da
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042019-145141/
Abstract: Outlier rules are used to detect outliers in univariate data. A commonly used outlier rule is based on the boxplot, a graphical tool for univariate data analysis. However, it is well known that the boxplot performs significantly worse on skewed distributions than on symmetric ones. To overcome this deficiency, an outlier rule known as the adjusted boxplot has been proposed in the literature. The adjusted boxplot modifies the classical boxplot by incorporating a skewness measure into it. Although this modification has resulted in a state-of-the-art version of the classical boxplot, it yields a rule that is not flexible enough to let the user easily pre-specify a nominal outside rate. Furthermore, the adjusted boxplot can in some situations have a significantly higher computational cost than the classical boxplot, since its computational complexity is O(n log n), while that of the classical boxplot is O(n). To address those issues, this thesis proposes a more formal approach to deriving outlier rules; the resulting rules exhibit better overall performance than the adjusted boxplot, especially as the contamination level increases. Moreover, the proposed rules are more flexible and have a lower computational cost than the adjusted boxplot. It is also shown that the classical boxplot and many of its modifications or variations are unified by a single concept introduced by this thesis: the quartile contrast. The problem with outlier rules based on the quartile contrast, as with the adjusted boxplot, is that they are better suited to light-tailed data than to heavy-tailed data. For heavy-tailed data, an outlier rule known as the generalized boxplot has been proposed in the literature. The main problem with the generalized boxplot is that it is very unstable, since a single outlier can dramatically affect its performance.
To address this issue, the thesis uses the quartile contrast approach to derive an outlier rule sensitive to tail heaviness. The experimental analysis shows that the tail-heaviness-sensitive outlier rule proposed by the thesis indeed performs more stably than the generalized boxplot. Evaluating the performance of outlier rules is a problem in its own right. Therefore, to measure the performance of outlier rules, the thesis introduces the GME, a measure that has proved more effective for assessing outlier rules than traditional measures involving only the false positive rate and the false negative rate.
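
For readers unfamiliar with the classical boxplot rule that the abstract takes as its baseline, the following is a minimal Python sketch of it. It is not taken from the thesis: the function name `boxplot_outliers`, the fence multiplier `k=1.5` (the conventional Tukey choice) and the "inclusive" quartile definition are assumptions made for illustration, and the thesis's own rules (quartile contrast, GME) are not reproduced here. The sketch sorts the data to obtain the quartiles, so it does not attempt the linear-time computation mentioned in the abstract.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the points lying outside the fences [Q1 - k*IQR, Q3 + k*IQR].

    Assumptions (not from the thesis): k = 1.5 is the conventional
    multiplier, and quartiles use the "inclusive" interpolation method.
    """
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample: eight tightly grouped values and one distant value.
sample = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 9.5]
print(boxplot_outliers(sample))
```

With this sample the quartiles are 2.5 and 3.1, so the fences are roughly 1.6 and 4.0, and only 9.5 is labeled as an outlier. Because the fences are symmetric around the quartiles, a rule of this form tends to over-flag the long tail of a skewed sample, which is the deficiency the abstract describes.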