Using noise to detect test flakiness

SILVA, Denini Gabriel


Bibliographic details
Year of defense:	2022
Main author:	SILVA, Denini Gabriel
Advisor:	D'AMORIM, Marcelo Bezerra
Defense committee:	Not provided by the institution
Document type:	Master's thesis
Access type:	Open access
Language:	eng
Defending institution:	Universidade Federal de Pernambuco
Graduate program:	Programa de Pos Graduacao em Ciencia da Computacao
Department:	Not provided by the institution
Country:	Brazil
Keywords (translated from Portuguese):	Software engineering and programming languages; Android; Software testing; Debugging; Software evolution
Access link:	https://repositorio.ufpe.br/handle/123456789/44567
Abstract:	A test is flaky when it non-deterministically passes or fails across runs on the same configuration (e.g., the same code). Test flakiness hampers regression testing because an observed failure does not necessarily indicate a bug in the program. Static and dynamic techniques for detecting flaky tests have been proposed in the literature, but they are limited. Prior studies have shown that test flakiness is mostly caused by concurrent behavior. Based on that observation, we hypothesize that adding noise to the environment (stress tests that consume machine resources such as CPU and memory) can interfere with the ordering of program events and, consequently, influence test outputs. We propose Shaker, a practical technique that detects flaky tests by comparing the outputs of multiple test runs in noisy environments. A single test run with Shaker is slower than a regular run because the environment is loaded: the process running the test competes for resources with the stressor tasks that Shaker creates. However, we conjecture that Shaker pays off by detecting flakiness in fewer runs than the alternative of running the test suite multiple times in a regular (non-noisy) environment. We evaluated Shaker on a public benchmark of flaky tests and obtained encouraging results. For example, we found that (1) Shaker is 96% precise, almost as precise as ReRun, which by definition reports no false positives; (2) Shaker's recall is much higher than ReRun's (95% versus 65%); and (3) Shaker detects flaky tests much more efficiently than ReRun, despite the execution overhead of introducing noise. In sum, the results indicate that noise is a promising approach to detecting flakiness.
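
The idea described in the abstract, rerunning a test while stressor tasks load the machine and comparing the outcomes, can be illustrated with a minimal Python sketch. This is not Shaker's implementation: the names `cpu_noise` and `is_flaky`, the busy-loop stressor processes, and the default run and worker counts are all assumptions made for illustration, and only CPU noise is modeled (the abstract also mentions memory).

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def cpu_noise(workers=2):
    """Hypothetical stressor: run `workers` busy-loop processes that
    compete for CPU while the enclosed block executes."""
    procs = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(workers)
    ]
    try:
        yield
    finally:
        # Always stop the stressors, even if the test run raises.
        for p in procs:
            p.terminate()
        for p in procs:
            p.wait()


def is_flaky(run_test, runs=3, workers=2):
    """Return True if `run_test` produces differing outcomes across
    `runs` executions in the noisy environment (illustrative only)."""
    outcomes = set()
    with cpu_noise(workers):
        for _ in range(runs):
            outcomes.add(run_test())
            if len(outcomes) > 1:
                return True  # outcomes diverged, so stop early
    return False
```

Here `run_test` is any callable that executes one test and returns a comparable outcome such as `"pass"` or `"fail"`. A test whose outcome never changes is not reported, which mirrors the abstract's point that output comparison only flags tests whose flakiness actually manifests within the runs performed.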
