On benchmarks of bugs for studies in automatic program repair

Bibliographic details
Year of defense: 2019
Main author: Delfim, Fernanda Madeiral
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Universidade Federal de Uberlândia, Brasil, Programa de Pós-graduação em Ciência da Computação
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufu.br/handle/123456789/44574
http://doi.org/10.14393/ufu.te.2024.5079
Abstract: Software systems are indispensable for everyone's daily activities nowadays. If a system behaves differently from what is expected, it is said to contain a bug. Bugs are ubiquitous, and fixing them manually is well known to be an expensive task. This motivates the emerging research field of automatic program repair, which aims to fix bugs without human intervention. Despite the effort researchers have made in the last decade to show that automatic repair approaches have the potential to fix bugs, there is still a lack of knowledge about the real value and limitations of the proposed repair tools. This is due to high-level, shallow evaluations repeatedly performed on the same benchmark of bugs. In this thesis, we report on contributions to the field of automatic program repair focused on the evaluation of repair tools. First, we address the scarcity of benchmarks of bugs. We propose a new benchmark of real, reproducible bugs, named Bears, which was built with a novel approach to mining bugs from software repositories. Second, we address the lack of knowledge about benchmarks of bugs. We present a descriptive study on Bears and two other benchmarks, including analyses of different aspects of their bugs. Finally, we address the extensive usage of the same benchmark of bugs to evaluate repair tools. We define the benchmark overfitting problem and investigate it through an empirical study of repair tools on Bears and the two other benchmarks. We found that most repair tools indeed overfit the extensively used benchmark. Our findings from both studies suggest that benchmarks of bugs are complementary to each other and that using multiple, diverse benchmarks is key to evaluating how well the effectiveness of automatic repair tools generalizes.