Infraestrutura de kernel para coleta de dados de eventos de falha no Linux (Kernel infrastructure for collecting failure event data in Linux)

Bibliographic details
Year of defense: 2024
Main author: Maciel, Vinícius Fonseca
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Embargoed access
Language: por
Defending institution: Universidade Federal de Uberlândia
Brazil
Programa de Pós-graduação em Ciência da Computação (Graduate Program in Computer Science)
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://repositorio.ufu.br/handle/123456789/44430
https://doi.org/10.14393/ufu.di.2024.774
Abstract: Computing systems demand high reliability because they are intrinsically involved in contexts that directly affect human activities. Failures, whether in user applications, services, or the operating system kernel, can range from minor inconveniences to life-threatening disasters. Reliability is a fundamental metric for statistically quantifying the level of trust one can place in software. Given the observed importance of dedicated mechanisms for failure collection and analysis in systems such as Windows, through the Reliability Analysis Component (RAC), the need for similar analyses on Linux was identified. For this reason, a kernel infrastructure, the Linux Reliability Analysis Component (LRAC), was created to enable the collection and storage of failure data within this operating system. This work investigates the mechanisms behind General Protection Fault (GPF) and Page Fault (PF) failures and how LRAC can identify them methodically. The violation conditions that trigger these failures on x86 processors were analyzed and used to develop a new taxonomy that makes the classification of these failures more precise and less generic. A new data collection protocol was incorporated into LRAC to reflect these specificities. Controlled tests were then conducted to reproduce failure events and evaluate the new functionality proposed for LRAC. The results showed that traditional Linux mechanisms often diagnose distinct failure characteristics generically, and that the new LRAC functionality was effective in distinguishing and classifying these differences.