Comparison of interpretable machine learning models for predicting heat of combustion and heat of formation

Bibliographic details
Year of defense: 2023
Main author: Maraschin, Mikael
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Master's thesis
Access type: Open access
Language: Portuguese
Defense institution: Universidade Federal de Santa Maria
Brazil
Chemical Engineering
UFSM
Programa de Pós-Graduação em Engenharia Química
Centro de Tecnologia
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.ufsm.br/handle/1/29514
Abstract: Determining the physicochemical properties of substances is essential in chemical engineering, since these properties are tied to equipment sizing, operating conditions, and process efficiency. Because experimental data are not available for every substance, equations must be developed and used to estimate these properties. In recent decades, machine learning algorithms have become popular; through an iterative training process on a database, they become capable of making predictions. To evaluate and compare different methods for property prediction, a total of 551 data points were used for pure substances composed of carbon, hydrogen, oxygen, nitrogen, and sulfur. Each substance was represented computationally either by the number and type of its atoms or by the number and type of chemical bonds between those atoms. These variables served as inputs for all trained models. To relate the substances to their thermodynamic properties, namely the heat of combustion and the heat of formation, the following methods were employed: multivariable linear regression, symbolic regression, artificial neural networks, gradient boosting based on decision trees, and support vector regression. All methods were trained with a data split of 70% for training, 15% for validation, and 15% for testing. The multivariable linear regression model using the chemical-bond description outperformed the other methods, reaching Pearson correlation coefficients of 99.93% and 96.43% on the test data for heat of combustion and heat of formation, respectively. This shows that a linear model is suitable for organic substances composed of C, H, O, N, and S.
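The workflow described above (a 70/15/15 split, a multivariable linear fit, and a Pearson correlation on held-out data) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the four "bond count" features and their coefficients are invented, and ordinary least squares via NumPy stands in for whatever fitting procedure the thesis actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the thesis dataset): 551 "substances",
# each described by counts of 4 hypothetical bond types.
n_samples, n_features = 551, 4
X = rng.integers(0, 10, size=(n_samples, n_features)).astype(float)
true_w = np.array([-400.0, -200.0, 150.0, -50.0])  # made-up contributions
y = X @ true_w + 30.0 + rng.normal(0.0, 20.0, size=n_samples)

# 70% / 15% / 15% split on shuffled indices (the remainder goes to test).
idx = rng.permutation(n_samples)
n_train = int(0.70 * n_samples)
n_val = int(0.15 * n_samples)
train, val, test = np.split(idx, [n_train, n_train + n_val])


def add_intercept(A):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(A)), A])


# Ordinary least squares fitted on the training subset only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)


def predict(A):
    return add_intercept(A) @ coef


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_val = pearson_r(y[val], predict(X[val]))
r_test = pearson_r(y[test], predict(X[test]))
print(f"sizes: train={len(train)}, val={len(val)}, test={len(test)}")
print(f"Pearson r: validation={r_val:.4f}, test={r_test:.4f}")
```

A linear model has no hyperparameters to tune, so the validation subset here serves only as a second held-out check; for the other model families listed it would normally guide model selection.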
In addition to evaluating goodness of fit, the local contribution of each input variable was analyzed using Shapley values, a calculation method derived from game theory. This analysis identified the influence of each variable on a prediction relative to the average value predicted by the model.
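For a linear model, Shapley values have a simple closed form, which makes the idea easy to illustrate. The sketch below assumes an "interventional" value function (features outside a coalition are replaced by values from a background sample); the abstract does not say which estimator the thesis used, and the weights, features, and background data here are invented. It computes exact Shapley values by enumerating coalitions and compares them with the closed form w_i (x_i - mean(x_i)).

```python
import itertools
import math

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear model over 3 made-up bond-count features.
w = np.array([-400.0, -200.0, 150.0])
b = 30.0
background = rng.integers(0, 10, size=(200, 3)).astype(float)


def model(A):
    return A @ w + b


def value(x, subset):
    """Expected prediction when features in `subset` are fixed to x and the
    remaining features are taken from the background sample."""
    mixed = background.copy()
    cols = list(subset)
    if cols:
        mixed[:, cols] = x[cols]
    return model(mixed).mean()


def shapley_exact(x):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for S in itertools.combinations(others, k):
                phi[i] += weight * (value(x, S + (i,)) - value(x, S))
    return phi


x = np.array([3.0, 8.0, 1.0])          # one hypothetical substance
base_value = value(x, ())              # average prediction over the background
phi_exact = shapley_exact(x)
phi_closed = w * (x - background.mean(axis=0))  # closed form for linear models

print("exact  :", np.round(phi_exact, 3))
print("closed :", np.round(phi_closed, 3))
print("sum of contributions + base value:", phi_exact.sum() + base_value)
print("model prediction                 :", model(x))
```

The contributions sum to the difference between the prediction for this input and the average prediction, which is the comparison with the model's average value that the analysis above refers to.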