Synthetic data imputation through classification trees (original title: Imputação de dados sintéticos através de árvores de classificação)
Year of defense: | 2019 |
---|---|
Lead author: | |
Advisor: | |
Examination committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de Minas Gerais, Brasil, Programa de Pós-Graduação em Estatística, UFMG |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/30799 |
Abstract: | This work studies the generation of synthetic data through classification and regression trees (CART). The methodology applies when disclosure of sensitive information is restricted for ethical or moral reasons but there is still an interest in releasing that information. Synthetic data build on the idea of multiple imputation: the original values are replaced by values imputed from the distributions of the variables involved in the study. Several methodologies can be used to generate synthetic data. In this work we use CART to classify the records into groups, the Bayesian bootstrap to estimate the density within each group, and the inverse CDF method to generate the final synthetic data. The objective is to extend the methodology used by Reiter and Drechsler (2011) to generate synthetic data with non-parametric models for different distributions of the sensitive variable, including heavy-tailed distributions. We also present the calculation used to measure risk under different hypotheses about the information that a possible intruder may hold. We generate synthetic data for three simulated scenarios with different distributions to verify the efficiency of the model, and we also analyze a real database. In the simulated scenarios, scenario 2 gave worse results than scenarios 1 and 3 because of the distribution of the response variable. For the real database, the results were considered satisfactory. |
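
The last two steps named in the abstract (a Bayesian bootstrap within each group, then inverse CDF sampling) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes a tree has already been fitted and has assigned every record to a leaf (the hypothetical `leaf_ids` argument), and it samples from the weighted empirical CDF of each leaf rather than from a smoothed density estimate. All function and parameter names are illustrative.

```python
import bisect
import random
from itertools import accumulate


def bayesian_bootstrap_weights(n, rng):
    """Draw one Dirichlet(1, ..., 1) weight vector of length n."""
    # The gaps between n - 1 sorted uniforms on [0, 1] follow Dirichlet(1, ..., 1).
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]


def synthesize_leaf(values, n_synthetic, rng):
    """Generate synthetic values for one leaf by inverse CDF sampling."""
    ordered = sorted(values)
    weights = bayesian_bootstrap_weights(len(ordered), rng)
    cdf = list(accumulate(weights))
    cdf[-1] = 1.0  # guard against floating-point drift in the running sum
    synthetic = []
    for _ in range(n_synthetic):
        u = rng.random()
        # Inverse CDF: first observed value whose cumulative weight is >= u.
        idx = bisect.bisect_left(cdf, u)
        synthetic.append(ordered[idx])
    return synthetic


def synthesize(leaf_ids, sensitive, rng):
    """Replace each sensitive value with a draw from its own leaf.

    leaf_ids is assumed to come from an already fitted tree (hypothetical input).
    """
    by_leaf = {}
    for leaf, y in zip(leaf_ids, sensitive):
        by_leaf.setdefault(leaf, []).append(y)
    draws = {
        leaf: iter(synthesize_leaf(ys, len(ys), rng))
        for leaf, ys in by_leaf.items()
    }
    # Emit the synthetic values in the original record order.
    return [next(draws[leaf]) for leaf in leaf_ids]
```

Calling `synthesize(["a", "a", "b"], [1.0, 2.0, 10.0], random.Random(0))` returns one synthetic value per record, each drawn from the values observed in that record's leaf. Because this sketch only resamples observed values, it cannot produce values beyond the observed range, which matters for heavy-tailed variables. Sampling from a smoothed density would be one way around that limitation.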