Contributions to the mixed data clustering problem: from a conceptual codification and classification proposal to the usage of optimization methods

Bibliographic details
Year of defense: 2021
Main author: Fróes, Nádia Junqueira Martarelli
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/18/18156/tde-16112021-121737/
Abstract: Data mining techniques have gained prominence in recent years due to their wide range of applications. Among such techniques is data clustering, which identifies groups in unlabeled datasets. Because describing a phenomenon in more detail often requires both numerical and categorical attributes, research on data clustering has come to include the clustering of mixed datasets. Although promising, this branch of study is still recent in the literature. Thus, this thesis aims to contribute to the advancement of the mixed data clustering problem through four objectives: to propose a standard representation for documents published in the knowledge-discovery field; to carry out a systematic review of the literature that provides a comprehensive view of this topic as well as a detailed understanding of the selected documents; to model the meta-heuristic Biased Random-Key Genetic Algorithm (BRKGA) to propose a solution to the feature balancing problem in distance-based mixed data clustering algorithms; and to model and hybridize the meta-heuristics Evolutionary Clustering Search, Iterated Local Search, and BRKGA to propose a solution to the feature weighting problem in a model-based mixed data clustering algorithm. As a result, an initial idea for a standard representation was proposed, and a systematic review of the literature was obtained, with a bibliometric and individual analysis of the 160 documents resulting from the selection step of the designed methodological procedure.
In addition, the proposed meta-heuristics obtained the best performance in most of the 476 simulated datasets, which covered several characteristics, such as data generated from normal and lognormal distributions, balanced and unbalanced proportions of numerical and categorical attributes in the dataset, and different levels of attribute overlap. Therefore, one concludes that this thesis reached its objectives, contributing to the advancement of the mixed data clustering technique.