Categorical data clustering using a minimum message length criterion

Bibliographic Details
Main Author: Silvestre, Cláudia
Publication Date: 2012
Other Authors: Cardoso, Margarida; Figueiredo, Mário
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.21/4047
Summary: Research on cluster analysis for categorical data continues to develop, with new clustering algorithms being proposed. In this context, however, the determination of the number of clusters is rarely addressed. We propose a new approach in which clustering and the estimation of the number of clusters are performed simultaneously for categorical data. We assume that the data originate from a finite mixture of multinomial distributions and use a minimum message length (MML) criterion to select the number of clusters (Wallace and Bolton, 1986). For this purpose, we implement an EM-type algorithm (Silvestre et al., 2008) based on the approach of Figueiredo and Jain (2002). The novelty of the approach lies in integrating model estimation and selection of the number of clusters in a single algorithm, rather than selecting this number from a set of pre-estimated candidate models. The performance of our approach is compared with that of the Bayesian Information Criterion (BIC) (Schwarz, 1978) and the Integrated Completed Likelihood (ICL) (Biernacki et al., 2000) using synthetic data. The results illustrate the capacity of the proposed algorithm to recover the true number of clusters while running faster than the BIC- and ICL-based alternatives, which is especially relevant when dealing with large data sets.
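The idea in the summary, estimating a mixture and its number of components in one run, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes independent categorical variables within each cluster, uses the weight update of Figueiredo and Jain (2002), in which a component whose expected count falls below half its number of free parameters is dropped, and updates all components at once instead of one at a time. The function name `fit_mixture_mml`, its parameters, and the smoothing constant `eps` are hypothetical and chosen only for illustration.

```python
import math
import random


def fit_mixture_mml(data, n_levels, k_max, n_iter=100, seed=0, eps=1e-6):
    """Sketch: EM for a categorical mixture with MML-style component pruning.

    data     -- list of tuples of category codes (ints starting at 0)
    n_levels -- number of categories of each variable
    k_max    -- initial (deliberately too large) number of components
    eps      -- smoothing constant (an assumption, avoids log(0))
    """
    rng = random.Random(seed)
    # Half the number of free parameters per component (penalty term).
    half_params = sum(c - 1 for c in n_levels) / 2.0

    # Start with equal weights and random category probabilities.
    weights = [1.0 / k_max] * k_max
    theta = []
    for _ in range(k_max):
        comp = []
        for c in n_levels:
            raw = [rng.random() + 0.5 for _ in range(c)]
            total = sum(raw)
            comp.append([r / total for r in raw])
        theta.append(comp)

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each row.
        resp = []
        for x in data:
            logp = [
                math.log(w) + sum(math.log(t[j][v]) for j, v in enumerate(x))
                for w, t in zip(weights, theta)
            ]
            top = max(logp)
            p = [math.exp(lp - top) for lp in logp]
            s = sum(p)
            resp.append([q / s for q in p])

        # M-step for the weights: subtract the penalty and drop
        # components whose penalized expected count is not positive.
        mass = [sum(r[k] for r in resp) for k in range(len(weights))]
        shrunk = [max(0.0, m - half_params) for m in mass]
        keep = [k for k, s in enumerate(shrunk) if s > 0.0]
        if not keep:
            break  # keep the previous estimate rather than an empty model
        total = sum(shrunk[k] for k in keep)
        weights = [shrunk[k] / total for k in keep]

        # M-step for the category probabilities of surviving components.
        theta = []
        for k in keep:
            comp = []
            for j, c in enumerate(n_levels):
                counts = [eps] * c
                for x, r in zip(data, resp):
                    counts[x[j]] += r[k]
                s = sum(counts)
                comp.append([v / s for v in counts])
            theta.append(comp)

    return weights, theta


# Synthetic example: two clusters, four variables with three categories each.
rng = random.Random(1)
profiles = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
data = [
    tuple(rng.choices(range(3), weights=profiles[i % 2])[0] for _ in range(4))
    for i in range(300)
]
weights, theta = fit_mixture_mml(data, n_levels=[3, 3, 3, 3], k_max=5)
print(len(weights), [round(w, 3) for w in weights])
```

The number of surviving components, `len(weights)`, plays the role of the estimated number of clusters. A BIC- or ICL-based procedure would instead fit one model per candidate number of clusters and compare them afterwards, which is the extra cost the summary refers to.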
id RCAP_abceb7fe507a67371986dcd2b375c4e4
oai_identifier_str oai:repositorio.ipl.pt:10400.21/4047
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
author Silvestre, Cláudia
author_facet Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
author_role author
author2 Cardoso, Margarida
Figueiredo, Mário
author2_role author
author
dc.contributor.none.fl_str_mv RCIPL
dc.contributor.author.fl_str_mv Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
dc.subject.por.fl_str_mv Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
topic Cluster analysis
Categorical data
Expectation-maximization algorithm
MML - Minimum Message Length - criterion
publishDate 2012
dc.date.none.fl_str_mv 2012-10
2012-10-01T00:00:00Z
2014-12-12T12:22:05Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.21/4047
url http://hdl.handle.net/10400.21/4047
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/msword
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833598514047746048