Determining the number of clusters in categorical data

Bibliographic Details
Main Author: Silvestre, Cláudia
Publication Date: 2013
Other Authors: Cardoso, Margarida, Figueiredo, Mário
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.21/4048
Summary: Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.
id RCAP_e7fe10cb928d4304cedb24e2b2f3e68d
oai_identifier_str oai:repositorio.ipl.pt:10400.21/4048
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Determining the number of clusters in categorical dataCluster analysisModel selectionCategorical variablesCluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.RCIPLSilvestre, CláudiaCardoso, MargaridaFigueiredo, Mário2014-12-12T13:11:24Z2013-072013-07-01T00:00:00Zconference objectinfo:eu-repo/semantics/publishedVersionapplication/mswordhttp://hdl.handle.net/10400.21/4048enginfo:eu-repo/semantics/openAccessreponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiainstacron:RCAAP2025-02-12T07:34:33Zoai:repositorio.ipl.pt:10400.21/4048Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireinfo@rcaap.ptopendoar:https://opendoar.ac.uk/repository/71602025-05-28T19:50:31.896262Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiafalse
dc.title.none.fl_str_mv Determining the number of clusters in categorical data
title Determining the number of clusters in categorical data
spellingShingle Determining the number of clusters in categorical data
Silvestre, Cláudia
Cluster analysis
Model selection
Categorical variables
title_short Determining the number of clusters in categorical data
title_full Determining the number of clusters in categorical data
title_fullStr Determining the number of clusters in categorical data
title_full_unstemmed Determining the number of clusters in categorical data
title_sort Determining the number of clusters in categorical data
author Silvestre, Cláudia
author_facet Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
author_role author
author2 Cardoso, Margarida
Figueiredo, Mário
author2_role author
author
dc.contributor.none.fl_str_mv RCIPL
dc.contributor.author.fl_str_mv Silvestre, Cláudia
Cardoso, Margarida
Figueiredo, Mário
dc.subject.por.fl_str_mv Cluster analysis
Model selection
Categorical variables
topic Cluster analysis
Model selection
Categorical variables
description Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is the determination of the number of clusters, which is unknown and must be inferred from the data. In order to estimate the number of clusters, one often resorts to information criteria, such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach, when compared to the above referred criteria, is the speed of execution, which is especially relevant when dealing with large data sets.
publishDate 2013
dc.date.none.fl_str_mv 2013-07
2013-07-01T00:00:00Z
2014-12-12T13:11:24Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.21/4048
url http://hdl.handle.net/10400.21/4048
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/msword
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833598352827088896