Determining the number of clusters in categorical data

Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário

Bibliographic Details
Main Author:	Silvestre, Cláudia
Publication Date:	2013
Other Authors:	Cardoso, Margarida; Figueiredo, Mário
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/10400.21/4048
Summary:	Cluster analysis for categorical data has been an active area of research. A well-known problem in this area is determining the number of clusters, which is unknown and must be inferred from the data. To estimate the number of clusters, one often resorts to information criteria such as BIC (Bayesian information criterion), MML (minimum message length, proposed by Wallace and Boulton, 1968), and ICL (integrated classification likelihood). In this work, we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion to select the number of clusters and a variant of the EM algorithm to estimate the model parameters. This EM variant integrates model estimation and model selection in a single algorithm. For clustering categorical data, we assume a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008). Results obtained with synthetic datasets are encouraging. The main advantage of the proposed approach over the criteria mentioned above is its speed of execution, which is especially relevant when dealing with large datasets.
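
As a rough illustration of the information-criterion baseline that the summary mentions (not the authors' MML-based algorithm), the sketch below fits a finite mixture of independent categorical variables with plain EM for each candidate number of clusters and keeps the one with the lowest BIC. The function names (`fit_mixture`, `select_k`), the pseudo-count `eps`, the random initialisation, and the restart scheme are all assumptions made for this example; none of them come from the paper.

```python
import math
import random


def e_step(data, weights, theta):
    """Return (responsibilities, log-likelihood) under the current parameters."""
    resp, loglik = [], 0.0
    for row in data:
        # log(weight) + sum over variables of log P(observed level | cluster)
        logp = [
            math.log(w) + sum(math.log(comp[j][x]) for j, x in enumerate(row))
            for w, comp in zip(weights, theta)
        ]
        top = max(logp)  # log-sum-exp for numerical stability
        norm = top + math.log(sum(math.exp(lp - top) for lp in logp))
        loglik += norm
        resp.append([math.exp(lp - norm) for lp in logp])
    return resp, loglik


def fit_mixture(data, n_levels, k, n_iter=50, seed=0, eps=1e-6):
    """Plain EM for a k-component mixture of independent categorical variables.

    data: rows of integer category codes; n_levels[j]: number of levels of
    variable j. eps is a small pseudo-count (an assumption of this sketch)
    that keeps every probability strictly positive so log() is defined.
    """
    rng = random.Random(seed)
    n = len(data)
    weights = [1.0 / k] * k
    # theta[c][j][l] = P(variable j = level l | cluster c), random start
    theta = []
    for _ in range(k):
        comp = []
        for levels in n_levels:
            raw = [rng.random() + 0.5 for _ in range(levels)]
            total = sum(raw)
            comp.append([r / total for r in raw])
        theta.append(comp)
    for _ in range(n_iter):
        resp, _ll = e_step(data, weights, theta)
        # M-step: re-estimate weights and level probabilities per cluster
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = (mass + eps) / (n + k * eps)
            for j, levels in enumerate(n_levels):
                counts = [eps] * levels
                for r, row in zip(resp, data):
                    counts[row[j]] += r[c]
                total = sum(counts)
                theta[c][j] = [v / total for v in counts]
    _resp, loglik = e_step(data, weights, theta)
    return loglik, weights, theta


def bic(loglik, n, n_levels, k):
    """BIC = -2 log L + (free parameters) * log n; lower is better."""
    n_params = (k - 1) + k * sum(levels - 1 for levels in n_levels)
    return -2.0 * loglik + n_params * math.log(n)


def select_k(data, n_levels, k_max, restarts=2):
    """Fit k = 1..k_max, keep the best restart per k, return (best_k, scores)."""
    scores = {}
    for k in range(1, k_max + 1):
        best = max(fit_mixture(data, n_levels, k, seed=s)[0] for s in range(restarts))
        scores[k] = bic(best, len(data), n_levels, k)
    return min(scores, key=scores.get), scores
```

Calling `select_k(data, n_levels, k_max)` on integer-coded rows returns the chosen number of clusters together with the BIC score for each candidate. Note that this baseline refits the model once per candidate (and per restart), which is the repeated cost that an approach integrating estimation and selection in a single run is meant to avoid.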

Item metadata

id	RCAP_e7fe10cb928d4304cedb24e2b2f3e68d
oai_identifier_str	oai:repositorio.ipl.pt:10400.21/4048
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv	Determining the number of clusters in categorical data
author	Silvestre, Cláudia
author_facet	Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
author_role	author
author2	Cardoso, Margarida; Figueiredo, Mário
author2_role	author; author
dc.contributor.none.fl_str_mv	RCIPL
dc.contributor.author.fl_str_mv	Silvestre, Cláudia; Cardoso, Margarida; Figueiredo, Mário
dc.subject.por.fl_str_mv	Cluster analysis; Model selection; Categorical variables
topic	Cluster analysis; Model selection; Categorical variables
publishDate	2013
dc.date.none.fl_str_mv	2013-07; 2013-07-01T00:00:00Z; 2014-12-12T13:11:24Z
dc.type.driver.fl_str_mv	conference object
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10400.21/4048
url	http://hdl.handle.net/10400.21/4048
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/msword
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833598352827088896

Similar Items