Feature transformation and reduction for text classification

J. Ferreira, Artur; Figueiredo, Mario

Bibliographic Details
Main Author:	J. Ferreira, Artur
Publication Date:	2010
Other Authors:	Figueiredo, Mario
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/10400.21/17914
Summary:	Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. In order to be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a large vector of features (usually stored as floating-point values), which represent the relative frequency of occurrence of a given word/term in each document. Typically, we have a large number of features, many of which may be non-informative for classification tasks, and thus the need for feature transformation, reduction, and selection arises. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show the adequacy of the reduced/transformed binary features for text classification problems as well as the improvement on the test set error rate, using the proposed methods.
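
Illustrative note: the summary describes a BoW as a vector of relative term frequencies and also refers to its binary version. The sketch below shows only that representation on two toy documents; it is not the authors' transformation or reduction algorithms, which this record does not detail. The function name `bow_vectors`, the whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def bow_vectors(docs):
    """Build relative-frequency and binary BoW rows for tokenized documents.

    Hypothetical helper for illustration only: `docs` is a list of token lists.
    Returns (vocabulary, relative-frequency rows, binary rows).
    """
    # Vocabulary: every distinct term across the collection, in sorted order.
    vocab = sorted({term for doc in docs for term in doc})
    freq_rows, binary_rows = [], []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        # Relative frequency of each vocabulary term in this document.
        freq_rows.append([counts[t] / total for t in vocab])
        # Binary version: 1 if the term occurs at all, else 0.
        binary_rows.append([1 if counts[t] > 0 else 0 for t in vocab])
    return vocab, freq_rows, binary_rows


# Toy documents (assumed data, split on whitespace).
docs = [
    "the cat sat on the mat".split(),
    "the dog barked".split(),
]
vocab, freq, binary = bow_vectors(docs)
print(vocab)
print(binary)
```

The binary rows carry one bit per term instead of a floating-point value, which is the kind of compact representation the summary says the proposed algorithms exploit.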

Item metadata

id	RCAP_52b5beb7ea922766a23785d7ba0e94a1
oai_identifier_str	oai:repositorio.ipl.pt:10400.21/17914
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv	Feature transformation and reduction for text classification
title	Feature transformation and reduction for text classification
spellingShingle	Feature transformation and reduction for text classification J. Ferreira, Artur text classification bag-of-words (BoW)
title_short	Feature transformation and reduction for text classification
title_full	Feature transformation and reduction for text classification
title_fullStr	Feature transformation and reduction for text classification
title_full_unstemmed	Feature transformation and reduction for text classification
title_sort	Feature transformation and reduction for text classification
author	J. Ferreira, Artur
author_facet	J. Ferreira, Artur Figueiredo, Mario
author_role	author
author2	Figueiredo, Mario
author2_role	author
dc.contributor.none.fl_str_mv	RCIPL
dc.contributor.author.fl_str_mv	J. Ferreira, Artur Figueiredo, Mario
dc.subject.por.fl_str_mv	text classification bag-of-words (BoW)
topic	text classification bag-of-words (BoW)
description	Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. In order to be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a large vector of features (usually stored as floating-point values), which represent the relative frequency of occurrence of a given word/term in each document. Typically, we have a large number of features, many of which may be non-informative for classification tasks, and thus the need for feature transformation, reduction, and selection arises. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show the adequacy of the reduced/transformed binary features for text classification problems as well as the improvement on the test set error rate, using the proposed methods.
publishDate	2010
dc.date.none.fl_str_mv	2010 2010-01-01T00:00:00Z 2024-11-18T09:39:54Z
dc.type.driver.fl_str_mv	conference object
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10400.21/17914
url	http://hdl.handle.net/10400.21/17914
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	978-989-8425-14-0
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	SciTePress
publisher.none.fl_str_mv	SciTePress
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833598420381597696

Similar Items