Feature transformation and reduction for text classification

Bibliographic Details
Main Author: J. Ferreira, Artur
Publication Date: 2010
Other Authors: Figueiredo, Mario
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.21/17914
Summary: Text classification is an important tool for many applications, in supervised, semi-supervised, and unsupervised scenarios. To be processed by machine learning methods, a text (document) is usually represented as a bag-of-words (BoW). A BoW is a large vector of features (usually stored as floating-point values), each of which represents the relative frequency of occurrence of a given word/term in a document. Typically, the number of features is large and many of them may be non-informative for classification tasks; thus the need for feature transformation, reduction, and selection arises. In this paper, we propose two efficient algorithms for feature transformation and reduction for BoW-like representations. The proposed algorithms rely on simple statistical analysis of the input pattern, exploiting the BoW and its binary version. The algorithms are evaluated with support vector machine (SVM) and AdaBoost classifiers on standard benchmark datasets. The experimental results show the adequacy of the reduced/transformed binary features for text classification problems, as well as an improvement in the test set error rate when using the proposed methods.
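The representation described in the summary can be illustrated with a minimal sketch. It only shows how a relative-frequency BoW vector and its binary version might be built; it is not the feature transformation or reduction algorithms proposed in the paper, which this record does not detail. The function name `bow_vectors`, the whitespace tokenization, and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def bow_vectors(docs):
    """Build relative-frequency and binary BoW vectors (hypothetical helper).

    Assumes simple lowercase whitespace tokenization; real pipelines
    usually apply more careful preprocessing.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct term across all documents, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    freq_rows, binary_rows = [], []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        # Relative frequency of each vocabulary term in this document.
        freq_rows.append([counts[w] / total for w in vocab])
        # Binary version: 1 if the term occurs at all, 0 otherwise.
        binary_rows.append([1 if counts[w] else 0 for w in vocab])
    return vocab, freq_rows, binary_rows


# Toy documents, invented for this example.
docs = ["the cat sat", "the dog sat on the mat"]
vocab, freq, binary = bow_vectors(docs)
```

In this sketch each row of `freq` sums to 1 for a non-empty document, while `binary` keeps only presence or absence of each term, which is the kind of binary BoW the summary refers to.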
id RCAP_52b5beb7ea922766a23785d7ba0e94a1
oai_identifier_str oai:repositorio.ipl.pt:10400.21/17914
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Feature transformation and reduction for text classification
dc.contributor.none.fl_str_mv RCIPL
dc.contributor.author.fl_str_mv J. Ferreira, Artur
Figueiredo, Mario
dc.subject.por.fl_str_mv text classification
bag-of-words (BoW)
dc.date.none.fl_str_mv 2010
2010-01-01T00:00:00Z
2024-11-18T09:39:54Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.21/17914
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 978-989-8425-14-0
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv SciTePress
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt