New PCA-based category encoder for efficient data processing in IoT devices

Bibliographic Details
Main Author: Farkhari, H.
Publication Date: 2022
Other Authors: Viana, J., Campos, L. M., Sebastião, P., Bernardo, L.
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10071/28844
Summary: Increasing the cardinality of categorical variables may degrade the overall performance of machine learning (ML) algorithms. This paper presents a novel computational preprocessing method that converts categorical variables to numerical variables for ML algorithms. It uses a supervised binary classifier to extract additional context-related features from the categorical values. The method requires two hyperparameters: a threshold related to the distribution of categories in the variables, and the PCA representativeness. The paper applies the proposed approach to the well-known cybersecurity dataset NSLKDD, selecting three categorical features and converting them to numerical features. After choosing the threshold parameter, we use conditional probabilities to convert the three categorical variables into six new numerical variables. Next, we feed these numerical variables to the PCA algorithm and select all or a subset of the principal components (PCs). Finally, by applying binary classification with ten different classifiers, we measure the performance of the new encoder and compare it with 17 other well-known category encoders. The new technique achieves the highest accuracy and Area Under the Curve (AUC) on high-cardinality categorical variables. We also define harmonic-average metrics to find the best trade-off between train and test performance and to prevent underfitting and overfitting. Ultimately, the number of newly created numerical variables is minimal. This data reduction improves computational processing time in Internet of Things (IoT) devices connected to future networks.
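The pipeline described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not define the exact conditional probabilities, so the sketch assumes each category c is mapped to the pair (P(y=0 | c), P(y=1 | c)), which turns three categorical variables into six numerical ones. It also assumes "PCA representativeness" means the fraction of explained variance to retain, and that the harmonic average is the usual harmonic mean of a train score and a test score. The threshold hyperparameter is omitted because the summary does not specify it. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

import numpy as np


def conditional_probability_encode(column, labels):
    """Map each category c to (P(y=0 | c), P(y=1 | c)), estimated from counts.

    Assumption: this pairing is one plausible reading of the summary,
    not the encoding confirmed by the paper.
    """
    totals = Counter(column)
    positives = defaultdict(int)
    for value, label in zip(column, labels):
        positives[value] += int(label == 1)
    table = {c: (1.0 - positives[c] / n, positives[c] / n) for c, n in totals.items()}
    return np.array([table[v] for v in column], dtype=float)


def pca_reduce(features, representativeness=0.95):
    """Project onto the fewest principal components whose cumulative
    explained-variance ratio reaches `representativeness` (assumed meaning)."""
    centered = features - features.mean(axis=0)
    # Rows of `components` are the principal directions.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2
    ratio = np.cumsum(explained) / explained.sum()
    n_keep = min(int(np.searchsorted(ratio, representativeness)) + 1, len(ratio))
    return centered @ components[:n_keep].T


def harmonic_average(train_score, test_score):
    """Harmonic mean of two scores; it is low when either score is low."""
    if train_score + test_score == 0:
        return 0.0
    return 2.0 * train_score * test_score / (train_score + test_score)


# Toy data: three categorical columns and a binary label (all hypothetical).
protocol = ["tcp", "udp", "tcp", "icmp", "udp", "tcp"]
service = ["http", "dns", "ftp", "echo", "dns", "http"]
flag = ["SF", "SF", "REJ", "SF", "S0", "REJ"]
labels = [0, 0, 1, 1, 0, 1]

# Three categorical variables become six numerical variables.
encoded = np.hstack(
    [conditional_probability_encode(col, labels) for col in (protocol, service, flag)]
)
# PCA then keeps only as many components as the variance target requires.
reduced = pca_reduce(encoded, representativeness=0.95)
print(encoded.shape, reduced.shape)
```

In a real setting the probability table would be estimated on the training split only and then applied to the test split, with a fallback value for categories unseen during training; the sketch skips this for brevity.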
id RCAP_c246ab9d36dd447e83cdfcce330f99fa
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/28844
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.contributor.author.fl_str_mv Farkhari, H.
Viana, J.
Campos, L. M.
Sebastião, P.
Bernardo, L.
dc.subject.por.fl_str_mv Categorical encoders
Dimensionality reduction
Internet of things
Feature selection
Machine learning
NSLKDD
Principal component analyses
publishDate 2022
dc.date.none.fl_str_mv 2022-01-01T00:00:00Z
2022
2023-06-30T09:39:18Z
2023-06-30T10:37:49Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/28844
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 978-1-6654-5975-4
10.1109/GCWkshps56602.2022.10008757
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv IEEE
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597366776627200