Text based classification of companies in CrunchBase
Main Author: | Batista, F. |
---|---|
Publication Date: | 2015 |
Other Authors: | João P. Carvalho |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/10071/25098 |
Summary: | This paper introduces two fuzzy-fingerprint-based text classification techniques that were successfully applied to automatically label companies from CrunchBase, based purely on their unstructured textual descriptions. This is a real and very challenging problem because the set of possible labels is large (more than 40) and because the textual descriptions need not follow any criteria and are therefore extremely heterogeneous. Fuzzy fingerprints are a recently introduced technique for fast classification. They perform well on unbalanced datasets and can cope with a very large number of classes. The paper compares them against some of the best text classification techniques commonly used for similar problems. On the CrunchBase dataset, the fuzzy-fingerprint-based approach outperformed the other techniques. |
id |
RCAP_8905ea78b3ad7f127cf8a83782c116d9 |
---|---|
oai_identifier_str |
oai:repositorio.iscte-iul.pt:10071/25098 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
author |
Batista, F. |
author_role |
author |
author2 |
João P. Carvalho |
author2_role |
author |
dc.subject.por.fl_str_mv |
Text classification; Fuzzy fingerprints; Text mining; Crunchbase; Document classification |
dc.date.none.fl_str_mv |
2015-01-01T00:00:00Z 2015 2022-04-08T09:25:46Z 2022-04-08T10:22:26Z |
dc.type.driver.fl_str_mv |
conference object |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10071/25098 |
dc.language.iso.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
ISBN 978-1-4673-7428-6; ISSN 1544-5615; DOI 10.1109/FUZZ-IEEE.2015.7337892 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
IEEE |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |