Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling

Bibliographic details
Main author: Leon, Miguelangel
Publication date: 2024
Other authors: Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro
Document type: Article
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10362/174019
Summary: Leon, M., Perezhohin, Y., Peres, F., Popovic, A., & Castelli, M. (2024). Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling. Scientific Reports, 14, Article 25016. https://doi.org/10.1038/s41598-024-76440-8. This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020), Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
id RCAP_c28a73a0cbe3afcabc71ca01b06f0651
oai_identifier_str oai:run.unl.pt:10362/174019
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
abstract Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood–brain barrier penetration, with ROC-AUC serving as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science.
dc.title.none.fl_str_mv Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling
dc.contributor.none.fl_str_mv NOVA Information Management School (NOVA IMS)
Information Management Research Center (MagIC) - NOVA Information Management School
RUN
dc.contributor.author.fl_str_mv Leon, Miguelangel
Perezhohin, Yuriy
Peres, Fernando
Popovic, Ales
Castelli, Mauro
dc.subject.por.fl_str_mv Natural Language Processing
Chemical Language Modeling
SMILES Representation
SELFIES Representation
Atom Pair Encoding
Chemical Informatics
Computational Chemistry
General
SDG 3 - Good Health and Well-being
publishDate 2024
dc.date.none.fl_str_mv 2024-10-24T23:19:13Z
2024-12-31
2024-12-31T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/174019
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 2045-2322
PURE: 101455268
https://doi.org/10.1038/s41598-024-76440-8
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv 13
application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt