Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling
| Main author: | Leon, Miguelangel |
|---|---|
| Publication date: | 2024 |
| Other authors: | Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro |
| Document type: | Article |
| Language: | eng |
| Source title: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Full text: | http://hdl.handle.net/10362/174019 |
| Abstract: | Leon, M., Perezhohin, Y., Peres, F., Popovic, A., & Castelli, M. (2024). Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling. Scientific Reports, 14, Article 25016. https://doi.org/10.1038/s41598-024-76440-8 --- This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project - UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020) - Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442. |
| id |
RCAP_c28a73a0cbe3afcabc71ca01b06f0651 |
|---|---|
| oai_identifier_str |
oai:run.unl.pt:10362/174019 |
| network_acronym_str |
RCAP |
| network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str |
https://opendoar.ac.uk/repository/7160 |
| spelling |
Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling
Natural Language Processing; Chemical Language Modeling; SMILES Representation; SELFIES Representation; Atom Pair Encoding; Chemical Informatics; Computational Chemistry; General; SDG 3 - Good Health and Well-being
Leon, M., Perezhohin, Y., Peres, F., Popovic, A., & Castelli, M. (2024). Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling. Scientific Reports, 14, Article 25016. https://doi.org/10.1038/s41598-024-76440-8 --- This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project - UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020) - Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS. Aleš Popovič was supported by the Slovenian Research and Innovation Agency (ARIS) under research core funding P2-0442.
Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood–brain barrier penetration, with ROC-AUC serving as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science.
NOVA Information Management School (NOVA IMS); Information Management Research Center (MagIC) - NOVA Information Management School; RUN
Leon, Miguelangel; Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro
2024-10-24T23:19:13Z; 2024-12-31; 2024-12-31T00:00:00Z
info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; 13; application/pdf; http://hdl.handle.net/10362/174019; eng; 2045-2322; PURE: 101455268; https://doi.org/10.1038/s41598-024-76440-8; info:eu-repo/semantics/openAccess
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP
2025-01-06T01:34:28Z; oai:run.unl.pt:10362/174019; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; info@rcaap.pt; opendoar:https://opendoar.ac.uk/repository/7160; 2025-05-28T19:02:41.616313; Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; false |
| dc.title.none.fl_str_mv |
Comparing SMILES and SELFIES Tokenization for Enhanced Chemical Language Modeling |
| author |
Leon, Miguelangel |
| author_facet |
Leon, Miguelangel; Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro |
| author_role |
author |
| author2 |
Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro |
| author2_role |
author author author author |
| dc.contributor.none.fl_str_mv |
NOVA Information Management School (NOVA IMS); Information Management Research Center (MagIC) - NOVA Information Management School; RUN |
| dc.contributor.author.fl_str_mv |
Leon, Miguelangel; Perezhohin, Yuriy; Peres, Fernando; Popovic, Ales; Castelli, Mauro |
| dc.subject.por.fl_str_mv |
Natural Language Processing; Chemical Language Modeling; SMILES Representation; SELFIES Representation; Atom Pair Encoding; Chemical Informatics; Computational Chemistry; General; SDG 3 - Good Health and Well-being |
| topic |
Natural Language Processing; Chemical Language Modeling; SMILES Representation; SELFIES Representation; Atom Pair Encoding; Chemical Informatics; Computational Chemistry; General; SDG 3 - Good Health and Well-being |
| publishDate |
2024 |
| dc.date.none.fl_str_mv |
2024-10-24T23:19:13Z; 2024-12-31; 2024-12-31T00:00:00Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/174019 |
| url |
http://hdl.handle.net/10362/174019 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
2045-2322; PURE: 101455268; https://doi.org/10.1038/s41598-024-76440-8 |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
13; application/pdf |
| dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
| instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str |
RCAAP |
| institution |
RCAAP |
| reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv |
info@rcaap.pt |
| _version_ |
1833597836736856064 |
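
The abstract in this record contrasts Byte Pair Encoding (BPE) with Atom Pair Encoding (APE) on SMILES strings, but the record itself does not describe how APE works. The sketch below is only an illustration under stated assumptions: it assumes APE starts from atom-level tokens (so that multi-character atoms such as `Cl` or bracket atoms such as `[nH]` are never split) and then merges the most frequent adjacent token pairs in the same way BPE does. The regular expression and the function names `atom_tokens`, `merge_pair`, and `learn_merges` are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Atom-level SMILES pattern. This regex is an assumption for illustration;
# it covers bracket atoms, the two-letter halogens, common organic-subset
# atoms, ring-closure digits, and bond/branch symbols.
SMILES_ATOM = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.]"
)


def atom_tokens(smiles):
    """Split a SMILES string into atom, bond, ring, and branch symbols."""
    tokens = SMILES_ATOM.findall(smiles)
    # Refuse to continue if any character was silently skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE-style merges over atom-level tokens."""
    sequences = [atom_tokens(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        counts = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        if not counts:
            break
        best = counts.most_common(1)[0][0]
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges, sequences


if __name__ == "__main__":
    print(atom_tokens("CC(=O)Cl"))
    print(learn_merges(["CCO", "CCN", "CCCl"], num_merges=1))
```

A character-level BPE would begin from single characters and could split `Cl` into `C` and `l`; starting from atom-level tokens avoids that, which is one plausible reading of the abstract's claim that APE preserves "the integrity and contextual relationships among chemical elements".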