Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization
| Main author: | Mayuare, Miguelangel Leon |
|---|---|
| Publication date: | 2024 |
| Document type: | Dissertation |
| Language: | eng |
| Source title: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Full text: | http://hdl.handle.net/10362/174838 |
| Abstract: | Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science |
| id | RCAP_bd6b478a8b5aa3659168ba71520bbb15 |
|---|---|
| oai_identifier_str | oai:run.unl.pt:10362/174838 |
| network_acronym_str | RCAP |
| network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str | https://opendoar.ac.uk/repository/7160 |
| spelling | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization Natural Language Processing Chemical Language Modeling SMILES Representation SELFIES Representation Atom Pair Encoding Chemical Informatics Computational Chemistry Domínio/Área Científica::Ciências Naturais::Ciências da Computação e da Informação Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of two chemical language representations, Simplified Molecular-Input Line-entry System (SMILES) and SELF-referencing Embedded Strings (SELFIES), using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood-brain barrier penetration, with ROC-AUC serving as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science. Castelli, Mauro Peres, Fernando Augusto Junqueira RUN Mayuare, Miguelangel Leon 2024-11-08T12:29:58Z 2024-10-29 2024-10-29T00:00:00Z info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis application/pdf http://hdl.handle.net/10362/174838 TID:203784430 eng info:eu-repo/semantics/openAccess reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP 2025-01-13T01:42:06Z oai:run.unl.pt:10362/174838 Portal Agregador ONG https://www.rcaap.pt/oai/openaire info@rcaap.pt opendoar:https://opendoar.ac.uk/repository/7160 2025-05-28T19:13:02.224773 Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia false |
| dc.title.none.fl_str_mv | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| title | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| spellingShingle | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization Mayuare, Miguelangel Leon Natural Language Processing Chemical Language Modeling SMILES Representation SELFIES Representation Atom Pair Encoding Chemical Informatics Computational Chemistry Domínio/Área Científica::Ciências Naturais::Ciências da Computação e da Informação |
| title_short | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| title_full | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| title_fullStr | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| title_full_unstemmed | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| title_sort | Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization |
| author | Mayuare, Miguelangel Leon |
| author_facet | Mayuare, Miguelangel Leon |
| author_role | author |
| dc.contributor.none.fl_str_mv | Castelli, Mauro; Peres, Fernando Augusto Junqueira; RUN |
| dc.contributor.author.fl_str_mv | Mayuare, Miguelangel Leon |
| dc.subject.por.fl_str_mv | Natural Language Processing; Chemical Language Modeling; SMILES Representation; SELFIES Representation; Atom Pair Encoding; Chemical Informatics; Computational Chemistry; Domínio/Área Científica::Ciências Naturais::Ciências da Computação e da Informação |
| topic | Natural Language Processing; Chemical Language Modeling; SMILES Representation; SELFIES Representation; Atom Pair Encoding; Chemical Informatics; Computational Chemistry; Domínio/Área Científica::Ciências Naturais::Ciências da Computação e da Informação |
| description | Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science |
| publishDate | 2024 |
| dc.date.none.fl_str_mv | 2024-11-08T12:29:58Z; 2024-10-29; 2024-10-29T00:00:00Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/masterThesis |
| format | masterThesis |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10362/174838; TID:203784430 |
| url | http://hdl.handle.net/10362/174838 |
| identifier_str_mv | TID:203784430 |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
| instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str | RCAAP |
| institution | RCAAP |
| reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv | info@rcaap.pt |
| _version_ | 1833597948865282048 |
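
The abstract in this record attributes APE's advantage to keeping chemical elements intact during tokenization. As a rough illustration of that idea only, the Python sketch below contrasts character-level splitting with atom-level splitting of a SMILES string. The regular expression is a commonly used atom-level SMILES pattern and is an assumption here; it is not the thesis's APE algorithm, and the function names `char_tokens` and `atom_tokens` are hypothetical.

```python
import re
from typing import List

# Assumed atom-level SMILES pattern (not taken from the thesis): bracket atoms
# and two-letter halogens are matched before single-character symbols.
SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def char_tokens(smiles: str) -> List[str]:
    """Split a SMILES string into single characters (breaks 'Cl' and 'Br' apart)."""
    return list(smiles)


def atom_tokens(smiles: str) -> List[str]:
    """Split a SMILES string so that each atom symbol stays in one token."""
    tokens = SMILES_ATOM_PATTERN.findall(smiles)
    # Refuse input the pattern cannot fully cover rather than dropping characters.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    for smiles in ("ClCCBr", "[Na+].[Cl-]"):
        print(smiles, char_tokens(smiles), atom_tokens(smiles))
```

With character-level splitting, `Cl` becomes the separate symbols `C` and `l`, so a subword merge procedure starts from units that have no chemical meaning; the atom-level split keeps `Cl`, `Br`, and bracket atoms such as `[Na+]` whole.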