Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization

Bibliographic details
Main author: Mayuare, Miguelangel Leon
Publication date: 2024
Document type: Dissertation
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10362/174838
Summary: Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
id RCAP_bd6b478a8b5aa3659168ba71520bbb15
oai_identifier_str oai:run.unl.pt:10362/174838
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization
Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential.
This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of two chemical language representations, Simplified Molecular-Input Line-entry System (SMILES) and SELF-referencing Embedded Strings (SELFIES), using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atoms Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks.
The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood-brain barrier penetration, with ROC-AUC serving as the evaluation metric.
This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science.
dc.title.none.fl_str_mv Chemical Language Modeling: A Comparative Approach to SMILES and SELFIES Tokenization
author Mayuare, Miguelangel Leon
author_facet Mayuare, Miguelangel Leon
author_role author
dc.contributor.none.fl_str_mv Castelli, Mauro
Peres, Fernando Augusto Junqueira
RUN
dc.contributor.author.fl_str_mv Mayuare, Miguelangel Leon
dc.subject.por.fl_str_mv Natural Language Processing
Chemical Language Modeling
SMILES Representation
SELFIES Representation
Atom Pair Encoding
Chemical Informatics
Computational Chemistry
Domain/Scientific Area::Natural Sciences::Computer and Information Sciences
description Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
publishDate 2024
dc.date.none.fl_str_mv 2024-11-08T12:29:58Z
2024-10-29
2024-10-29T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/174838
TID:203784430
url http://hdl.handle.net/10362/174838
identifier_str_mv TID:203784430
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt