A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations
Main Author: | Reis, Pedro B P S |
---|---|
Publication Date: | 2022 |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/10451/59848 |
Summary: | The ability to sense and react to external and internal pH changes is a survival requirement for any cell. pH homeostasis is tightly regulated, and even minor disruptions can severely impact cell metabolism, function, and survival. The pH dependence of proteins can be attributed to only 7 of the 20 canonical amino acids: the titratable amino acids, which can exchange protons with water in the usual 0-14 pH range. These amino acids make up approximately 31% of all amino acids in the human proteome, meaning that, on average, roughly one-third of each protein is sensitive not only to the pH of the medium but also to alterations in the electrostatics of its surroundings. Unsurprisingly, protonation switches have been associated with a wide array of protein behaviors, including modulating the binding affinity in protein-protein, protein-ligand, or protein-lipid systems, modifying enzymatic activity and function, and even altering protein stability and subcellular location. Despite their importance, our molecular understanding of pH-dependent effects in proteins and other biomolecules is still very limited, particularly in large macromolecular complexes such as protein-protein or membrane protein systems. Over the years, several classes of methods have been developed to provide molecular insights into the protonation preference and pH dependence of biomolecules. Empirical methods offer cheap and competitive predictions for time- or resource-constrained situations. Although more computationally expensive, continuum electrostatics (CE)-based methods are a cost-effective solution for estimating microscopic equilibrium constants, pKhalf, and macroscopic pKa values. To study pH-dependent conformational transitions, constant-pH molecular dynamics (CpHMD) is the appropriate methodology. Unfortunately, given the computational cost and, in many cases, the difficulty associated with using CE-based and CpHMD methods, most researchers overuse empirical methods or neglect the effect of pH in their studies. 
Here, we address these issues by proposing multiple pKa prediction methods and tools, at different levels of theory, designed to be faster and accessible to more users. First, we introduced PypKa, a flexible tool to predict Poisson–Boltzmann/Monte Carlo-based (PB/MC) pKa values of titratable sites in proteins. It was validated against a large set of experimental values and exhibited competitive performance. PypKa supports CPU parallel computing and can be used directly on proteins obtained from the Protein Data Bank (PDB) repository or from molecular dynamics (MD) simulations. A simple, reusable, and extensible Python API is provided, allowing pKa calculations to be easily added to existing protocols with a few extra lines of code. This capability was exploited in the development of PypKa-MD, an easy-to-use implementation of the stochastic titration CpHMD method. PypKa-MD supports the GROMOS and CHARMM force fields, as well as modern versions of GROMACS. Using PypKa’s API, and the consequent abstraction of the PB/MC calculations, contributed to a greatly simplified modular architecture that will serve as the foundation for future developments. The new implementation was validated on alanine-based tetrapeptides with closely interacting titratable residues and on four commonly used benchmark proteins, displaying pKa predictions highly similar to, and well correlated with, those of a previously validated implementation. As in most structure-based computational studies, the majority of pKa calculations are performed on experimental structures deposited in the PDB. Furthermore, there is an ever-growing imbalance between the scarce experimental pKa values and the growing number of resolved structures. To save the countless hours and resources that would be spent on repeated calculations, we have released pKPDB, a database of over 12M theoretical pKa values obtained by running PypKa on 120k protein structures from the PDB. 
The precomputed pKa estimates can be retrieved instantaneously via our web application, the PypKa Server. If the protein of interest is not in pKPDB, the user may easily run PypKa in the cloud, either by uploading a custom structure or by submitting an identifier code from the PDB or UniProtKB. It is also possible to use the server to obtain structures with representative pH-dependent protonation states for use in other computational methods, such as molecular dynamics. The advent of artificial intelligence in the biological sciences presented an opportunity to drastically accelerate pKa predictors using our previously generated database of pKa values. With pKAI, we introduced the first deep learning-based predictor of pKa shifts in proteins trained on continuum electrostatics data. By combining a reasonable understanding of the underlying physics, an accuracy comparable to that of physics-based methods, and inference time speedups of more than 1000×, pKAI provided a game-changing solution for fast estimation of macroscopic pKa values from ensembles of microscopic values. However, several limitations needed to be addressed before it could be integrated within the CpHMD framework as a replacement for PypKa. Hence, we proposed pKAI-MD, a new graph neural network for protein pKa predictions suitable for CpHMD. This model estimates pH-independent energies to be used in a Monte Carlo routine that samples representative microscopic protonation states. While developing the new model, we explored different graph representations of proteins using multiple electrostatics-driven properties. Although many new features remain to be introduced and many developments remain to be pursued, the selection of methods and tools presented in this work represents a significant improvement over the alternatives and effectively constitutes a new generation of user-friendly and machine learning-accelerated methods for pKa calculations. |
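The summary mentions Monte Carlo sampling of microscopic protonation states and pKhalf values without showing how the two connect. The following is a minimal illustrative sketch of that idea, not the PypKa or pKAI-MD implementation: it assumes a toy model in which each site has a hypothetical intrinsic pKa (`pk_int`) and pairs of protonated sites interact through a hypothetical coupling matrix in kT units (`coupling`). All function names and numbers are invented for illustration.

```python
import math
import random

LN10 = math.log(10.0)


def state_energy(state, pH, pk_int, coupling):
    """Energy (kT units) of a protonation state; state[i] == 1 means site i is protonated.

    Toy model (an assumption, not the thesis's energy function): each protonated
    site costs ln(10) * (pH - pK_int), plus a pairwise term when two sites are
    protonated at the same time.
    """
    energy = 0.0
    for i, n_i in enumerate(state):
        energy += n_i * LN10 * (pH - pk_int[i])
        for j in range(i + 1, len(state)):
            energy += n_i * state[j] * coupling[i][j]
    return energy


def sample_occupancies(pH, pk_int, coupling, steps=20000, seed=0):
    """Metropolis Monte Carlo estimate of the mean protonation of each site at one pH."""
    rng = random.Random(seed)
    n_sites = len(pk_int)
    state = [0] * n_sites
    energy = state_energy(state, pH, pk_int, coupling)
    totals = [0] * n_sites
    for _ in range(steps):
        site = rng.randrange(n_sites)
        state[site] ^= 1  # propose flipping one site's protonation
        new_energy = state_energy(state, pH, pk_int, coupling)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta):
            energy = new_energy  # accept the move
        else:
            state[site] ^= 1  # reject: restore the previous state
        for i in range(n_sites):
            totals[i] += state[i]
    return [t / steps for t in totals]


def pk_half(ph_values, occupancies):
    """Linearly interpolate the pH at which a titration curve crosses 0.5."""
    for k in range(len(ph_values) - 1):
        upper, lower = occupancies[k], occupancies[k + 1]
        if upper >= 0.5 >= lower and upper != lower:
            frac = (upper - 0.5) / (upper - lower)
            return ph_values[k] + frac * (ph_values[k + 1] - ph_values[k])
    return None  # the curve never crosses 0.5 on this grid


if __name__ == "__main__":
    # Hypothetical two-site system; a positive coupling penalizes double protonation.
    pk_int = [4.0, 4.5]
    coupling = [[0.0, 2.0], [2.0, 0.0]]
    ph_grid = [0.5 * k for k in range(29)]  # pH 0 to 14 in steps of 0.5
    curves = [sample_occupancies(pH, pk_int, coupling) for pH in ph_grid]
    for site in range(len(pk_int)):
        curve = [point[site] for point in curves]
        print(f"site {site}: pKhalf ~ {pk_half(ph_grid, curve)}")
```

In this toy setting the pKhalf of each site is read off the sampled titration curve rather than assigned directly, which is why interacting sites can end up with values that differ from their intrinsic pKa.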
id |
RCAP_a8f9ff6a06761f7ddfd4e36f2c0cdaad |
---|---|
oai_identifier_str |
oai:repositorio.ulisboa.pt:10451/59848 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
dc.title.none.fl_str_mv |
A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations |
title |
A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations |
author |
Reis, Pedro B P S |
author_facet |
Reis, Pedro B P S |
author_role |
author |
dc.contributor.none.fl_str_mv |
Machuqueiro, Miguel Ângelo dos Santos; Viçosa, Diogo Ruivo dos Santos Vila; Rocchia, Walter; Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv |
Reis, Pedro B P S |
dc.subject.por.fl_str_mv |
PKa; Protonação; Ph constante; Machine learning; Protonation; Constant-pH; Domínio/Área Científica::Ciências Naturais::Ciências Biológicas |
topic |
PKa; Protonação; Ph constante; Machine learning; Protonation; Constant-pH; Domínio/Área Científica::Ciências Naturais::Ciências Biológicas |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-10 2023-10-17T14:20:00Z 2023-05 2023-05-01T00:00:00Z |
dc.type.driver.fl_str_mv |
doctoral thesis |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10451/59848 TID:101661797 |
url |
http://hdl.handle.net/10451/59848 |
identifier_str_mv |
TID:101661797 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833601732565794816 |