A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations

Bibliographic Details
Main Author: Reis, Pedro B P S
Publication Date: 2022
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10451/59848
Summary: The ability to sense and react to external and internal pH changes is a survival requirement for any cell. pH homeostasis is tightly regulated, and even minor disruptions can severely impact cell metabolism, function, and survival. The pH dependence of proteins can be attributed to only 7 of the 20 canonical amino acids: the titratable amino acids, which can exchange protons with water in the usual 0-14 pH range. These amino acids make up approximately 31% of all amino acids in the human proteome, meaning that, on average, roughly one-third of each protein is sensitive not only to the pH of the medium but also to changes in the electrostatics of its surroundings. Unsurprisingly, protonation switches have been associated with a wide array of protein behaviors, including modulating binding affinity in protein-protein, protein-ligand, or protein-lipid systems, modifying enzymatic activity and function, and even altering protein stability and subcellular location. Despite its importance, our molecular understanding of pH-dependent effects in proteins and other biomolecules is still very limited, particularly in large macromolecular complexes such as protein-protein or membrane protein systems. Over the years, several classes of methods have been developed to provide molecular insights into the protonation preference and pH dependence of biomolecules. Empirical methods offer cheap and competitive predictions for time- or resource-constrained situations. Although more computationally expensive, continuum electrostatics (CE)-based methods are a cost-effective solution for estimating microscopic equilibrium constants, pKhalf, and macroscopic pKa. To study pH-dependent conformational transitions, constant-pH molecular dynamics (CpHMD) is the appropriate methodology. Unfortunately, given the computational cost and, in many cases, the difficulty of using CE-based and CpHMD methods, most researchers overuse empirical methods or neglect the effect of pH in their studies.
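As an illustration of the quantities named above, the following minimal sketch computes a titration curve and a pKhalf for a toy system by exact enumeration of its protonation microstates. It is not code from the thesis: the function names, the pK-unit interaction matrix `w`, and the two-site example are assumptions chosen for illustration, and the bisection assumes each site's protonation decreases monotonically with pH.

```python
import itertools
import math

LN10 = math.log(10.0)


def mean_protonation(pk_int, w, ph):
    """Average protonation of every site at one pH, by exact enumeration.

    pk_int: intrinsic pK of each site (hypothetical inputs).
    w[i][j]: interaction between protonated sites i and j, in pK units
             (positive means protonating both is unfavorable).
    """
    n = len(pk_int)
    partition = 0.0
    occupancy = [0.0] * n
    for state in itertools.product((0, 1), repeat=n):
        # Free energy of this microstate in pK units (relative to all deprotonated).
        g = sum(s * (ph - pk) for s, pk in zip(state, pk_int))
        g += sum(w[i][j] * state[i] * state[j]
                 for i in range(n) for j in range(i + 1, n))
        weight = math.exp(-LN10 * g)  # Boltzmann weight, 10**(-g)
        partition += weight
        for i, s in enumerate(state):
            occupancy[i] += s * weight
    return [o / partition for o in occupancy]


def pk_half(pk_int, w, site, lo=-10.0, hi=24.0, iterations=60):
    """pH at which `site` is half protonated, found by bisection.

    Assumes the site's protonation decreases monotonically with pH.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_protonation(pk_int, w, mid)[site] > 0.5:
            lo = mid  # still mostly protonated: pKhalf lies at higher pH
        else:
            hi = mid
    return 0.5 * (lo + hi)


isolated = pk_half([4.0], [[0.0]], site=0)
coupled = pk_half([4.0, 4.0], [[0.0, 1.0], [1.0, 0.0]], site=0)
print(f"isolated site pKhalf: {isolated:.2f}")
print(f"coupled site pKhalf:  {coupled:.2f}")
```

For an isolated site the result reduces to the Henderson-Hasselbalch case, so pKhalf equals the intrinsic pK. In the coupled example, the unfavorable interaction of 1 pK unit shifts each site's pKhalf down by half a unit. Exact enumeration scales as 2^N in the number of sites, which is why Monte Carlo sampling is used for real proteins.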
Here, we address these issues by proposing multiple pKa prediction methods and tools, at different levels of theory, designed to be faster and accessible to more users. First, we introduced PypKa, a flexible tool to predict Poisson–Boltzmann/Monte Carlo-based (PB/MC) pKa values of titratable sites in proteins. It was validated against a large set of experimental values and showed competitive performance. PypKa supports CPU parallel computing and can be used directly on proteins obtained from the Protein Data Bank (PDB) repository or from molecular dynamics (MD) simulations. A simple, reusable, and extensible Python API is provided, allowing pKa calculations to be added to existing protocols with a few extra lines of code. This capability was exploited in the development of PypKa-MD, an easy-to-use implementation of the stochastic titration CpHMD method. PypKa-MD supports GROMOS and CHARMM force fields, as well as modern versions of GROMACS. Using PypKa’s API, and the resulting abstraction of the PB/MC calculations, contributed to its greatly simplified modular architecture, which will serve as the foundation for future developments. The new implementation was validated on alanine-based tetrapeptides with closely interacting titratable residues and on four commonly used benchmark proteins, displaying highly similar and correlated pKa predictions compared to a previously validated implementation. Like most structure-based computational studies, the majority of pKa calculations are performed on experimental structures deposited in the PDB. Furthermore, there is an ever-growing imbalance between scarce experimental pKa values and the growing number of resolved structures. To save countless hours and resources that would be spent on repeated calculations, we have released pKPDB, a database of over 12M theoretical pKa values obtained by running PypKa over 120k protein structures from the PDB.
The precomputed pKa estimations can be retrieved instantaneously via our web application, the PypKa Server. If the protein of interest is not in pKPDB, the user may easily run PypKa in the cloud, either by uploading a custom structure or by submitting an identifier code from the PDB or UniProtKB. It is also possible to use the server to obtain structures with representative pH-dependent protonation states for use in other computational methods such as molecular dynamics. The advent of artificial intelligence in the biological sciences presented an opportunity to drastically accelerate pKa predictors using our previously generated database of pKa values. With pKAI, we introduced the first deep learning-based predictor of pKa shifts in proteins trained on continuum electrostatics data. By combining a reasonable understanding of the underlying physics, an accuracy comparable to that of physics-based methods, and inference time speedups of more than 1000×, pKAI provided a game-changing solution for fast estimations of macroscopic pKa from ensembles of microscopic values. However, several limitations needed to be addressed before it could be integrated within the CpHMD framework as a replacement for PypKa. Hence, we proposed pKAI-MD, a new graph neural network for protein pKa predictions suitable for CpHMD. This model estimates pH-independent energies to be used in a Monte Carlo routine to sample representative microscopic protonation states. While developing the new model, we explored different graph representations of proteins using multiple electrostatics-driven properties. While there are certainly many new features to be introduced and many developments still to be pursued, the selection of methods and tools presented in this work represents a significant improvement over the alternatives and effectively constitutes a new generation of user-friendly and machine learning-accelerated methods for pKa calculations.
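The last step described above, using pH-independent energies in a Monte Carlo routine to sample protonation states, can be sketched generically as follows. This is a textbook Metropolis scheme under stated assumptions, not the pKAI-MD or PypKa implementation: the function name, the intrinsic pK list `pk_int`, and the pK-unit interaction matrix `w` are hypothetical stand-ins for whatever energies a model would supply.

```python
import math
import random

LN10 = math.log(10.0)


def sample_protonation(pk_int, w, ph, n_steps=20000, seed=0):
    """Metropolis sampling of protonation states at a fixed pH.

    pk_int and w are the pH-independent inputs (hypothetical here):
    intrinsic pK values and symmetric site-site interactions in pK units.
    Returns the fraction of steps each site spent protonated.
    """
    rng = random.Random(seed)
    n = len(pk_int)
    state = [rng.randint(0, 1) for _ in range(n)]
    counts = [0] * n
    for _ in range(n_steps):
        i = rng.randrange(n)   # pick one site and try to flip it
        new = 1 - state[i]
        d = new - state[i]     # +1 protonates, -1 deprotonates
        coupling = sum(w[i][j] * state[j] for j in range(n) if j != i)
        # Energy change of the flip in kT; pH enters only through this term.
        delta = d * LN10 * (ph - pk_int[i] + coupling)
        if delta <= 0 or rng.random() < math.exp(-delta):
            state[i] = new
        for k in range(n):
            counts[k] += state[k]
    return [c / n_steps for c in counts]


# A single hypothetical site with intrinsic pK 4.0, sampled at three pH values.
for ph in (2.0, 4.0, 7.0):
    fraction = sample_protonation([4.0], [[0.0]], ph)[0]
    print(f"pH {ph:.1f}: protonated fraction {fraction:.3f}")
```

Because the inputs do not depend on pH, the same energies can be reused to sample states at any pH value, which is the property that makes such a scheme attractive inside a CpHMD loop. Production implementations typically add coupled two-site moves for strongly interacting pairs, which this sketch omits.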
id RCAP_a8f9ff6a06761f7ddfd4e36f2c0cdaad
oai_identifier_str oai:repositorio.ulisboa.pt:10451/59848
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv A new generation of user-friendly and machine learning-accelerated methods for protein pKa calculations
author Reis, Pedro B P S
author_facet Reis, Pedro B P S
author_role author
dc.contributor.none.fl_str_mv Machuqueiro, Miguel Ângelo dos Santos
Viçosa, Diogo Ruivo dos Santos Vila
Rocchia, Walter
Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Reis, Pedro B P S
dc.subject.por.fl_str_mv pKa
Protonation
Constant-pH
Machine learning
Domain/Scientific Area::Natural Sciences::Biological Sciences
publishDate 2022
dc.date.none.fl_str_mv 2022-10
2023-10-17T14:20:00Z
2023-05
2023-05-01T00:00:00Z
dc.type.driver.fl_str_mv doctoral thesis
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/59848
TID:101661797
url http://hdl.handle.net/10451/59848
identifier_str_mv TID:101661797
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833601732565794816