More than different pKa values have been reported for the ionizable groups [15]. Most algorithms use a nine-parameter model (seven pKa values corresponding to charged amino acids and two for the terminal groups), but more advanced algorithms also exist.
Additionally, some models were not complete, for instance Grimsley et al. Similarly, the Dawson model did not include the charge of terminal groups. Therefore, the missing values were introduced by taking the average from other pKa values or similar sets in order to improve the results (models with fewer than nine parameters always performed worse than those having all nine parameters; see, for instance, the results for the Patrickios six-parameter model).
The aim of the present study was to derive computationally more accurate pKa sets using currently available data. For training and validation, the following datasets were used. The IPC peptide pKa set was optimized using peptides from three high-throughput experiments. PHENYX dataset 7, peptides [4]: peptides from Drosophila Kc cell line fractionated using isoelectric focusing on an off-gel electrophoresis device. SEQUEST dataset 7, peptides [4]: peptides from Drosophila Kc cell line fractionated using isoelectric focusing on an off-gel electrophoresis device.
PIP-DB 4, entries [20]: based on literature data; provides pI and sequence information for about half of the records (for details see Table 5). First, the raw data from the individual datasets was parsed to a unified FASTA format with information about the isoelectric point stored in the headers. The data was carefully validated. No information about the experimental methods used for obtaining isoelectric points or their specificity was used implicitly during this study.
This step also removed duplicates (multiple entries assigned to the same sequence coming from two different databases). Detailed statistics for the datasets can be found in Table 5. As noted before, the isoelectric point is determined by iteratively calculating the sum of Eqs. The calculation can be performed exhaustively, but this would not be practical.
Instead, the bisection algorithm [25] is used, which halves the search space in each iteration: initially, the pH is set to 7 and then moves higher or lower by 3. In the next iteration, the pH is changed by 1. This process is repeated until the algorithm reaches the desired precision. Bisection improves the speed by 3 to 4 orders of magnitude, and after approximately a dozen iterations, the algorithm converges with 0.
A further speed improvement can be obtained by starting the search from a rough approximation of the solution rather than 7 (in this case, a pH of 6. To measure the performance, two metrics were used. To remove potential outliers, for the protein datasets, an MSE threshold of three was used, and for peptide datasets, an MSE of 0. Moreover, for the preliminary analysis, the Pearson correlation was used.
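The bisection search described above can be sketched in Python as follows. This is a minimal illustration, not the IPC source code. It assumes that the charge equations referred to above are the Henderson-Hasselbalch expressions for basic and acidic groups. The pKa values and the names `PKA_POSITIVE`, `PKA_NEGATIVE`, `net_charge` and `isoelectric_point` are placeholders chosen for this example and are not one of the pKa sets discussed in the text.

```python
# Illustrative pKa values only (assumed placeholders, NOT the IPC sets).
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Net charge of a sequence at a given pH (Henderson-Hasselbalch, assumed)."""
    # Each terminus counts once; side chains count once per residue.
    counts_pos = {"Nterm": 1, **{aa: sequence.count(aa) for aa in "KRH"}}
    counts_neg = {"Cterm": 1, **{aa: sequence.count(aa) for aa in "DECY"}}
    positive = sum(n / (1 + 10 ** (ph - PKA_POSITIVE[g]))
                   for g, n in counts_pos.items())
    negative = sum(n / (1 + 10 ** (PKA_NEGATIVE[g] - ph))
                   for g, n in counts_neg.items())
    return positive - negative


def isoelectric_point(sequence, precision=0.01, low=0.0, high=14.0):
    """Find the pH at which the net charge is zero by bisection."""
    # The net charge decreases monotonically with pH, so the sign of the
    # charge at the midpoint tells us which half contains the solution.
    while high - low > precision:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at a higher pH
        else:
            high = mid  # negatively charged (or neutral): pI lies lower
    return (low + high) / 2
```

With a pH range of 0 to 14 and a precision of 0.01, this loop halves the interval 11 times, which is in line with the "approximately a dozen" iterations mentioned above. Narrowing `low` and `high` around a rough initial estimate is one way to express the faster starting point described in the text.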
The cost function was defined as the root-mean-square deviation (RMSD) between the true isoelectric points from the available datasets and those calculated using the new pKa set(s). Optimization was performed using a basin-hopping procedure [10], which uses a standard Monte Carlo algorithm with the Metropolis criterion to decide whether to accept a new solution. The previously published pKa values were used as the initial seeds.
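The two ingredients named above, the RMSD cost and the Metropolis acceptance test, can be written down compactly. The following is a sketch under stated assumptions rather than the code used in the study: the function names `rmsd` and `metropolis_accept`, the `temperature` parameter and its default value are hypothetical, and the local minimization step that a full basin-hopping procedure performs after each perturbation is deliberately left out.

```python
import math
import random


def rmsd(experimental, predicted):
    """Root-mean-square deviation between experimental and predicted pI values."""
    if not experimental or len(experimental) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - p) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def metropolis_accept(old_cost, new_cost, temperature=1.0, rng=random):
    """Metropolis criterion: always accept improvements, sometimes accept worse."""
    if new_cost <= old_cost:
        return True
    # A worse solution is accepted with probability exp(-delta / T),
    # which lets the search escape local minima.
    return rng.random() < math.exp(-(new_cost - old_cost) / temperature)
```

In an optimization loop, `rmsd` would be evaluated on the pI values predicted with a candidate pKa set, and `metropolis_accept` would decide whether that candidate replaces the current one.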
To limit the search space, a truncated Newton algorithm [27] was used, with 2 pH unit bounds for the pKa variables. The optimization was run iteratively multiple times using intermediate pKa sets until the algorithm converged and no better solutions could be found. During training, nested fold cross-validation was used [28]. Thus, the IPC was optimized separately on k-1 partitions and tested on the remaining partition. The training was repeated ten times in all combinations.
The resulting pKa sets were averaged. In general, this process resulted in slower convergence of the algorithm and a longer training time but prevented overfitting.
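The cross-validation and averaging scheme described above might look roughly like the sketch below. It is an illustration only: the helper names (`k_fold_partitions`, `average_pka_sets`, `cross_validated_pka`) are hypothetical, the `train` and `evaluate` callables stand in for the optimization and scoring steps, and the way the data are split into folds is an assumption, since the text does not specify it.

```python
def k_fold_partitions(items, k):
    """Split items into k roughly equal partitions (simple striding, assumed)."""
    return [items[i::k] for i in range(k)]


def average_pka_sets(pka_sets):
    """Average several pKa sets (dicts with identical keys) key by key."""
    keys = pka_sets[0].keys()
    return {key: sum(s[key] for s in pka_sets) / len(pka_sets) for key in keys}


def cross_validated_pka(items, k, train, evaluate):
    """Train on k-1 partitions, test on the remaining one, average the results."""
    folds = k_fold_partitions(items, k)
    fitted, scores = [], []
    for i, held_out in enumerate(folds):
        # Training data: every partition except the held-out one.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pka = train(training)
        fitted.append(pka)
        scores.append(evaluate(pka, held_out))
    return average_pka_sets(fitted), scores
```

Here `train` would return one optimized pKa set per training split and `evaluate` would return its error on the held-out partition; the averaged set is what the text refers to as the final result.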
Apart from the nine-parameter model (nine pKa values for charged residues), more advanced models similar to Bjellqvist and ProMoST were also tested.
Their performance was on a similar level; thus, the simpler nine-parameter model was used in the final version of IPC. Moreover, IPC can be used on any operating system as a standalone program written in Python (Additional file 4). The scatter plot with the predicted isoelectric points versus molecular weight for all proteins is presented at the top. Then, for individual proteins, pI predictions based on different pKa sets are presented alongside the molecular weight and amino acid composition.
The author reviews the state of the art in the pI computation from protein sequence, provides an improved software tool and presents a WWW site with lots of related information, a WWW server and the software download. The manuscript describes a novel set of pKa values for peptides and proteins. The set can be used to estimate the isoelectric point of these macromolecules.
I have made a concerted effort to address all of his concerns. It is really interesting that the prediction works better for prokaryotic than for eukaryotic proteins. Can the author perform a bit more detailed analysis on this topic besides pointing out the role of PTMs? Do the worst outliers exhibit characteristic amino acid distributions? Additionally, I have chosen randomly selected sequences from non-outliers, to be sure that sample size does not matter. Conclusions: the outliers are usually shorter and slightly more disordered, but this is not statistically significant.
Similarly, secondary structure composition and charged amino acid frequencies are very similar. As the result of this analysis does not introduce any new, unexpected information, I added it only to the supplement and mentioned it briefly in the Methods section, where I defined outliers.
It could be interesting if the author could give any further insights into the variations of the pKa values in the sets and especially the divergence of the newly suggested values relative to those in the literature.
There is already a discussion of this in the manuscript just before the Conclusions section, but as it is both an important and an interesting aspect, the manuscript might benefit from a more detailed analysis of this question. Additionally, it would not improve the flow of the manuscript (this is rather off topic). Nevertheless, I also think that it is an interesting aspect; thus I added this information to Additional file 1: Table S3, underscoring the most divergent values and briefly discussing their possible source.
For proteins, the information about the experimental technique and the organism is available only partially. In any case, those data were not used directly during dataset construction or optimization, so as not to favor any technique or organism.
For instance, in the protein dataset most proteins come from eukaryotic organisms rather than from prokaryotes. More detailed data about the organism distribution can be seen in the pie plots in the supplement (Additional file 1: Figure S1). In a nutshell, most of the protein sequences come from human, E. I think that the current, brief description is sufficient, and that more detailed descriptions of the methods from the original studies are out of the scope of the presented manuscript and would extend it unnecessarily with minor benefit for the Readers.
The author states that when multiple data were available for the isoelectric point, the average was taken. It would be nice to know how divergent these data were and whether the author has any hints on whether this affects the performance in any detectable way. Moreover, it can be noticed that reported molecular weights To sum up this part of the comment, the primary databases used for the construction of protein and peptide datasets have different quality.
They may contain multiple annotation errors, but the only possible thing I could do in a high-throughput and automatic way is to minimize the effect of this noise (see for instance Table 3) by averaging the multiple measurements and removing the obvious errors identified by comparison of experimental and theoretical pI. Table 3 shows that the datasets used have a strong influence on the accuracy of the method (per value), but in most cases the order of the methods stays the same or is very similar, which indicates that even in noisy data the methods are capable of detecting the signal.
In the fold cross-validation process, how divergent were the resulting pKa sets that were averaged? What is the relation of this divergence to the diversity in the other data sets? There are many possible 9 sets of pKa values which produce only slightly worse results.
Therefore, the optimization was run 2, times to allow for exploring the search space in different places, and the local minimum was refined by basin-hopping.
Please provide a short explanation in the Methods section of the asterisked comments for Table 4 and Additional file 1: Table S3.
Having the initial results from the Patrickios six-parameter model, it was obvious that skipping Arg or terminal charges would have a detrimental effect on the performance; thus I decided to add them ad hoc. These values were taken as the average from a few scales or from the most similar scale I knew at the time (initially there were only 6 to 7 scales used, but over the years I implemented more and more scales).
The language of the manuscript needs careful revision. Most of the concepts can be deduced from the present version but the phrasing should be done with more care. So, although I think that the paper can be understood in its present form, I strongly recommend extensive language editing before final publication.
I hope that the corrected version of the manuscript is better. Although the headers could be simplified, and in the current version they may have a different form depending on the source they come from, I decided to leave them as they are, even if they sometimes seem hard to understand immediately, as this makes it easy to check the correctness of the parsing against the original files.
High resolution two-dimensional electrophoresis of proteins. J Biol Chem. Klose J. Protein mapping by combined isoelectric focusing and electrophoresis of mouse tissues.
Prefractionation techniques in proteome analysis. Added value for tandem mass spectrometry shotgun proteomics data validation through isoelectric focusing of peptides. J Proteome Res. Protein ionizable groups: pK values and their contribution to protein stability and solubility. J Chem Educ.
Isoelectric point optimization using peptide descriptors and support vector machines. J Proteome. Accurate estimation of isoelectric point of protein and peptide based on amino acid sequences. Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to atoms.
J Phys Chem A. The relationships between the isoelectric point and: length of proteins, taxonomy and ecology of organisms. BMC Genomics. The modal distribution of protein isoelectric points reflects amino acid properties rather than sequence evolution. Oren A. Microbial life at high salt concentrations: phylogenetic and metabolic diversity.
Saline Syst. Using isoelectric point to determine the pH for initial protein crystallization trials. A summary of the measured pK values of the ionizable groups in folded proteins.
Protein Sci. Nucleic Acids Res. A versatile peptide pI calculator for phosphorylated and N-terminal acetylated peptides experimentally tested using peptide isoelectric focusing. PIP-DB: the protein isoelectric point database.
The UniProt Consortium. UniProt: a hub for protein information. RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Assumptions: The primary assumption is that the spectral contributions of tyrosine, tryptophan and cystine at nm do not differ significantly in the native form of the protein relative to the denatured form.
The calculation is based on the Edelhoch model, in which proteins are examined in a 6 M guanidinium hydrochloride (Gdn-HCl) denaturing solution, which allowed for matching of native to denatured forms.