The errors that can be introduced into a protein model during model building and refinement vary tremendously in their importance and severity [[1] Kleywegt G.J. Jones T.A. Where freedom is given, liberties are taken. Structure. 1995; 3: 535-540; [2] Brändén C.I. Jones T.A. Between objectivity and subjectivity. Nature. 1990; 343: 687-689]. At one extreme, the mainchain may be traced entirely incorrectly in the experimental map, or the molecular replacement solution may be wrong. Minor errors may include an incorrect peptide orientation, or misplaced or excessive water molecules. The reasons why errors creep into models are many, but for a structure built into an experimental map, the main ones are limited resolution and poorly phased diffraction data. Other things being equal, the resolution of the diffraction data should be the ultimate variable that determines the accuracy of a structural investigation. Inevitably, life is more complicated, and a successful structural investigation is often a learning experience for the people involved.

To ensure the correctness of a study, the crystallographer has relied on the R factor, which gives an overall measure of how well the final model fits the experimental diffraction data. The trust in this indicator for structures solved at medium and low resolution has been severely dented by a series of high-profile studies in which severe errors were made and went undetected. To supplement this single indicator, a number of new figures of merit have been suggested, of which the free R factor of Brünger [[3] Brünger A.T. Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature. 1992; 355: 472-475] is particularly useful and is increasingly being used [[4] Kleywegt G.J. Brünger A.T. Checking your imagination: applications of the free R value. Structure. 1996; 4: 897-904].

One of the more surprising results of high-resolution diffraction studies on proteins has been the observation that the conformational angles show preferences for (combinations of) values that are expected based on simple energy considerations. This has prompted us to rely on the use of sidechain rotamers during the initial map interpretation stage [[5] Jones T.A. Zou J.Y. Cowan S.W. Kjeldgaard M. Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A. 1991; 47: 110-119] and during refinement at low resolution. Deviations from the preferred conformations can then be used as indicators of potential error. It must be emphasized, however, that these are merely potential error indicators and they must be carefully evaluated against the experimental information that is available to the crystallographer.

Due to steric hindrance, the mainchain of a polypeptide usually assumes preferred, energetically favourable conformations [[6] Ramakrishnan C. Ramachandran G.N. Stereochemical criteria for polypeptide and protein chain conformations. II. Allowed conformations for a pair of peptide units. Biophys. J. 1965; 5: 909-933]. For each residue, these conformations can be characterized by the value of two torsion angles, φ and ψ (the third angle, ω, is largely restricted to values of 180° for trans-peptides, and 0° for cis-peptides). The φ angle of residue i is defined by the torsion C(i−1)-N(i)-Cα(i)-C(i), and ψ by the torsion N(i)-Cα(i)-C(i)-N(i+1). The distribution of φ and ψ is usually called the Ramachandran plot.
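As an illustration of the torsion definitions above, the following minimal Python sketch computes a signed torsion angle from four atomic positions and wraps it into φ and ψ helpers. The function names (`dihedral`, `phi`, `psi`) and the use of plain coordinate tuples are assumptions made for this example; they are not taken from any program mentioned in the text.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p1, p2, p3, p4):
    """Signed torsion angle in degrees about the p2-p3 bond (range -180..180)."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / math.sqrt(_dot(b2, b2))
    return math.degrees(math.atan2(y, x))

def phi(c_prev, n, ca, c):
    """phi of residue i: torsion C(i-1)-N(i)-CA(i)-C(i)."""
    return dihedral(c_prev, n, ca, c)

def psi(n, ca, c, n_next):
    """psi of residue i: torsion N(i)-CA(i)-C(i)-N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

The same `dihedral` function applied to Cα(i)-C(i)-N(i+1)-Cα(i+1) would give ω, which should come out near 180° for a trans-peptide and near 0° for a cis-peptide.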
More than ten years ago, such plots were used to remove two structures from a database of high-resolution structures that had been created for model building in experimental maps [[7] Jones T.A. Thirup S. Using known substructures in protein model building and crystallography. EMBO J. 1986; 5: 819-822] (TAJ, unpublished results). Both structures have since been shown to contain severe errors. The Ramachandran plot will clearly show how well the φ and ψ angles cluster and will reveal other oddities that may be the result of errors made during refinement. Unfortunately, many scientific magazines consider such a plot to be too technical for their readership, who are more interested in biological relevance and beautiful pictures.

In our experience, the Ramachandran plot is one of the simplest and most sensitive means for assessing the quality of a protein model in the absence of experimental data. The major reason for this is probably that the φ, ψ angles (or combinations of these) are not usually restrained during X-ray refinement, as opposed to bond lengths and bond angles, for instance. Therefore, an indicator that shows how much a particular structure deviates from the preferred areas of a Ramachandran plot is, we believe, a requirement for assessing the quality of a protein model.

With the advent of the program ProCheck [[8] Laskowski R.A. MacArthur M.W. Moss D.S. Thornton J.M. PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 1993; 26: 283-291], Ramachandran plots have gained somewhat in popularity. ProCheck divides the Ramachandran plot into four types of area: most favoured, additional allowed, generously allowed and disallowed. A typical good model should not only have very few residues within the disallowed regions, but also very many in the most favoured regions.
Unfortunately, the division into four regions has given rise to confusion when it comes to reporting the quality of the Ramachandran plot. Many authors quote only the number or percentage of residues in disallowed regions; others quote only those in the most favoured regions. Even more difficult to interpret is a phrase such as, '∼80% of the residues were found to lie in allowed regions according to ProCheck'. This phrase may describe a high-quality model (where the authors meant to say 'most favoured') but can equally well be used to describe a very poor model (if the authors meant to say '∼20% in disallowed regions'). For instance, the backwards-traced model of cellular retinoic acid-binding protein which we described earlier [1] has 8.9% of its residues in disallowed regions, and only 42.7% in the most favoured regions. Nevertheless, an unscrupulous crystallographer could report this as '91% of the residues lie in allowed regions of the Ramachandran plot'. This problem was also recently noted by Karplus who, in an independent study, found that 'much of conformational space designated as allowed and generously allowed, and even some of the core region is very rarely (or not at all) observed' [[9] Karplus P.A. Experimentally observed conformation-dependent geometry and hidden strain in proteins. Protein Sci. 1996; 5: 1406-1420].

In order to remedy this problem, we have carried out an analysis of high-resolution protein structures (see the Methods section). This has resulted in a division of the Ramachandran plot into two areas: core and non-core. The core regions consist of the most populated 10° by 10° areas which together account for 98% of all non-glycine residues in our sample (Figure 1); together they occupy only 19.7% of the entire plot area.
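The core-region construction described above (the most populated 10° by 10° squares that together hold 98% of the residues) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a 36 by 36 grid in which 180° wraps onto the −180° bin, it breaks ties between equally populated squares arbitrarily, and the names `bin_index`, `tally` and `core_cells` are hypothetical. It is not the Fortran subroutine referred to elsewhere in the text.

```python
def bin_index(angle_deg, width=10):
    """Map an angle in degrees to a grid index; 180 wraps onto the -180 bin."""
    n_bins = 360 // width
    return int((angle_deg + 180.0) // width) % n_bins

def tally(phi_psi, width=10):
    """Count (phi, psi) pairs per grid square; returns a square list of lists."""
    n = 360 // width
    counts = [[0] * n for _ in range(n)]
    for phi, psi in phi_psi:
        counts[bin_index(phi, width)][bin_index(psi, width)] += 1
    return counts

def core_cells(counts, coverage=0.98):
    """Pick the most populated squares until they hold `coverage` of all residues."""
    total = sum(map(sum, counts))
    ranked = sorted(
        ((c, i, j) for i, row in enumerate(counts)
         for j, c in enumerate(row) if c > 0),
        reverse=True,  # most populated squares first
    )
    core, running = set(), 0
    for c, i, j in ranked:
        if running >= coverage * total:
            break
        core.add((i, j))
        running += c
    return core
```

The fraction of the plot area occupied by the core is then `len(core)` divided by the number of squares in the grid, which is the quantity reported as 19.7% in the text.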
By having a binary classification scheme, ambiguities concerning allowed regions are avoided. Also, we have chosen to include proline residues in the analysis, as the most populated areas of the Ramachandran plot for these residues are not outside the areas found for all other non-glycine residues (data not shown).

Figure 2 shows the relationship between resolution and the percentage of residues in non-core regions (outliers) for more than 3000 protein structures from the Protein Data Bank (PDB). Table 1 shows the distribution of the percentage of outliers for all protein X-ray structures (with full coordinates and at least 20 residues) that were in the PDB in February 1996. This shows that ∼91% of all structures have 10% or fewer outliers. Only ∼4% have more than 15% outliers, and ∼1.5% have more than 25% outliers. However, these numbers vary a great deal as a function of time: for structures deposited between 1973 and 1980, 66.7% have no more than 15% outliers; for the period 1981–1985, this number is 85.8%; for 1986–1990 it is 94.2%; and for 1991–1995 it is 97.1%. Indeed, there is a weak negative correlation between the year of deposition and the percentage of outliers (correlation coefficient −0.16).

Table 1. Distribution of the percentage of Ramachandran outliers in protein models in the PDB.

Ramachandran outliers (%)    Number of PDB entries    Fraction∗ (%)
0–5                          2354                     76.5
5–10                         448                      14.6
10–15                        143                      4.7
15–20                        60                       2.0
20–25                        26                       0.9
25–30                        16                       0.5
30–35                        14                       0.5
35–40                        7                        0.2
40–45                        4                        0.1
45–50                        1                        0.03
>50                          3                        0.1

∗Fraction of total number of PDB entries.

In Figure 2, it is the high-resolution structure of gramicidin [[10] Langs D.A. Three-dimensional structure at 0.86 Å of the uncomplexed form of the transmembrane ion channel peptide gramicidin A. Science. 1988; 241: 188-191] which is responsible for the noticeable outlier. Figure 3 shows its Ramachandran plot, which has more than 60% outliers.
However, this molecule (refined to 0.86 Å resolution) is a small peptide, which contains both D- and L-amino acids. The Ramachandran plot shows that the preferred φ, ψ angles of the D-amino acids are positioned symmetrically around the diagonal of the plot from those of the L-amino acids. This case clearly demonstrates that outliers in a Ramachandran plot are not necessarily errors. However, it is the responsibility of the crystallographer to investigate whether outliers are due to errors in the model, or whether they represent unusual features of the structure.

We have also looked at Ramachandran plots for protein models for which a free R value [3] is quoted in the PDB entry. We identified 127 such entries and find that the fraction of outliers is slightly more strongly correlated with the free R value (correlation coefficient +0.57) than with the conventional R value (+0.49); however, resolution is the most strongly correlated factor (+0.64).

Of course, one could introduce φ, ψ restraints during refinement in order to cosmetically improve the model. We have tried this with X-PLOR [[11] Brünger A.T. X-PLOR, Version 3.1. A System for X-ray Crystallography and NMR. Yale University Press, New Haven, CT, USA, 1992] using the 3.2 Å model of the complex of the Fc fragment of human IgG with the C2 domain of protein G [[12] Sauer-Eriksson A.E. Kleywegt G.J. Uhlén M. Jones T.A. Crystal structure of the C2 fragment of streptococcal protein G in complex with the Fc domain of human IgG. Structure. 1995; 3: 265-278], which has 16% outliers. By introducing restraints for those residues which lie near any of the core regions, the fraction of outliers can be reduced to 11% even with a very low force constant (10 kcal mol−1 Å−2).
However, even with a force constant greater than 5000 kcal mol−1 Å−2 the fraction of outliers gets only as low as 9%, and this is then at the expense of an increase of both the conventional and the free R values. It therefore appears to be rather difficult to 'fudge' the indicator.

Ramachandran plots can be used to monitor the progress of the refinement and rebuilding of a protein model. For instance, Figure 4 shows the Ramachandran plots for various models of cellobiohydrolase I as the structure was built and refined [[13] Divne C. Jones T.A. et al. The three-dimensional crystal structure of the catalytic core of cellobiohydrolase I from Trichoderma reesei. Science. 1994; 265: 524-528]. Table 2 shows how the Ramachandran outlier indicator improved as the model improved (C Divne, personal communication).

Table 2. Concomitant improvement of model quality and Ramachandran plot during the refinement of cellobiohydrolase I.

Model number    R factor (%)    Ramachandran outliers (%)    Rmsd∗ to final model (Å)
A1              49.9            32.7                         0.715
A2              29.3            6.9                          0.314
A3              31.8            2.6                          0.236
A4              25.0            1.7                          0.165
A9              20.6            0.5                          0.084
A16             18.1            0.7                          0.002

∗Root mean square deviation.

The Ramachandran outlier indicator described here is, we believe, a very useful measure of how well the structure fits the expected mainchain torsion angle distribution. Nevertheless, a plot still contains a lot more information than a single number. This is clearly illustrated by the gramicidin example above, and by other examples (e.g. Figure 5 shows one case where the crystallographers appear to have gone to some extremes to reduce the number of residues with positive φ angles).

Sometimes macromolecules crystallize with more than one independent copy of the molecule inside the crystallographic asymmetric unit.
In such a case of non-crystallographic symmetry (NCS), a simple modification of the Ramachandran plot instantly turns it into a means to visualize the difference in the backbone torsion angles for corresponding residues in the various NCS-related molecules [[14] Kleywegt G.J. Use of non-crystallographic symmetry in protein structure refinement. Acta Cryst. D. 1996; 52: 842-857]. The modification entails a simple calculation of the centroid φ, ψ angles for each set of NCS-related residues, and connecting the points in the Ramachandran plot to this centroid. Normally, one would expect most residues to cluster fairly tightly, although some clusters with larger spread may occur, for example in hinge regions [14]. However, if one finds that most or all clusters show severe scatter, one might want to introduce NCS restraints in further refinement to avoid artefactual differences between the NCS-related molecules.

A number of deposited structures containing NCS are known to show large numbers of Ramachandran outliers as well as large differences between the mainchain torsion angles of NCS-related residues. Models with more than 15% outliers should be regarded with caution; the depositing authors should probably try to correct these. If NCS is involved, we suggest that more care should be taken during refinement to prevent the introduction of artefacts [1, 14].
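The centroid construction for NCS-related residues described above could be sketched as follows. The text does not say how the centroid is averaged, so this example assumes a circular mean (so that angles near +180° and −180° average sensibly) and measures each copy's distance to the centroid with periodic angle differences. All names here (`circular_mean_deg`, `angular_diff`, `ncs_spread`) are hypothetical and are not taken from LSQMAN or any other program cited.

```python
import math

def circular_mean_deg(angles):
    """Mean direction of a set of angles in degrees (assumed averaging scheme)."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))

def angular_diff(a, b):
    """Smallest signed difference a - b in degrees, in the range -180..180."""
    return (a - b + 180.0) % 360.0 - 180.0

def ncs_spread(phi_psi_copies):
    """For one residue, return the (phi, psi) centroid over all NCS copies and
    the length of the line connecting each copy to that centroid."""
    phi_c = circular_mean_deg([phi for phi, _ in phi_psi_copies])
    psi_c = circular_mean_deg([psi for _, psi in phi_psi_copies])
    lengths = [
        math.hypot(angular_diff(phi, phi_c), angular_diff(psi, psi_c))
        for phi, psi in phi_psi_copies
    ]
    return (phi_c, psi_c), lengths
```

Residues whose connecting lines are consistently long across the whole chain would suggest unrestrained, artefactual differences, whereas a few long lines concentrated in one segment are what one would expect for a genuine hinge.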
Some structures with NCS display genuine conformational differences, for example as global domain motions. This is illustrated by a new structure of ligand-free ribose-binding protein, in which there are two molecules in the asymmetric unit. They differ by a domain rotation of ∼12° in one molecule relative to the other (SL Mowbray, personal communication); Figure 6 shows a Ramachandran plot of these structures. Overall, there are only 1% outliers and the vast majority of the mainchain torsion angles are similar in both models. However, particularly in the hinge region, there are some real differences that manifest themselves as longer connecting lines that together generate the domain rotation.

We used the list of Hobohm and Sander [[15] Hobohm U. Sander C. Enlarged representative set of protein structures. Protein Sci. 1994; 3: 522-524], of August 1995, and the PDB [[16] Bernstein F.C. Tasumi M. et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 1977; 112: 535-542] release of October 1995, to create a set of 403 protein models. These models had no more than 95% sequence identity, contained more than 20 amino acid residues, and had been solved by X-ray crystallography at a resolution better than, or equal to, 2.0 Å. For each model, all atoms (and their associated torsion angles) whose temperature factor was higher than the average protein temperature factor plus two standard deviations were discarded. This was done in order to exclude residues from the analysis whose conformation might have been determined more by the restraints or force field used in the refinement than by actual experimental data. For the Ramachandran analysis, the plot was divided into squares of 10° by 10°, and the φ, ψ combinations in each square were tallied for 81782 residues.
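The temperature-factor filter described above (discard atoms above the mean plus two standard deviations, together with any torsion angle that depends on them) might be sketched like this. The atom representation as `(residue_number, atom_name, b_factor)` tuples, the function names, and the use of the population standard deviation are assumptions made for the example; the text does not specify which estimator was used.

```python
import statistics

def b_factor_cutoff(b_factors, n_sigma=2.0):
    """Mean temperature factor plus n_sigma standard deviations."""
    mean_b = statistics.mean(b_factors)
    sigma_b = statistics.pstdev(b_factors)  # population form: an assumption
    return mean_b + n_sigma * sigma_b

def reliable_atoms(atoms, n_sigma=2.0):
    """Return the set of (residue_number, atom_name) keys at or below the cutoff."""
    atoms = list(atoms)
    cutoff = b_factor_cutoff([b for _, _, b in atoms], n_sigma)
    return {(res, name) for res, name, b in atoms if b <= cutoff}

def torsion_is_usable(defining_atoms, kept):
    """A torsion is kept only if all four atoms that define it survived."""
    return all(atom in kept for atom in defining_atoms)
```

For φ of residue i, `defining_atoms` would be the keys for C of residue i−1 and N, CA and C of residue i; for ψ it would be N, CA and C of residue i and N of residue i+1.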
Although the distributions are different for different residue types, here we only discuss the statistics pertaining to all (74893) non-glycine residues. The distribution of φ, ψ values for these residues is shown in Figure 1. Note that the area commonly associated with β structure actually contains two maxima. The outer contour line delineates the most populated areas which together account for 98% of all non-glycine residues. For an average X-ray model determined at a resolution of 2.0 Å or better, one would expect ∼0–5% of the non-glycine residues to lie outside the shaded areas (an estimate determined by analyzing all protein models in the PDB solved at a resolution of 2.0 Å or better). The average fraction of outliers for all structures (i.e. at all resolutions) is 4% (σ 5%).

We have implemented this new definition of core regions in all our programs that use or produce Ramachandran plots, including O [5], OOPS [[17] Kleywegt G.J. Jones T.A. Efficient rebuilding of protein structures. Acta Cryst. D. 1996; 52: 829-832], LSQMAN [14] and MOLEMAN2 (GJK, unpublished program). A list of outlier percentages of more than 3000 proteins from the February 1996 release of the PDB is available on the World Wide Web (http://alpha2.bmc.uu.se/∼gerard/rama/rama.html). This site also contains the 37 by 37 matrix of residue counts, as well as a Fortran subroutine that implements our core region definition.
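To make the outlier indicator concrete, here is a minimal Python sketch of how the percentage of non-core residues could be computed for one model once a set of core squares is available. It is not a translation of the Fortran subroutine mentioned above: the 36 by 36 indexing, the `core` set of `(phi_bin, psi_bin)` pairs, and the function names are all assumptions for illustration, and the real core definition would have to be taken from the published matrix of residue counts.

```python
def bin_index(angle_deg, width=10):
    """Map an angle in degrees to a grid index; 180 wraps onto the -180 bin."""
    return int((angle_deg + 180.0) // width) % (360 // width)

def is_outlier(phi, psi, core, width=10):
    """True if the (phi, psi) pair falls outside every core square."""
    return (bin_index(phi, width), bin_index(psi, width)) not in core

def outlier_percentage(residues, core, width=10):
    """Percentage of non-glycine residues outside the core regions.

    residues: iterable of (residue_name, phi, psi); entries with a missing
    angle (None, e.g. at chain termini) and glycines are skipped.
    """
    considered = outliers = 0
    for name, phi, psi in residues:
        if name == "GLY" or phi is None or psi is None:
            continue
        considered += 1
        if is_outlier(phi, psi, core, width):
            outliers += 1
    return 100.0 * outliers / considered if considered else 0.0
```

With a toy core consisting of the single square `(11, 14)`, a model with one alanine and one proline in that square and one serine at (60°, 60°) would score one outlier out of three non-glycine residues.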
We would like to thank Dr C Divne for providing us with her CBH I models, and Dr SL Mowbray for allowing access to the ribose-binding protein model prior to publication. This work was supported by the Swedish Natural Science Research Council, Uppsala University and the European Union (grant number BIO4-CT96-0189 to TAJ). GJ Kleywegt and TA Jones, Department of Molecular Biology, Biomedical Centre, Uppsala University, Box 590, S-751 24 Uppsala, Sweden. E-mail address for TA Jones (corresponding author): [email protected]