We developed a new computational algorithm for the accurate identification of ligand binding envelopes rather than surface binding sites. We performed a large scale classification of the identified envelopes according to their shape and physicochemical properties. The predicting algorithm, called PocketFinder, uses a transformation of the Lennard-Jones potential calculated from a three-dimensional protein structure and does not require any knowledge about a potential ligand molecule. We validated this algorithm using two systematically collected data sets of ligand binding pockets from complexed (bound) and uncomplexed (apo) structures from the Protein Data Bank, 5616 and 11,510, respectively. As many as 96.8% of experimental binding sites were predicted at better than 50% overlap level. Furthermore 95.0% of the asserted sites from the apo receptors were predicted at the same level. We demonstrate that conformational differences between the apo and bound pockets do not dramatically affect the prediction results. The algorithm can be used to predict ligand binding pockets of uncharacterized protein structures, suggest new allosteric pockets, evaluate feasibility of protein-protein interaction inhibition, and prioritize molecular targets. Finally the data base of the known and predicted binding pockets for the human proteome structures, the human pocketome, was collected and classified. The pocketome can be used for rapid evaluation of possible binding partners of a given chemical compound. We developed a new computational algorithm for the accurate identification of ligand binding envelopes rather than surface binding sites. We performed a large scale classification of the identified envelopes according to their shape and physicochemical properties. The predicting algorithm, called PocketFinder, uses a transformation of the Lennard-Jones potential calculated from a three-dimensional protein structure and does not require any knowledge about a potential ligand molecule. We validated this algorithm using two systematically collected data sets of ligand binding pockets from complexed (bound) and uncomplexed (apo) structures from the Protein Data Bank, 5616 and 11,510, respectively. As many as 96.8% of experimental binding sites were predicted at better than 50% overlap level. Furthermore 95.0% of the asserted sites from the apo receptors were predicted at the same level. We demonstrate that conformational differences between the apo and bound pockets do not dramatically affect the prediction results. The algorithm can be used to predict ligand binding pockets of uncharacterized protein structures, suggest new allosteric pockets, evaluate feasibility of protein-protein interaction inhibition, and prioritize molecular targets. Finally the data base of the known and predicted binding pockets for the human proteome structures, the human pocketome, was collected and classified. The pocketome can be used for rapid evaluation of possible binding partners of a given chemical compound. Prediction of ligand binding sites is a fundamental step in the investigation of the molecular recognition mechanism and function of a protein. An increasing number of protein structures are becoming available from high throughput structural genomic projects prior to biological and functional characterization. Therefore, computational methods to predict ligand binding sites are becoming increasingly important. There are three independent sources of information that can be used to infer the location of possible ligand binding sites on the surface of a protein: (i) protein structure, (ii) evolutionary information (sequence alignments), and (iii) ligand/substrate information. A number of sophisticated algorithms using evolutionary information or algorithms predicting locations of binding sites for specific substrates have been published (1Campbell S.J. Gold N.D. Jackson R.M. Westhead D.R. Ligand binding: functional site location, similarity and docking.Curr. Opin. Struct. Biol. 2003; 13: 389-395Google Scholar, 2Lichtarge O. Yao H. Kristensen D.M. Madabushi S. Mihalek I. Accurate and scalable identification of functional sites by evolutionary tracing.J. Struct. Funct. Genomics. 2003; 4: 159-166Google Scholar, 3Lichtarge O. Sowa M.E. Evolutionary predictions of binding surfaces and interactions.Curr. Opin. Struct. Biol. 2002; 12: 21-27Google Scholar). Here we attempted to develop an algorithm that is based solely on the protein structure and without any prior knowledge about the nature of the substrate. We hypothesized that the structure itself is sufficiently informative, whereas the evolutionary conservation and the nature of the ligand can only be used as optional contributions. Proteins are involved in several kinds of molecular interactions: with other proteins, DNA, RNA, peptides, and small molecules. In this study we present an algorithm to predict the binding envelopes near potential small ligand binding sites or areas that could be targeted with small “druglike” compounds. Once the ligand binding pocket is predicted, a high throughput ligand docking procedure or structure-based drug design (4Walters W.P. Stahl M.T. Murcko M.A. Virtual screening—an overview.Drug Discov. Today. 1998; 3: 160-178Google Scholar, 5Klebe G. Recent developments in structure-based drug design.J. Mol. Med. 2000; 78: 269-281Google Scholar, 6Gane P.J. Dean P.M. Recent advances in structure-based rational drug design.Curr. Opin. Struct. Biol. 2000; 10: 401-404Google Scholar, 7Abagyan R. Totrov M. High-throughput docking for lead generation.Curr. Opin. Chem. Biol. 2001; 5: 375-382Google Scholar, 8Shoichet B.K. McGovern S.L. Wei B. Irwin J.J. Lead discovery using molecular docking.Curr. Opin. Chem. Biol. 2002; 6: 439-446Google Scholar, 9Anderson S. Chiplin J. Structural genomics: shaping the future of drug design?.Drug Discov. Today. 2002; 7: 105-107Google Scholar) can be used to generate a list of the lead molecules. The properties of druglike molecules are well studied (10Veber D.F. Johnson S.R. Cheng H.Y. Smith B.R. Ward K.W. Kopple K.D. Molecular properties that influence the oral bioavailability of drug candidates.J. Med. Chem. 2002; 45: 2615-2623Google Scholar, 11Lipinski C.A. Drug-like properties and the causes of poor solubility and poor permeability.J. Pharmacol. Toxicol. Methods. 2000; 44: 235-249Google Scholar) and cover a certain range of sizes, typically with molecular mass between 300 and 700 daltons. Therefore, we excluded from consideration very small ligands, such as metals and small solvent molecules, as well as very large substrates. However we wanted to develop an algorithm that within this size range does not depend on the nature of the ligand. A number of structure-based pocket prediction algorithms have been published over the last 10 years. They can be divided into two general classes: (i) geometric algorithms and (ii) probe mapping/docking algorithms. Geometric approaches analyze protein surfaces to find clefts. SURFNET (12Laskowski R.A. SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions.J. Mol. Graph. 1995; 13: 323-330Google Scholar) detects the gap regions in proteins by fitting spheres into the spaces between protein atoms. The sphere fitting process results in a number of separate groups of interpenetrating spheres, which correspond to the cavities and clefts of the protein. LIGSITE (13Hendlich M. Rippmann F. Barnickel G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins.J. Mol. Graph. Model. 1997; 15: 359-363Google Scholar), an improved version of POCKET (14Levitt D. Banaszak L. POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids.J. Mol. Graph. 1992; 10: 229-234Google Scholar), identifies clefts by putting the protein in a regular Cartesian grid and scanning along the x, y, and z axes and the cubic diagonals for areas that are enclosed on both sides by protein. APROPOS (15Peters K.P. Fauck J. Frommel C. The automatic search for ligand binding sites in proteins of known three-dimensional structure using only geometric criteria.J. Mol. Biol. 1996; 256: 201-213Google Scholar) and CAST (16Liang J. Edelsbrunner H. Woodward C. Anatomy of protein pockets and cavities: Measurement of binding site geometry and implications for ligand design.Protein Sci. 1998; 7: 1884-1897Google Scholar) are based on the α-shape algorithm, which identifies pockets by comparing surfaces of the protein generated with different levels of detail. PASS (17Brady Jr., G.P. Stouten P.F. Fast prediction and visualization of protein binding pockets with PASS.J. Comput.-Aided Mol. Des. 2000; 14: 383-401Google Scholar) identifies the “active site points” by coating the protein surface with a layer of spherical probes and then filtering out those that clash with the protein or are not sufficiently buried. In addition to those pure geometrical methods, some approaches based on mapping/docking and scoring of molecular fragments have been proposed (18Dennis S. Kortvelyesi T. Vajda S. Computational mapping identifies the binding sites of organic solvents on proteins.Proc. Natl. Acad. Sci. U. S. A. 2002; 99: 4290-4295Google Scholar, 19Kortvelyesi T. Silberstein M. Dennis S. Vajda S. Improved mapping of protein binding sites.J. Comput.-Aided Mol. Des. 2003; 17: 173-186Google Scholar, 20Ruppert J. Welch W. Jain A. Automatic identification and representation of protein binding sites for molecular docking.Protein Sci. 1997; 6: 524-533Google Scholar, 21Verdonk M.L. Cole J.C. Watson P. Gillet V. Willett P. SuperStar: improved knowledge-based interaction fields for protein binding sites.J. Mol. Biol. 2001; 307: 841-859Google Scholar, 22Bliznyuk A. Gready J. Simple method for locating possible ligand binding sites on protein surfaces.J. Comput. Chem. 1999; 9: 983-988Google Scholar, 23Glick M. Robinson D.D. Grant G.H. Richards W.G. Identification of ligand binding sites on proteins using a multi-scale approach.J. Am. Chem. Soc. 2002; 124: 2337-2344Google Scholar). Two excellent recent reviews of computational tools for identification of small molecule binding sites in proteins give a good overview of the field (1Campbell S.J. Gold N.D. Jackson R.M. Westhead D.R. Ligand binding: functional site location, similarity and docking.Curr. Opin. Struct. Biol. 2003; 13: 389-395Google Scholar, 24Sotriffer C. Klebe G. Identification and mapping of small-molecule binding sites in proteins: computational tools for structure-based drug design.Farmaco. 2002; 57: 243-251Google Scholar). Pure geometric methods are relatively straightforward, but there is no direct physical meaning behind them. On the other hand, methods using molecular fragment mapping and ligand docking are better physically justified but computationally expensive and cannot always provide a good discrimination between correct and incorrect sites. Also the previously published methods were typically tested on relatively small data sets. Although APROPS used a relatively large test set of about 300 structures, others only used 10–50 selected test cases. This is despite the fact that almost 30,000 x-ray structures have been deposited in the Protein Data Bank (25Bernstein F.C. Koetzle T.F. Williams G.J. Meyer Jr., E.E. Brice M.D. Rodgers J.R. Kennard O. Shimanouchi T. Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures.J. Mol. Biol. 1977; 112: 535-542Google Scholar, 26Berman H.M. Westbrook J. Feng Z. Gilliland G. Bhat T.N. Weissig H. Shindyalov I.N. Bourne P.E. The Protein Data Bank.Nucleic Acids Res. 2000; 28: 235-242Google Scholar). Finally, because the ultimate goal of the binding site prediction methods is to find active sites on uncharacterized structures, it is important to test and validate the algorithms on large sets of the “unbound” or apo structures. Only the PASS algorithm was tested on a data set of 21 apo structures; the other publications did not test the effect of induced conformational changes on the prediction accuracy. A benchmark test based on a large, systematic data set of apo structures is necessary for evaluating protein-ligand binding site identification methods. In this study, we present and validate a novel algorithm for prediction of ligand binding pockets. The algorithm called PocketFinder is based on a transformation of the Lennard-Jones potential. Like pure geometric approaches, PocketFinder is fast and capable of identifying clefts and cavities regardless of the nature of the substrate while being more sensitive and specific. Furthermore, in contrast to other methods, the PocketFinder algorithm not only detects the location of the binding pocket but also predicts envelopes representing the shape and size of putative ligand binding volume. The method was tested on a systematically collected data set 2 orders of magnitude larger than previous benchmarks: 5,616 binding sites collected from ligand-protein complexes and 11,510 apo binding sites inferred from the complexes by homology. All small molecule binding envelopes from the human structural proteome were collected and clustered into a pocketome. The predicted binding envelopes were hierarchically clustered. The complete pocketome may be useful for understanding a complex network of interactions between small molecules and the cell proteins. All three-dimensional protein structures were taken from the October 3, 2003 Protein Data Bank release. Prior to computation, we removed all ligands and water molecules from the structure. Protein-ligand binding pockets are predicted based on the grid potential map of van der Waals interaction of the receptor. To determine the regions of consistently high van der Waals attraction, the following four-step procedure was applied. The first step was to create the grid potential map of the van der Waals force field using a probe atom (parameters for an aliphatic carbon were used) in orthogonal parallelepiped surrounding receptor atoms (27Totrov M. Abagyan R. Flexible protein-ligand docking by global energy optimization in internal coordinates.Proteins. 1997; 29: 215-220Google Scholar). The grid had 1.0-Å spacing, and a margin of 1.0 Å beyond the dimensions of the protein was added. The potential was calculated according to the Lennard-Jones formula, Pp0= ∑l=1N ( AXlCrpl12−BXlCrpl6)(Eq. 1) where rpl is the distance between the probe p placed at a grid node and the protein atom Xl. Parameters AXC and BXC are taken from the Empirical Conformational Energy Program for Peptides (ECEPP)/3 molecular mechanics force field. Pp0 values were further truncated into the min(Pp0), −0.8 range to retain only the attractive regions. The second step was to smooth (space average) the potential map to emphasize the regions with the van der Waals potential consistently low across significant space span and to avoid excessive density fragmentation. Smoothing was performed by an iterative averaging of the potential Pijkn on each grid node (i,j,k) with the adjacent nodes. Pijkn+1=(Pijkn+(Pi−1jkn+Pi+1jkn+Pij−1kn+Pij+1kn+Pijk−1n+Pijk+1n)/6)/2(Eq. 2) Iterative application of this transformation closely approximates a (more computationally expensive) convolution with an averaging function. P(r→)=∫f(ρ→, r→)P0(ρ→)dρ→; f(ρ→, r→)=e−( ρ→−r→λ)2; λ=23Niter(Eq. 3) Ten iterations were performed, corresponding to the smoothing length parameter (radius) λ of ∼2.6 Å. The third step was to create putative ligand envelopes by contouring the resulting map at a level Pcont calculated as follows, Pcont=mean(Pijk)−τ×r.m.s.d.(Pijk)(Eq. 4) where the threshold parameter τ = 4.6 was established on the training data set, mean(Pijk) and r.m.s.d.(Pijk) are an average and a root mean square difference of all map potential values Pijk. The fourth step was to sort the created envelopes by their volumes and filter out those smaller than 100 Å3. All parameters for the map transformations were optimized using a large, diverse set of binding sites. The algorithm was coded with the ICM 1The abbreviations used are: ICM, internal coordinate mechanics; LP-Set, liganded pocket set collected from complexes; UP-Set, unliganded pocket set collected from apo structures; RO, relative overlap of predicted patch to the real binding patch; SCOP, Structural Classification of Proteins; MDM2, ubiquitin-protein ligase e3 mdm2; LFA-1, lymphocyte function associated antigen 1. scripting language (28Abagyan R. Totrov M. Kuznetsov D. ICM: a new method for structure modeling and design: applications to docking and structure prediction from the distorted native conformation.J. Comput. Chem. 1994; 15: 488-506Google Scholar, 29MolSoft ICM 2.8 Program Manual. MolSoft, LLC, San Diego, CA2000Google Scholar). Because our main objective was to develop and validate an accurate algorithm for predicting protein-ligand binding sites to which a typical druglike small molecule may bind, we first studied the size distribution of known drugs. We then compiled a comprehensive data base of appropriate protein ligand binding sites for benchmarking the performance of the pocket prediction algorithm. We started from the October 30, 2003 release of the Protein Data Bank that contained 17,730 crystallographic structures of proteins. Then we collected all structures that are complexed with heteromolecules (tagged by HET keywords in the Protein Data Bank) to compile a data set of observed binding sites from protein-ligand complexes (liganded-pocket set (LP-Set)). This resulted in 13,431 Protein Data Bank entries and 3,561 unique heteromolecules. Several filters were used to produce the liganded protein-ligand binding site data set. First, we introduced a ligand size and frequency filter. Heteromolecules containing fewer than seven heavy atoms were excluded in this research. This excluded metals and popular crystallization buffer components. To avoid a benchmark bias toward the most frequent ligands, we excluded high frequency cofactors or substrates, such as hemes, N-acetyl-d-glucosamine, α-d-mannose, certain sugars, etc. This filter reduced the number of entries to 7,275 (about 54% of all complexes). Second, a receptor quality filter was used. Only entries with resolution better than 2.5Å were retained. Proteins that consist of fewer than 50 or more than 2,000 residues were also excluded. This filter reduced the number of entries to 5,736. We applied other filters that did not reduce the size of the data set significantly but cleaned up the data as follows: (i) heteromolecules that are far away from the receptors (no atoms of the heteromolecules were within 3.5 Å from the receptor) were removed; (ii) heteromolecules that contact the symmetric parts of the receptor (within 3.5 Å in distance) were removed because their binding sites are formed between the asymmetric units, and building a correct model requires biological information; (iii) ion clusters were removed; and (iv) duplicate combinations of Protein Data Bank entries and ligands were removed. The final data set consisted of 5,616 protein-ligand binding sites (LP-Set). That is the combination of 4,711 Protein Data Bank entries with 2,175 unique ligands. Because our main goal was to predict a potential binding envelope from an uncomplexed structure, it was critical to compile a benchmark of unliganded pocket sites (UP-Set). This additional data set helped us to validate the pocket prediction algorithm in a more realistic situation. Unliganded pockets may not be as obvious as the liganded pockets due to the ligand-induced conformational changes. Side chains often obstruct a part of the pocket in the absence of the ligand. The UP-Set was collected by superimposing the protein structure in the LP-Set onto its uncomplexed homologues and mapping the ligand to them. The proteins (apo structures) in the UP-Set had to meet the following criteria. (i) The sequence identity between apo structure and complexed protein is over 95%. (ii) The resolution of the structure is better than 2.5 Å. (iii) There are no mutations among the surface residues within 8 Å around the mapped ligand. (iv) There are no other ligands within 8 Å around the mapped ligand in the apo structure. Only the single chain receptors in the LP-Set were used to find the apo structures. The only reason for using the single chain receptors was to simplify the process of identifying homologues and superimposing the structures. We preserved all alternative binding pockets in the UP-Set to avoid an arbitrary choice of one representative unliganded pocket. A total of 11,510 unliganded pockets projections were collected from the eligible Protein Data Bank entries. These mapped binding sites corresponded to 1,445 binding sites from LP-Set. As a result, the UP-Set contains on average about eight different mappings of the same binding site in LP-Set. We calculated transformed version of the three-dimensional Lennard-Jones potential on a 1-Å grid surrounding the entire protein surface (see “Materials and Methods”) and then contoured at an optimized level to create pocket envelopes. The pocket envelope was represented by a triangulated surface (Fig. 1). We chose this potential because in contrast to purely geometrical methods it has a clear physical meaning, and at the same time it does not require any knowledge of the chemical nature of a ligand. Additionally the van der Waals component of the binding energy is present in complexes of various physical natures, including hydrophobic, charged, polar, or mixed complexes. After predicting the location of potential binding envelopes, we evaluated the quality of those predictions. The accuracy of each prediction was measured by the overlap of protein atoms in contact with the ligand and protein atoms in contact with the predicted envelope. This relative overlap (RO) parameter was calculated as follows, RO=(AL∩AE)/AL(Eq. 5) where AL is the solvent-accessible area of the receptor atoms within 3.5 Å from a bound ligand, and AE is the solvent-accessible area of the receptor atoms within 3.5 Å from the predicted envelope. A completely failed prediction would have RO equal to zero, whereas a perfect prediction would have RO close to 1.0. However, this requirement would be too strict because obviously the same pocket may bind different ligands, and all these ligands may share a core site but extend in different directions. We define the threshold of successful prediction as RO = 0.5, i.e. at least 50% of the accessible area AL has to be overlapped by the predicted area (AE). We first applied PocketFinder to the LP-Set to evaluate its performance on holo structures. The method successfully identified 96.8% of the 5,616 ligand binding sites as having a relative overlap RO > 0.5 between the observed and predicted sites (see the definition of the RO measure above). Furthermore 55.2% of binding sites were perfectly identified (RO = 1.0), and 85.7% of binding sites were identified with RO higher than 0.8, i.e. greater than 80.0% overlap (Fig. 2). A more realistic test was performed using a data set consisting of 11,510 binding sites collected from uncomplexed structures (the UP-Set). We expected that due to side-chain movements the predictability of the binding pockets would be somewhat reduced. Surprisingly 95.0% of the binding sites were predicted with RO greater than 0.5. These results suggest that the method is tolerant to the conformational variability of binding sites in uncomplexed structures. However, in the high RO region, only 67.0% of the apo binding sites showed RO higher than 0.8 as compared with 85.7% in the LP-Set (Fig. 2). For the percentage of completely overlapped binding sites (RO = 1.0), it was 37.5% compared with 55.2% for the LP-Set. We studied the effect of different conformational rearrangements observed in the unliganded binding sites of the UP-Set upon the prediction success. Because the 11,510 binding sites of the UP-Set were inferred based on 1,445 binding sites in the LP-Set, we grouped them into 1,445 clusters and sorted the clusters by the RO of LP-Set. The distribution of the RO values of UP-Set for 1,445 clusters is shown in Fig. 3. From this distribution we observed the following. (i) The overwhelming majority of UP-Set had RO > 0.5. (ii) For the clusters with high RO of LP-Set (RO > 0.9), the relative overlap of the UP-Set decreased compared with the LP-Set, but the majority were still above 0.8. However, some cases had much larger deterioration of the relative overlap. (iii) For the clusters with lower RO of LP-Set (RO < 0.8), the picture was mixed. We observed better overlap for the UP-Set conformations than the LP-Set, implying some “opening” of unliganded pockets. However, for the majority of 1,445 clusters the unliganded pockets appear more difficult to predict than the liganded pockets. Overall we may conclude that although the unliganded conformations of receptors were not as good as the liganded conformations for the binding site identification, there was only a marginal deterioration of recognition performance. To investigate the difference of the predicted envelopes between holo and apo structures, we applied our algorithm to a set of five enzymes with induced domain movements as described by Hayward (30Hayward S. Identification of specific interactions that drive ligand-induced closure in five enzymes with classic domain movements.J. Mol. Biol. 2004; 339: 1001-1021Google Scholar). For each pair, we superimposed the core domains of the proteins. For enzymes with relatively small domain movements (Fig. 4, A and B), the envelopes predicted from both holo and apo conformations are very similar and perfectly captured the bound ligands. For the enzymes with large domain movements (Fig. 4, C, D, and E), the envelopes show different sizes and shapes between the holo and apo structures. However, the ligand in case C was still successfully captured, and the ligands in cases D and E were partially captured. The algorithm may produce any number of putative binding envelopes depending on the nature of the protein surface and the cutoff volume of a predicted envelope. Fig. 5A depicts the relationship between the protein size and the number of predicted pockets. The number of envelopes predicted per protein is roughly proportional to the volume of the protein (one pocket per 10,000 Å3); for the majority of the proteins (volume is below or close to 50,000 Å3) in the data set, less than four envelopes were predicted. If several envelopes are generated, it is important to know which envelope corresponds to the real binding site(s) in the receptor. In the majority of cases, we found that the envelopes overlapping with the ligand were the largest envelopes. More specifically, 80.9% of the predicted envelopes overlapping with the ligand were the largest ones, and 11.8% were the second largest. We sorted the envelopes by volume, and Fig. 5B shows the rank of real pocket. This implies that although the algorithm may return several additional putative binding sites, the top two largest predicted binding sites cover as many as 92.7% of the real binding sites. Laskowski and colleagues (31Laskowski R.A. Luscombe N.M. Swindells M.B. Thornton J.M. Protein clefts in molecular recognition and function.Protein Sci. 1996; 5: 2438-2452Google Scholar) reported a similar result based on a data set of 67 single chain enzymes. The size of the predicted envelope is another important criterion of the prediction because oversized envelopes contribute to binding site coverage but are less precise at pointing to its exact location. The PocketFinder was optimized to generate envelopes that just fill the binding sites. To investigate the extent of overprediction, we calculated (i) the volumes of the envelopes with respect to the volumes of the bound ligands and (ii) the ratio of the predicted binding patch to the whole surface area of the receptor for each binding site in the LP-Set. The average value of the ligands was 439.6 Å3, whereas the average volume of the envelopes was 610.8 Å3. On average, the predicted envelope was ∼1.4 times larger in volume than the ligand. Considering that most ligands should fit inside their pockets and the ligands bound to those receptors were just some representative examples of all possible ligands, this value seems reasonable. For the predicted binding patch, the average ratio to the whole surface of the receptor was 4.7%. Fig. 5C shows the distribution of the ratio of contact area to the whole surface area of the protein for bound ligands and predicted envelopes. This distribution shows that the size of the predicted binding patch was close to the real binding area. Considered together with the high RO values of the prediction, we can conclude that the prediction was accurate. Unfortunately there are too many pockets in the benchmark to show each prediction graphically. However, to leave a visual impression of the results, we created an unbiased subset of the benchmark pockets using the following selection procedure. We first annotated the proteins in the LP-Set based on the Structural Classification of Proteins (SCOP) data base (32Andreeva A. Howorth D. Brenner S.E. Hubbard T.J. Chothia C. Murzin A.G. SCOP database in 2004: refinements integrate structure and sequence family data.Nucleic Acids Res. 2004; 32: D226-D229Google Scholar). They belonged to 10 of 11 classes in SCOP, showing high diversity of our data set. Structures from 261 folds, 347 superfamilies, and 589 families were represented. These 20 representatives in Fig. 6 were selected from different folds based on the following criteria: (i) the envelope size between 300 and 700 Å3, the range of sizes that was most populated, and (ii) the folds that had only one representative in LP-Set were chosen to avoid any subjective selection of representative proteins from folds. We then sorted these folds by their SCOP fold ID and chose the first 20 folds. For each protein, only the largest and the second largest (if it existed) envelopes were displayed. Fig. 6 demonstrates the accuracy and reliability of the envelope prediction. It is important to note that in most cases the predicted envelopes captured the ligands correctly with size and shape being close to the bound ligands. The complete data sets (LP-Set and UP-Set) and prediction results are available upon request. As the size of structural proteome grows, we can start compiling a pocketome, defined as the collection of all possible small molecule envelopes present in a cell. Here we collected, compared, and classified all envelopes from human protein-ligand complexes with known three-dimensional structure. There are 943 binding pockets from human proteins in the LP-Set. For the envelope clustering, five shape descriptors (volume, surface area, and the three principal axis lengths) and two physicochemical descriptors (Kyte and Doolittle hydrophobicity parameter (33Kyte J. Doolittle R.F. A simple method for displaying the hydropathic character of a protein.J. Mol. Biol. 1982; 157: 105-132Google Scholar) and electrostatic charge) were used to characterize the envelope. After representing each envelope by a seven-element vector, a Euclidean distance matrix was calculated. A hierarchical clustering algorithm was used to build a tree from the distance matrix. We compared the cluster tree of the human ligand envelopes with the tree based on the chemical similarity between ligands. This comparison highlights the complex relationship between ligands and their pockets. We know that the same pocket can bind chemically different ligands of somewhat different size and properties, and the same small molecule can bind to rather different pockets. Fig. 7 shows a fragment of those trees constructed from 160 binding sites (the tree for all 943 binding sites of available human protein-ligand complex structures is available upon request) where each branch is labeled by the Protein Data Bank code and ligand ID. These trees represent three different types of information: the classification by the sequences provides evolutionary information, the classification by the ligands shows chemical similarity (Tanimoto coefficient) of the bound ligands, and the classification by the envelopes provides the similarity of the binding pockets. We investigated five groups of ligands in the ligand tree (colored by green, blue, orange, yellow, and pink, respectively) and observed that, in most of the cases, the ligands belonging to the same group were clustered closely in the other trees as well. However, the difference between the classifications was observed, reflecting the complex nature of the ligand-pocket relationship: a ligand can bind to different pockets, and the same pocket may be good for different ligands. The clustering tree of putative ligand envelopes provided a new viewpoint for understanding the ligand-pocket relationship. By counting the number of similar envelopes we can estimate how common or rare a pocket is in the pocketome. We demonstrated that the PocketFinder algorithm can successfully identify or predict protein-ligand binding envelopes from complexed and apo structures. We also compiled the first draft of a human “pocketome” consisting of 943 envelopes. Albeit incomplete, this pocketome can grow along with the structural proteome.