Human blood plasma can be obtained relatively noninvasively and contains proteins from most, if not all, tissues of the body. Therefore, an extensive, quantitative catalog of plasma proteins is an important starting point for the discovery of disease biomarkers. In 2005, we showed that different proteomics measurements using different sample preparation and analysis techniques identify significantly different sets of proteins, and that a comprehensive plasma proteome can be compiled only by combining data from many different experiments. Applying advanced computational methods developed for the analysis and integration of very large and diverse data sets generated by tandem MS measurements of tryptic peptides, we have now compiled a high-confidence human plasma proteome reference set with well over twice the identified proteins of previous high-confidence sets. It includes a hierarchy of protein identifications at different levels of redundancy following a clearly defined scheme, which we propose as a standard that can be applied to any proteomics data set to facilitate cross-proteome analyses. Further, to aid in development of blood-based diagnostics using techniques such as selected reaction monitoring, we provide a rough estimate of protein concentrations using spectral counting. We identified 20,433 distinct peptides, from which we inferred a highly nonredundant set of 1929 protein sequences at a false discovery rate of 1%. We have made this resource available via PeptideAtlas, a large, multiorganism, publicly accessible compendium of peptides identified in tandem MS experiments conducted by laboratories around the world.
Blood plasma contains a combination of subproteomes derived from different tissues, and thus, it potentially provides a window into an individual's state of health. Therefore, a detailed analysis of the plasma proteome holds promise as a source of biomarkers that can be used for the diagnosis and staging of diseases, as well as for monitoring progression and response to therapy. For many years, before the era of proteomics, the classic multivolume reference, The Plasma Proteins by Frank Putnam (1975–1989) (1. Putnam F.W. The Plasma Proteins. 2nd Ed.
Academic Press, New York, 1975–1989), provided a foundation for studies of plasma proteins. In 2002, Anderson and Anderson (2. Anderson N.L. Anderson N.G. The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell Proteomics. 2002; 1: 845-867) published a review of 289 plasma proteins studied by a wide variety of methods, and quantified primarily with immunoassays, providing an early plasma proteome reference set. Subsequently, the widespread adoption of liquid chromatography-tandem MS (LC-MS/MS) techniques resulted in a rapid increase in plasma proteome-related data sets that needed to be similarly integrated to form a next-generation comprehensive human plasma proteome reference set. (The abbreviations used are: LC-MS/MS, liquid chromatography-tandem MS; HUPO, Human Proteome Organization; PPP, Plasma Proteome Project; IPI, International Protein Index; TPP, Trans-Proteomic Pipeline; FDR, false discovery rate; PSM, peptide-spectrum match; SRM, selected reaction monitoring.) In 2002, the Human Proteome Organization (HUPO) launched Phase I of its Human Plasma Proteome Project (PPP) and provided reference specimens of serum and EDTA-, citrate-, and heparin-anticoagulated plasma to 55 laboratories. Eighteen laboratories contributed tandem MS findings and protein identifications, which were integrated by a collaborative process into a core data set of 3020 proteins from the International Protein Index (IPI) database (3. Kersey P.J. Duarte J. Williams A. Karavidopoulou Y. Birney E. Apweiler R. The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004; 4: 1985-1988) containing two or more identified peptides, plus filters for smaller, higher confidence lists (4. Omenn G.S. States D.J. Adamski M. Blackwell T.W. Menon R. Hermjakob H. Apweiler R. Haab B.B. Simpson R.J. Eddes J.S. Kapp E.A. Moritz R.L. Chan D.W. Rai A.J.
Admon A. Aebersold R. Eng J. Hancock W.S. Hefta S.A. Meyer H. Paik Y.K. Yoo J.S. Ping P. Pounds J. Adkins J. Qian X. Wang R. Wasinger V. Wu C.Y. Zhao X. Zeng R. Archakov A. Tsugita A. Beer I. Pandey A. Pisano M. Andrews P. Tammen H. Speicher D.W. Hanash S.M. Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core data set of 3020 proteins and a publicly-available database. Proteomics. 2005; 5: 3226-3245; 5. Omenn G. Exploring the Human Plasma Proteome. Wiley-VCH, New York, NY, 2006). A stringent re-analysis of the PPP data, including adjustment for multiple comparisons, yielded 889 proteins (6. States D.J. Omenn G.S. Blackwell T.W. Fermin D. Eng J. Speicher D.W. Hanash S.M. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat. Biotechnol. 2006; 24: 333-338). Meanwhile, in 2004, Anderson et al. (7. Anderson N.L. Polanski M. Pieper R. Gatlin T. Tirumalai R.S. Conrads T.P. Veenstra T.D. Adkins J.N. Pounds J.G. Fagan R. Lobley A. The human plasma proteome: a nonredundant list developed by combination of four separate sources. Mol. Cell Proteomics. 2004; 3: 311-326) published a compilation of 1175 nonredundant plasma proteins reported in the 2002 literature review and in three published experimental data sets (8. Pieper R. Gatlin C.L. Makusky A.J. Russo P.S. Schatz C.R. Miller S.S. Su Q. McGrath A.M. Estock M.A. Parmar P.P. Zhao M. Huang S.T. Zhou J. Wang F. Esquer-Blasco R. Anderson N.L. Taylor J. Steiner S.
The human serum proteome: display of nearly 3700 chromatographically separated protein spots on two-dimensional electrophoresis gels and identification of 325 distinct proteins. Proteomics. 2003; 3: 1345-1364; 9. Adkins J.N. Varnum S.M. Auberry K.J. Moore R.J. Angell N.H. Smith R.D. Springer D.L. Pounds J.G. Toward a human blood serum proteome: analysis by multidimensional separation coupled with mass spectrometry. Mol. Cell Proteomics. 2002; 1: 947-955; 10. Tirumalai R.S. Chan K.C. Prieto D.A. Issaq H.J. Conrads T.P. Veenstra T.D. Characterization of the low molecular weight human serum proteome. Mol. Cell Proteomics. 2003; 2: 1096-1103). Only 46 were reported in all four sources, suggesting variability in the proteins detected by different methods, high false positive rates because of insufficiently stringent identification criteria, and nonuniform methods for assigning protein identifications. Shen et al. (11. Shen Y. Jacobs J.M. Camp 2nd, D.G. Fang R. Moore R.J. Smith R.D. Xiao W. Davis R.W. Tompkins R.G. Ultra-high-efficiency strong cation exchange LC/RPLC/MS/MS for high dynamic range characterization of the human plasma proteome. Anal. Chem. 2004; 76: 1134-1144) reported 800 to 1682 proteins from human plasma, depending on the proteolytic enzymes used and the criteria applied for identification; Omenn et al. (4. Omenn G.S. States D.J. Adamski M. Blackwell T.W. Menon R. Hermjakob H. Apweiler R. Haab B.B. Simpson R.J. Eddes J.S. Kapp E.A. Moritz R.L. Chan D.W. Rai A.J. Admon A. Aebersold R. Eng J. Hancock W.S. Hefta S.A. Meyer H. Paik Y.K. Yoo J.S. Ping P. Pounds J. Adkins J. Qian X. Wang R. Wasinger V. Wu C.Y. Zhao X. Zeng R. Archakov A. Tsugita A. Beer I. Pandey A. Pisano M. Andrews P. Tammen H. Speicher D.W. Hanash S.M.
Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core data set of 3020 proteins and a publicly-available database. Proteomics. 2005; 5: 3226-3245) re-analyzed those raw spectra with HUPO PPP-I search parameters and matched only 213 to the PPP-I core data set. Chan et al. reported 1444 unique proteins in serum using a multidimensional peptide separation strategy (12. Chan K.C. Lucas D.A. Hise D. et al. Serum/Plasma Proteome. Clinical Proteomics. 2004; 1: 101-225), of which 1019 mapped to IPI and 257 to the PPP-I core data set. These previous efforts highlight the challenges associated with accurately determining the number of proteins inferred from large proteomic data sets, and with comparing the proteins identified in different data sets. In 2005, we used a uniform method based on the Trans-Proteomic Pipeline (13. Keller A. Eng J. Zhang N. Li X.J. Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005; 1: 2005.0017) to create the first Human Plasma PeptideAtlas (14. Deutsch E.W. Eng J.K. Zhang H. King N.L. Nesvizhskii A.I. Lin B. Lee H. Yi E.C. Ossola R. Aebersold R. Human Plasma PeptideAtlas. Proteomics. 2005; 5: 3497-3500), containing 28 LC-MS/MS data sets and over 1.9 million spectra. Using a PeptideProphet (15. Keller A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002; 74: 5383-5392) probability threshold of p ≥ 0.90, 6929 peptides were identified at a peptide false discovery rate (FDR) of 12%, as estimated by PeptideProphet's data model, mapping to about 960 distinct proteins.
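The link between a probability threshold and a model-estimated FDR can be illustrated with a minimal Python sketch. It assumes the common probability-based estimate in which each accepted identification with probability p contributes (1 − p) expected false identifications. The function name and the probabilities are hypothetical and are not taken from the 2005 atlas.

```python
from typing import Sequence


def model_based_fdr(probabilities: Sequence[float], threshold: float) -> float:
    """Estimate the FDR among identifications with probability >= threshold.

    Assumption of this sketch: an accepted identification with probability p
    is wrong with probability (1 - p), so the expected number of false
    identifications is the sum of (1 - p) over the accepted set.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0
    expected_false = sum(1.0 - p for p in accepted)
    return expected_false / len(accepted)


# Hypothetical probabilities, for illustration only.
example_probs = [0.99, 0.95, 0.90, 0.90, 0.60, 0.20]
fdr_at_090 = model_based_fdr(example_probs, threshold=0.90)
print(f"Accepted at p >= 0.90: estimated FDR = {fdr_at_090:.3f}")
```

Under this form of estimate, the FDR of the accepted set cannot exceed 1 − threshold when it is computed over the same items that carry the probabilities. How the 12% peptide-level figure quoted above was derived is not detailed here, so the sketch shows only the general shape of such a calculation.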
Comparison of protein identifiers with those from studies cited above showed quite limited overlap. From the 2005 Human Plasma PeptideAtlas, as well as the PPP-I collaboration, we concluded that different proteomics experiments using different samples, depletion, fractionation, sample preparation, and analysis techniques identify significantly different sets of proteins. We decided that a comprehensive plasma proteome could be compiled only by combining data from many diverse, high-quality experiments, and strove to collect as much such data as possible. The resulting 2007 Human Plasma PeptideAtlas (unpublished), encompassing 53 LC-MS/MS data sets, identified 27,801 distinct peptides (four times the number in the 2005 Atlas) and 2738 proteins. In 2008, Schenk et al. (16. Schenk S. Schoenhals G.J. de Souza G. Mann M. A high confidence, manually validated human blood plasma protein reference set. BMC Med. Genomics. 2008; 1: 41) published a high-confidence set of 697 nonimmunoglobulin human plasma proteins based on measuring a single pooled sample on two high-end MS instruments after depletion, prefractionation, and protease inhibition, with stringent validation methods. This highly nonredundant set of proteins likely contains fewer false positives than any previous MS-derived plasma proteome reference set. The goal of the present work was to compile a larger human plasma proteome reference set of similar high confidence by creating a new release of the Human Plasma PeptideAtlas incorporating more data than in 2007 and interpreting the data using more stringent criteria. We searched raw data sets submitted to PeptideAtlas and performed peptide validation using a uniform pipeline (Fig. 1), compiled several sets of corresponding protein identifications at different clearly defined levels of redundancy (Fig.
2), and, using a spectral counting technique, provided a rough estimate of concentrations for a highly nonredundant set of protein sequences to guide blood-based diagnostic efforts such as doping using stable isotope-labeled synthetic reference peptides for selected reaction monitoring (SRM) experiments (Fig. 3). The result is a plasma proteome reference set (Fig. 4) (supplemental Tables S3 and S6) containing 1929 highly nonredundant protein sequences at an estimated 1% FDR.
Fig. 2. A, Six shaded bars (two of which overlap) represent sets of protein identifications at various levels of redundancy under the Cedar scheme. Tallies are for the Human Plasma PeptideAtlas. Beginning at bottom:
• Exhaustive set: contains any protein sequence in the atlas' combined protein sequence database (Swiss-Prot 2010–04 + IPI v3.71 + Ensembl v57.37) that includes at least one identified peptide.
• Sequence-unique set: exhaustive set with exact duplicates removed.
• Peptide-set-unique set: a subset of the sequence-unique set within which no two protein sequences include the exact same set of identified peptides.
• Not subsumed set: peptide-set-unique set with subsumed protein sequences removed (those for which the identified peptides form a proper subset of the identified peptides for another protein sequence).
• Canonical set: a subset of the not subsumed set within which no protein sequence includes more than 80% of the peptides of any other member of the set. Protein sequences that are not subsumed but not canonical are called possibly distinguished, because each has a peptide set that is close, but not identical, to that of a canonical protein sequence.
• Covering set: a minimal set of protein sequences that can explain all of the identified peptides.
B, Peptide-centric illustration of six protein sequences in a hypothetical ProteinProphet protein group, in order of descending ProteinProphet probability.
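The redundancy levels listed under panel A can be sketched as a chain of set operations. The Python below is an illustration only, not the Cedar implementation: the function name, the toy identifiers, sequences, and peptides are all hypothetical, the canonical set is built greedily in input order (an assumption), and the covering set uses a greedy approximation that is not guaranteed to be minimal.

```python
from typing import Dict, FrozenSet, List, Tuple

# protein id -> (sequence, identified peptides). Insertion order is assumed to
# reflect descending preference (e.g. probability, then preferred identifier).
ProteinTable = Dict[str, Tuple[str, FrozenSet[str]]]


def redundancy_levels(table: ProteinTable, max_shared: float = 0.8) -> Dict[str, List[str]]:
    peps = {pid: p for pid, (_, p) in table.items()}

    # Exhaustive: every sequence with at least one identified peptide.
    exhaustive = [pid for pid in table if peps[pid]]

    # Sequence-unique: drop exact duplicate sequences, keeping the first seen.
    sequence_unique, seen_seq = [], set()
    for pid in exhaustive:
        if table[pid][0] not in seen_seq:
            seen_seq.add(table[pid][0])
            sequence_unique.append(pid)

    # Peptide-set-unique: no two members share the exact same peptide set.
    peptide_set_unique, seen_sets = [], set()
    for pid in sequence_unique:
        if peps[pid] not in seen_sets:
            seen_sets.add(peps[pid])
            peptide_set_unique.append(pid)

    # Not subsumed: drop members whose peptides are a proper subset of another's.
    not_subsumed = [
        pid for pid in peptide_set_unique
        if not any(peps[pid] < peps[other] for other in peptide_set_unique)
    ]

    # Canonical (greedy, assumed): admit a member only if neither it nor any
    # already admitted member contains more than max_shared of the other's peptides.
    canonical: List[str] = []
    for pid in not_subsumed:
        too_close = any(
            len(peps[pid] & peps[c]) / min(len(peps[pid]), len(peps[c])) > max_shared
            for c in canonical
        )
        if not too_close:
            canonical.append(pid)

    # Covering (greedy approximation): repeatedly take the member that explains
    # the most still-unexplained peptides.
    covering: List[str] = []
    uncovered = set().union(*(peps[pid] for pid in not_subsumed))
    while uncovered:
        best = max(not_subsumed, key=lambda pid: len(peps[pid] & uncovered))
        covering.append(best)
        uncovered -= peps[best]

    return {
        "exhaustive": exhaustive,
        "sequence_unique": sequence_unique,
        "peptide_set_unique": peptide_set_unique,
        "not_subsumed": not_subsumed,
        "canonical": canonical,
        "possibly_distinguished": [p for p in not_subsumed if p not in canonical],
        "covering": covering,
    }


# Toy protein group (hypothetical data loosely modeled on panel B).
shared = [f"p{i}" for i in range(1, 11)]  # p1..p10
toy: ProteinTable = {
    "A": ("SEQ_A", frozenset(shared)),
    "A_dup": ("SEQ_A", frozenset(shared)),        # exact duplicate sequence
    "A'": ("SEQ_A_PRIME", frozenset(shared)),     # same peptides, different sequence
    "B": ("SEQ_B", frozenset(shared[1:] + ["b1"])),
    "C": ("SEQ_C", frozenset(shared[:9] + ["c1"])),
    "D": ("SEQ_D", frozenset(shared[:9] + ["b1"])),
    "E": ("SEQ_E", frozenset(["p1", "p2"])),
}
levels = redundancy_levels(toy)
for name, members in levels.items():
    print(f"{name:24s}{members}")
```

In this toy group the six levels shrink from seven entries to a single canonical sequence, with three possibly distinguished sequences and a three-member covering set. Real data would also need a rule for choosing among equivalent identifiers, which the sketch reduces to input order.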
Heavy lines represent protein chains (with invented identifiers); lighter lines represent observed peptides. Vertically aligned peptides are identical in sequence, and one instance of each is labeled with the letter of the highest probability protein to which it maps. A' is indistinguishable from A because it contains exactly the same set of observed peptides; both are equally likely to exist in the sample(s), but A is labeled canonical because its Swiss-Prot protein identifier is preferred. E is subsumed by A because its observed peptides form a subset of A's peptides; it is also subsumed by A', C, and D. Protein sequences B, C, and D are labeled possibly distinguished because the peptide set for each is slightly different from that of A. The three protein sequences with superscript C comprise the smallest subset of sequences sufficient to explain all the observed peptides in the group, and thus belong to the covering set.
Fig. 3. Plasma protein concentrations determined using immunoassay and antibody microarray analysis (40. Haab B.B. Geierstanger B.H. Michailidis G. Vitzthum F. Forrester S. Okon R. Saviranta P. Brinker A. Sorette M. Perlee L. Suresh S. Drwal G. Adkins J.N. Omenn G.S. Immunoassay and antibody microarray analysis of the HUPO Plasma Proteome Project reference specimens: systematic variation between sample types and calibration of mass spectrometry data. Proteomics. 2005; 5: 3278-3291) versus normalized spectral counts from the Human Plasma Non-glyco PeptideAtlas, plotted on a log scale. Each small square represents a protein found in both sources. Hollow squares represent proteins that were excluded when drawing the trend line (either depleted (albumin) or fewer than four spectrum counts). The line segments above and below the trend line are fit to the standard deviation of the y axis values computed at intervals of 0.1 (log scale).
The arrows on the left represent proteins with reported concentrations in (40. Haab B.B. Geierstanger B.H. Michailidis G. Vitzthum F. Forrester S. Okon R. Saviranta P. Brinker A. Sorette M. Perlee L. Suresh S. Drwal G. Adkins J.N. Omenn G.S. Immunoassay and antibody microarray analysis of the HUPO Plasma Proteome Project reference specimens: systematic variation between sample types and calibration of mass spectrometry data. Proteomics. 2005; 5: 3278-3291) but no spectrum counts. The histogram at the right depicts an estimate of the completeness of the Human Plasma Non-glyco PeptideAtlas as a function of concentration, calculated as the number of points divided by the total number of points and arrows within each decade. See supplemental Fig. S2 for N-Glyco atlas.
Fig. 4. Proteins identified by each experiment. Each bar represents one of the 91 experiments, ordered as in supplemental Table S4. Height of dark bar = canonical protein sequences identified per experiment; total height (dark + light) = cumulative tally; width of bar = PSM count. See supplemental Fig. S5 for a similar graph of distinct peptides.
We collected raw spectra from 91 high-quality LC-MS/MS data sets ((4. Omenn G.S. States D.J. Adamski M. Blackwell T.W. Menon R. Hermjakob H. Apweiler R. Haab B.B. Simpson R.J. Eddes J.S. Kapp E.A. Moritz R.L. Chan D.W. Rai A.J. Admon A. Aebersold R. Eng J. Hancock W.S. Hefta S.A. Meyer H. Paik Y.K. Yoo J.S. Ping P. Pounds J. Adkins J. Qian X. Wang R. Wasinger V. Wu C.Y. Zhao X. Zeng R. Archakov A. Tsugita A. Beer I. Pandey A. Pisano M. Andrews P. Tammen H. Speicher D.W. Hanash S.M.
Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core data set of 3020 proteins and a publicly-available database. Proteomics. 2005; 5: 3226-3245; 12. Chan K.C. Lucas D.A. Hise D. et al. Serum/Plasma Proteome. Clinical Proteomics. 2004; 1: 101-225; 17. Qian W.J. Monroe M.E. Liu T. Jacobs J.M. Anderson G.A. Shen Y. Moore R.J. Anderson D.J. Zhang R. Calvano S.E. Lowry S.F. Xiao W. Moldawer L.L. Davis R.W. Tompkins R.G. Camp 2nd, D.G. Smith R.D. Quantitative proteome analysis of human plasma following in vivo lipopolysaccharide administration using 16O/18O labeling and the accurate mass and time tag approach. Mol. Cell Proteomics. 2005; 4: 700-709; 18. Whiteaker J.R. Zhang H. Eng J.K. Fang R. Piening B.D. Feng L.C. Lorentzen T.D. Schoenherr R.M. Keane J.F. Holzman T. Fitzgibbon M. Lin C. Zhang H. Cooke K. Liu T. Camp 2nd, D.G. Anderson L. Watts J. Smith R.D. McIntosh M.W. Paulovich A.G. Head-to-head comparison of serum fractionation techniques. J. Proteome Res. 2007; 6: 828-836; 19. Liu T. Qian W.J. Gritsenko M.A. Camp 2nd, D.G. Monroe M.E. Moore R.J. Smith R.D. Human plasma N-glycoproteome analysis by immunoaffinity subtraction, hydrazide chemistry, and mass spectrometry. J. Proteome Res. 2005; 4: 2070-2080; 20. Liu T. Qian W.J. Gritsenko M.A. Xiao W. Moldawer L.L. Kaushal A. Monroe M.E. Varnum S.M. Moore R.J. Purvine S.O. Maier R.V. Davis R.W. Tompkins R.G. Camp 2nd, D.G. Smith R.D. High dynamic range characterization of the trauma patient plasma proteome. Mol. Cell Proteomics. 2006; 5: 1899-1913; 21. Armandola E.A. Proteome profiling in body fluids and in cancer cell signaling. Med. Gen.
Med. 2003; 5: 18) and several unpublished; supplemental Table S4, Supplemental Data), including 44 from Phase I PPP experiments, 13 from PPP Phase II, the Chan data set, and several from corporate research laboratories. Data from both plasma and serum samples, a variety of sample preparation techniques (depleted/not depleted, various fractionation schemata, use of protease inhibitors, N-linked glycocapture enrichment (22. Zhang H. Yi E.C. Li X.J. Mallick P. Kelly-Spratt K.S. Masselon C.D. Camp 3rd, D.G. Smith R.D. Kemp C.J. Aebersold R. High Throughput Quantitative Analysis of Serum Proteins Using Glycopeptide Capture and Liquid Chromatography Mass Spectrometry. Mol. Cell. Proteomics. 2005; 4: 144-155)), and analysis on a variety of instruments were included. All samples were digested with trypsin. Each data set consisted of between one and 38,252 LC-MS/MS runs (median 22) for a total of 48,789 LC-MS/MS runs (this total includes two extraordinarily large experiments together comprising 45,160 runs). For analysis, we separated the data sets into two groups, glycocapture and nonglycocapture, and later combined the results. The 69 data sets for the nonglycocapture samples were all selected from ion trap experiments because we wished to search them against an ion trap spectral library. Data were converted to mzXML (23. Pedrioli P.G. Eng J.K. Hubley R. Vogelzang M. Deutsch E.W. Raught B. Pratt B. Nilsson E. Angeletti R.H. Apweiler R. Cheung K. Costello C.E. Hermjakob H. Huang S. Julian R.K. Kapp E. McComb M.E. Oliver S.G. Omenn G. Paton N.W. Simpson R. Smith R. Taylor C.F. Zhu W. Aebersold R. A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 2004; 22: 1459-1466) and searched with SpectraST version 4.0 (24. Lam H. Deutsch E.W. Eddes J.S. Eng J.K. King N. Stein S.E. Aebersold R.
Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics. 2007; 7: 655-667) against a spectral library consisting of the NIST 3.0 human spectral library (261,777 consensus spectra) (25. NIST Peptide Mass Spectral Libraries. http://peptide.nist.gov) plus one SpectraST-generated (26. Lam H. Deutsch E.W. Aebersold R. Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. J. Proteome Res. 2010; 9: 605-610) decoy for each NIST spectrum. This library contains consensus spectra derived from actual identified spectra, some of which include missed cleavages and/or modifications. A precursor mass tolerance of 3.0 Th (thomson) was used. See supplemental Data for complete SpectraST parameters. The search results for each experiment were processed using the Trans-Proteomic Pipeline (TPP) (13. Keller A. Eng J. Zhang N. Li X.J. Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005; 1: 2005.0017), as shown in Fig. 1, left (see supplemental Data for TPP parameters used). PeptideProphet (15. Keller A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002; 74: 5383-5392) computed a probability for each peptide-spectrum match (PSM) for peptides of length seven or greater. iProphet (27. Shteynberg, D., Deutsch, E., Lam, H., Eng, J., Sun, Z., Tasman, N., Mendoza, L., Moritz, R. L., Aebersold, R., Nesvizhskii, A. I., (submitted) iProphet: Improved statistical validation of peptide identifications in shotgun proteomics. Mol.
Cell Proteomics) was applied to the PeptideProphet results to improve discrimination by modeling five additional properties of the data beyond those modeled by PeptideProphet, and adjusting peptide probabilities accordingly. The five models are number of sibling searches (rewards or penalizes identifications based on the output of multiple search engines, not applicable here), number of replicate spectra (models the assumption that precursor ions with multiple high probability identifications are more likely to be correct), number of sibling experiments (models the assumption that precursor ions observed in multiple experiments and matched to the same peptide sequence are more likely to be correct), number of sibling ions (rewards peptides identified by precursors with different charges), and number of sibling modifications (rewards peptides identified with different mass modifications). RefreshParser mapped each PSM to a combined protein sequence database derived from Swiss-Prot 2010–04 including splice variants (28. Boutet E. Lieberherr D. Tognolli M. Schneider M. Bairoch A. UniProtKB/Swiss-Prot. Methods Mol. Biol. 2007; 406: 89-112; 29. Boeckmann B. Bairoch A. Apweiler R. Blatter M.C. Estreicher A. Gasteiger E. Martin M.J. Michoud K. O'Donovan C. Phan I. Pilbout S. Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31: 365-370), IPI v3.71, Ensembl v57.37 (30. Hubbard T.J. Aken B.L. Ayling S. Ballester B. Beal K. Bragin E. Brent S. Chen Y. Clapham P. Clarke L. Coates G. Fairley S. Fitzgerald S. Fernandez-Banet J. Gordon L. Graf S. Haider S. Hammond M. Holland R. Howe K. Jenkinson A. Johnson N. Kahari A. Keefe D. Keenan S. Kinsella R. Kokocinski F. Kulesha E. Lawson D. Longden I. Megy K. Meidl P. Overduin B. Parker A. Pritchard B. Rios D. Schuster M. Slater G. Smedley D. Spooner W. Spudich G. Trevanion S. Vilella A. Vogel J. White S.
Wilder S. Zadissa A. Birney E. Cunningham F. Curwen V. Durbin R. Fernandez-Suarez X.M. Herrero J. Kasprzyk A. Proctor G. Smith J. Searle S. Flicek P. Ensembl 2009. Nucleic Acids Res. 2009; 37: D690-697), and cRAP v1.0 (31. cRAP, Common Repository of Adventitious Proteins. http://www.thegpm.org/cRAP). In many cases, the exact same protein sequence is included in the combined database multiple times because it is contained in multiple databases and/or because the Ensembl database includes many duplicates. Each PSM was mapped to all protein sequences containing the PSM's peptide sequence; in many cases this resulted in a PSM mapping to multiple protein sequences that are duplicates, splice variants, or paralogs. For very large data sets, the FDR at the peptide level tends to be much larger than that at the PSM level, and, at the protein level, much larger still (32. Reiter L. Claassen M. Schrimpf S.P. Jovanovic M. Schmidt A. Buhmann J.M. Hengartner M.O. Aebersold R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell Proteomics. 2009; 8: 2405-2417). Thus, in order to obtain a 1% decoy-estimated protein FDR for the final Human Plasma PeptideAtlas, a stringent PeptideProphet-estimated PSM FDR filter of 0.0002 (corresponding to probability cutoffs ranging from 0.9903 to 0.9998) was applied to each experiment. ProteinProphet (33. Nesvizhskii A.I. Keller A. Kolker E. Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem.
2003; 75: 4646-4658) was then run on each experiment, assigning to each distinct peptide the probability of its highest probability PSM, and further adjusting these probabilities using a number of sibling peptides model, which rewards peptides that map to proteins with many identified peptides. The set of identified peptides for the HsSerum NCI Large Survey experiment (12. Chan K.C. Lucas D.A. Hise D. et al. Serum/Plasma Proteome. Clinical Proteomics. 2004; 1: 101-225) was found to contain many pep