Biochemists have broken barriers in the capabilities of proteomics, providing cell biologists and medical researchers unparalleled opportunities to analyze complex protein mixtures. The next task is to generate new ideas in the choice and preparation of materials to analyze.
The MudPIT Breakthrough
Traditional proteomics methodologies separate complex protein samples by isoelectric point and molecular weight using 2-dimensional gels. Patterns can be compared between samples, but determining which protein is changing requires isolating individual protein spots, proteolyzing them, and analyzing the mass of each peptide by matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF) mass spectrometry. The measured peptide masses are searched against the predicted mass values for theoretical digestion of proteins in a sequence database, and the protein is identified by a statistically significant number of matches. Multidimensional Protein Identification Technology (MudPIT) eliminates gel separations. Instead, biochemical fractions containing many proteins are directly proteolyzed, and the enormous number of peptides generated are separated by 2-dimensional liquid chromatography before entering the mass spectrometer. In place of MALDI-TOF, the procedure employs tandem mass spectrometry: after the mass of a peptide is measured, the peptide is fragmented in a collision-induced dissociation cell and the masses of the fragmentation products are determined. With considerable computational effort, these data can typically be transformed into an amino acid sequence. Thus one peptide is often sufficient to identify a protein, a sensitivity advantage that enables identification of minor proteins in a biological fraction that cannot be visualized on 2-dimensional gels. Recent studies have identified 1,000 to 2,000 proteins in a single fraction with MudPIT.
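As a rough illustration of the peptide-mass database search used in the gel-based workflow, the Python sketch below digests each database sequence in silico, computes monoisotopic peptide masses, and counts how many observed masses fall within a tolerance. The trypsin-like cleavage rule (after K or R, but not before P), the 0.02 Da tolerance, and the two toy sequences are assumptions made for illustration only; the article names neither a protease nor a tolerance, and a real search would score statistical significance rather than simply count matches.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def digest(sequence):
    """Cleave after K or R unless the next residue is P (assumed trypsin-like rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance=0.02):
    """Number of observed masses within `tolerance` Da of any theoretical peptide mass."""
    theoretical = [peptide_mass(p) for p in digest(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )


def best_match(observed_masses, database, tolerance=0.02):
    """Name of the database entry with the most matching peptide masses."""
    return max(
        database,
        key=lambda name: count_matches(observed_masses, database[name], tolerance),
    )


# Toy database with made-up sequences (hypothetical, not real proteins).
toy_database = {
    "PROT_A": "MKWVTFISLLLLFSSAYSRGVFRR",
    "PROT_B": "MGDVEKGKKIFIMKCSQCHTVEK",
}
observed = [peptide_mass("WVTFISLLLLFSSAYSR"), peptide_mass("GVFR")]
print(best_match(observed, toy_database))
```

Counting matches is the simplest possible score; it is used here only to make the search step concrete.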
The statistical relevance of peptide identification is determined by calculating the probability that the tandem mass spectrum is a random match to a sequence of the same molecular weight in the database. Typically, the predicted fragmentation patterns from database sequences are compared to the observed tandem mass spectrum fragmentation pattern to determine how closely the sequence fits the spectrum. The probability of a random match is calculated based on the frequency of occurrence of fragment ions in the database (which can be influenced by database size). By using this approach, a statistical confidence is assigned to the match (Sadygov and Yates, 2003).
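The Sadygov and Yates paper listed in the references describes its model as hypergeometric. The Python sketch below shows the general form of such a calculation, a hypergeometric tail probability; it is not the published PEP_PROBE formula. The parameter names and their mapping onto fragment ions are assumptions chosen for illustration: `N` is the number of fragment ions in the reference population, `K` the number of those that match peaks in the spectrum, `n` the number of fragment ions predicted for the candidate peptide, and `k` the number of the candidate's ions that match.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    Probability of seeing at least k matches by chance when n items are
    drawn without replacement from N items, K of which are 'matching'.
    """
    if not (0 <= K <= N and 0 <= n <= N and k >= 0):
        raise ValueError("require 0 <= K <= N, 0 <= n <= N and k >= 0")
    upper = min(n, K)  # cannot observe more matches than draws or matching items
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Illustrative numbers only (not taken from the paper).
p_random = hypergeom_tail(N=1000, K=50, n=20, k=10)
print(p_random)
```

The smaller this tail probability, the less likely the observed number of matches arose by chance, which is the sense in which a confidence is assigned to the match.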
The validity of the algorithm is supported by a study that used MudPIT to identify nuclear envelope proteins (Schirmer et al., 2003). Mass spectra from the nuclear envelope fraction were first analyzed without the statistical cutoff values employed in the Sadygov algorithm, and 10 identified genes were cloned so that their nuclear envelope targeting could be tested. When the spectra were re-analyzed using the Sadygov algorithm, the matches for two of these proteins fell below the statistical confidence level. The confidence-deficient proteins did not target the nuclear envelope (ECS, L. Florens, JRY, and LG, unpublished), while all proteins above statistical confidence levels targeted the envelope, demonstrating the algorithm’s effectiveness in providing a refined, high-confidence list of nuclear membrane proteins.
The wide dynamic range and extreme sensitivity of MudPIT open many possibilities for the application of this powerful tool. While it cannot yet analyze the protein component of an entire mammalian cell, it is more than sufficient to analyze all proteins in a subcellular fraction (including most organelles) or in some pathogenic microorganisms.
Many human diseases are caused by mutations that alter protein levels or prevent interactions with partner proteins (which can result in mislocalization). Proteomics of subcellular fractions can be used to generate a more global map of protein distribution within a cell than can mRNA expression arrays or immunofluorescence. Excellent protocols exist that enrich for organelles such as mitochondria, but some organelles have proven difficult to separate from cellular contaminants. For example, the nuclear envelope cannot be separated from the endoplasmic reticulum because of continuity between their membranes. A subtractive approach was used to circumvent this problem, taking advantage of the fact that a microsomal membrane fraction can be prepared that is rich in endoplasmic reticulum but free of nuclear envelopes. The two fractions were analyzed separately by MudPIT, and all proteins appearing in both fractions were subtracted from the list of nuclear envelope proteins, resulting in an in silico “pure” nuclear envelope fraction (Schirmer et al., 2003). This “subtractive” approach can be applied to any well-characterized subcellular fraction, greatly increasing the number of subcellular compartments and organelles whose protein profiles can be sampled by MudPIT.
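The subtraction step described above amounts to a set difference between two lists of protein identifications. The Python sketch below is a minimal illustration; the function name and the protein identifiers are hypothetical, and it assumes both MudPIT runs report proteins under the same identifier scheme.

```python
def subtract_fractions(target_ids, contaminant_ids):
    """Proteins identified in the target fraction but absent from the contaminant fraction."""
    return set(target_ids) - set(contaminant_ids)


# Hypothetical identifiers standing in for MudPIT protein identifications.
nuclear_envelope_run = ["protein_1", "protein_2", "protein_3", "protein_4"]
microsomal_membrane_run = ["protein_2", "protein_4", "protein_5"]

in_silico_pure = subtract_fractions(nuclear_envelope_run, microsomal_membrane_run)
print(sorted(in_silico_pure))
```

A plain set difference discards any protein genuinely shared by both compartments, so the resulting list is conservative by construction.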
Proteins in diseased tissue can also be altered by post-translational modifications. With the right approaches, many such modifications can be sampled comprehensively. For example, a recent study identified 1,075 candidate ubiquitinated proteins in yeast by expressing a tagged ubiquitin construct in the cells, purifying ubiquitinated proteins by their tag, and analyzing this fraction by MudPIT (Peng et al., 2003). A similar methodology could be used to analyze modified proteins in healthy vs. diseased tissue using an antibody specific for a post-translational modification of an amino acid (e.g., anti-phosphotyrosine).
Because of their relatively small protein complement, pathogens also are good substrates for MudPIT. A recent study compared the profiles of malarial proteins in all four stages of the parasite’s life cycle (Florens et al., 2002). Use of MudPIT enabled analysis of stages that require a human host with much less sample than DNA arrays require. The combined use of human, mosquito, and Plasmodium falciparum genome resources enabled parasite proteins to be distinguished from those of the host organism. Thus MudPIT can catalogue proteins in pathogens such as protozoa, bacteria, and viruses provided that their sequences (or those of similar organisms) appear in the databases. Clinically, disappearance of a protein in a new “resistant” strain could identify a mutated protein faster than genome sequencing.
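One simple way to picture how several genome resources separate parasite proteins from host proteins is to ask which organisms' predicted proteomes contain each identified peptide, and to keep only peptides unique to the parasite. The Python sketch below is an assumption-laden illustration, not the procedure used in the cited study: the function names, organism labels, and toy sequences are all hypothetical, and exact substring matching stands in for a real database search.

```python
def organisms_containing(peptide, proteomes):
    """Set of organism names whose proteome contains the peptide as an exact substring."""
    return {
        organism
        for organism, sequences in proteomes.items()
        if any(peptide in seq for seq in sequences)
    }


def parasite_specific(peptides, proteomes, parasite="parasite"):
    """Peptides found in the parasite proteome and in no other organism."""
    return [p for p in peptides if organisms_containing(p, proteomes) == {parasite}]


# Toy proteomes with made-up sequences (hypothetical, not real proteins).
toy_proteomes = {
    "parasite": ["MNKLVEELGK", "TAYIAKQR"],
    "human": ["GGTAYIAKQRSS"],
    "mosquito": ["MPPLLVK"],
}
identified_peptides = ["LVEELGK", "TAYIAK", "PLLVK", "QQQQ"]
print(parasite_specific(identified_peptides, toy_proteomes))
```

Peptides shared with the host or vector are dropped here because they cannot be assigned unambiguously to the parasite.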
A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases.
Sadygov RG, Yates JR III.
The Scripps Research Institute, La Jolla, CA, USA.
Anal Chem 75:3792-3798, Aug. 2003.
Summary: The authors developed a probabilistic model based on a hypergeometric distribution for determining the likelihood that a protein identification based on a sequence match to a tandem mass spectrum is valid. The model compares the distribution of frequencies for the 500 closest fragment ion matches in the database to the null hypothesis that the assignment is random. The smaller the probability that the match is random, the greater the confidence in the match. Comparing a test restricted to 6,200 yeast sequences with another utilizing 907,646 NRP sequences demonstrated that the effect of database size on the algorithm is limited. The program, called PEP_PROBE, is written in Java and can be run on a PC.
Nuclear membrane proteins with potential disease links found by subtractive proteomics.
Schirmer EC, Florens L, Guan T, Yates JR III, Gerace L.
The Scripps Research Institute, La Jolla, CA, USA.
Science 301:1380-1382, Sept. 2003.
Summary: A subtractive proteomics approach was applied to the nuclear envelope, from which the authors identified 67 novel putative transmembrane proteins that could potentially be mutated in at least thirteen human diseases (mostly dystrophies). Nuclear envelope targeting was confirmed in eight cloned and analyzed proteins. Each gene’s chromosomal location was determined and 23 were found to be located in large chromosome regions linked to several dystrophies, implicating them as potential disease genes.
A proteomics approach to understanding protein ubiquitination.
Peng J, Schwartz D, Elias JE, Thoreen CC, Cheng D, Marsischky G, Roelofs J, Finley D, Gygi SP.
Harvard Medical School, Boston, MA, USA.
Nat Biotechnol 21:921-926, Aug. 2003.
Summary: This report demonstrated that it is possible to utilize proteomics to identify protein modifications on a global scale. Yeast cells were modified to express a tagged ubiquitin construct. Proteins carrying the tag were purified from yeast lysates by affinity chromatography and analyzed by MudPIT. This revealed 1,075 potentially ubiquitin-conjugated proteins. Re-analysis of the tandem mass spectra by adding the mass of a single modified ubiquitin moiety to each predicted peptide in the database further confirmed the ubiquitination of 72 of these while identifying the site(s) of modification in the protein.
A proteomic view of the Plasmodium falciparum life cycle.
Florens L, et al.
Naval Medical Research Center, Silver Spring, MD, USA.
Nature 419:520-526, Oct. 2002.
Summary: The sensitivity of MudPIT enabled sampling of the protein composition at each of the four stages of the malaria parasite life cycle. Over 2,400 Plasmodium falciparum proteins were identified across these stages. Alignment of the corresponding genes on chromosomes demonstrated clustering of the genes activated in each stage, suggesting that targeting gene-regulation mechanisms might be an effective anti-malarial therapy. Additionally, the profile of var and rif splice variants expressed on the surface of infected cells differed among the four stages, implicating these as potential stage-specific drug targets.
[Discovery Medicine, 2(18):38-39, 2003]