How does mass spectrometry-based "shotgun proteomics" peptide sequencing work?

We get a lot of questions about how peptide sequencing works in a shotgun proteomics assay. Does it work like Edman degradation, proceeding from the N-terminus? Or is it more like DNA sequencing where the strand is sheared into random fragments, sequenced and assembled? Do you need a sequence library? What's the difference between this and de novo peptide sequencing? For that matter, what's the difference between a peptide and a protein?

That's a lot of ground to cover, so let's lay out the basic assumptions behind the phrase "shotgun proteomics."

  1. A mass spectrometer is being used to perform the measurements.
  2. A sample, consisting of anywhere from a single protein to the entire protein complement of a complex sample (think microbiome), is digested using a proteolytic enzyme to cleave the proteins at specific amino acid residues.
  3. There is a partial to complete genome sequence for the organism(s) being analyzed (RNAseq data will work, too).

A mass spectrometer measures mass. So why not just measure the protein masses? Why digest? To make the proteins bite-sized, as it were. There are three primary performance considerations driving the use of peptides instead of proteins, and of a specific enzyme to generate those peptides.

  1. High-performance liquid chromatography (HPLC or, more commonly nowadays, ultra-high-performance liquid chromatography, UPLC)
  2. Electrospray ionization
  3. Mass spectrometry

To measure the proteins in a sample you need to get those proteins into the mass spectrometer, obviously. But how? Proteins, relative to peptides, do not separate as well by UPLC, do not ionize as well and are not as easily measured or fragmented by the mass spectrometer. So, peptides are used instead. Specifically, tryptic peptides, derived from a digestion of the protein sample using trypsin. Preferably, sequencing-grade trypsin free of chymotrypsin contamination, which would generate a different set of peptides. Trypsin generates peptides of about 10 amino acid residues in length by cleaving C-terminal to arginine and lysine residues. These peptides average a bit over 1,000 Da in mass. Crucially, at the low pH used in the UPLC mobile phase, they will typically carry two positive charges (aka doubly charged), one at each end of the peptide, by virtue of the basic groups at the peptide N-terminus (an amine) and on the C-terminal arginine or lysine side chain. This aids ionization and sequencing, as we'll see in a bit.
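To make the cleavage rule concrete, here is a minimal Python sketch of a tryptic digest. It assumes perfect cleavage after every arginine (R) and lysine (K) with no missed cleavages, and it ignores exceptions to the rule that real search tools usually handle. The function name `tryptic_digest` and the toy sequence are made up for illustration.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Cut a protein sequence C-terminal to every K or R.

    Simplified sketch: assumes complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            # The K/R stays on the C-terminal end of the peptide.
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever follows the last K/R is the protein's C-terminal peptide.
        peptides.append(protein[start:])
    return peptides


toy_protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"  # hypothetical sequence
print(tryptic_digest(toy_protein))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R', 'DTHK']
```

Every peptide except possibly the last ends in K or R, which is what puts a basic side chain at the C-terminal end.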

UPLC can resolve peptides at high efficiency, with long columns packed with 2 micron particles yielding peak widths measured in seconds for multi-hour separations. This allows the peptides to be fed into the mass spectrometer at a rate compatible with the acquisition speed of the instrument. The mass spectrometer separates along a different dimension - mass, or more specifically, mass divided by charge (m/z). That "average" peptide with two positive charges and a mass of about 1,000 Da will be measured by the mass spectrometer at about 500 m/z, ideally situated in the measurable mass range of the instrument. The very high resolution of modern mass spectrometers, together with the high resolution UPLC separation, enables the detection and measurement of well over 100,000 peptide species in a typical assay of a complex protein sample. But how many can we sequence? And, how are they sequenced? That's the point of this post, right?
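The "about 500 m/z" figure is simple arithmetic, sketched below. The only assumption is that each positive charge comes from one added proton (about 1.00728 Da), which is the usual picture for electrospray in positive mode. The function name `mz` is ours.

```python
PROTON_MASS = 1.007276  # Da, mass of one proton


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of a peptide carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge


for z in (1, 2, 3):
    print(z, round(mz(1000.0, z), 3))
# 1 1001.007
# 2 501.007
# 3 334.341
```

The same 1,000 Da peptide shows up at a different m/z for each charge state, which is why the charge has to be known before an m/z can be turned back into a mass.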

Now we have all of the proteins in the mass spectrometer for measurement in the form of their constituent peptides. But one of the drawbacks (yes, there is more than one) of measuring peptides is that there are a lot more of them than there are proteins (which is another interesting topic altogether[1]). And mass spectrometers can only acquire so many peptides at a time for sequencing. This number has been steadily improving and our new Q-Exactive HF-X can acquire 40 ions per second for sequencing and identify peptides at a rate of over 1,000 per minute! [2] That gets us about 30,000 peptides over the course of a 1-hour assay, still far short of the over 100,000 detectable peptides. An extensive fractionation of the sample (providing yet another peptide separation dimension) allows identification of over 100,000 peptides, using a bit over a day of instrument time. That is an amazing achievement, but a day is a long time to spend assaying a single sample.

Now that we know why peptides are being sequenced, let's get into the how.

In the traditional data-dependent acquisition mode (we'll cover data-independent acquisition in another post) the mass spectrometer acquires a packet of ions in a discrete scan event and measures the observable masses. From the detected masses the instrument picks out the most intense peptide ion and isolates it for fragmentation. In an Orbitrap mass spectrometer, fragmentation is performed by collision with nitrogen gas. The isolated peptide ion (which is actually a large population of molecules of that peptide ion) is accelerated by an applied voltage into a cell filled with nitrogen. Collisions with the nitrogen molecules break the peptide ions apart into fragments. The trick is getting the fragmentation tuned just so, applying enough energy so that you actually fragment the peptide ion, but not so much that you end up with single amino acid residues, which could at best tell you the amino acid composition of that peptide, not the sequence. A properly tuned fragmentation will yield a population of "daughter" fragment ions cleaved at each peptide bond, so that every possible truncation of the parent sequence, from either end, is represented. The m/z read-out of this event is called a tandem mass spectrum. The fragmentation event is referred to as an MS/MS scan.
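The "pick the most intense ion" step can be sketched in a few lines of Python. This is an illustration only, not instrument firmware: the function name `pick_precursors`, the `top_n` parameter and the example peak list are all made up, and real instruments apply further rules (for example, skipping ions they fragmented a moment ago).

```python
def pick_precursors(survey_scan, top_n=1):
    """Return the m/z values of the `top_n` most intense peaks.

    `survey_scan` is a list of (mz, intensity) tuples from one survey scan.
    """
    ranked = sorted(survey_scan, key=lambda peak: peak[1], reverse=True)
    return [peak_mz for peak_mz, _intensity in ranked[:top_n]]


# Hypothetical survey scan: (m/z, intensity)
survey = [(445.12, 2.0e5), (501.01, 9.0e6), (622.30, 4.0e6)]
print(pick_precursors(survey, top_n=2))
# [501.01, 622.3]
```

Each selected m/z would then be isolated and fragmented in its own MS/MS scan before the instrument returns to take the next survey scan.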

These daughter ions have their own nomenclature,[3] but we'll just focus on b and y ions. b ions are fragments of the parent peptide originating from the N-terminus. So, for a peptide of length n, the first amino acid is b1, the first and second amino acids are b2, the first, second and third amino acids are b3, and so on, all the way up to the full peptide sequence minus the last amino acid, b(n-1). y ions are fragments of the parent peptide originating from the C-terminus. So, the last amino acid is y1, the last and the second-to-last amino acids are y2, and so on, all the way up to the full peptide sequence minus the first amino acid, y(n-1). This gives two ladders of fragment ion masses. Sequencing is then as simple as walking up each ladder and measuring the difference between the "rungs" to determine which amino acid is at each position. This is termed de novo sequencing.
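Here is a minimal Python sketch of the two ladders and of "walking the rungs." It assumes an ideal spectrum of singly charged, unmodified fragments and uses standard monoisotopic residue masses. The function names and the toy peptide `SAMPDEK` are ours.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def b_ions(peptide):
    """Singly charged b1..b(n-1): prefixes growing from the N-terminus."""
    total, ions = 0.0, []
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ions.append(total + PROTON)
    return ions


def y_ions(peptide):
    """Singly charged y1..y(n-1): suffixes growing from the C-terminus."""
    total, ions = 0.0, []
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        ions.append(total + WATER + PROTON)
    return ions


def read_ladder(ions, tol=0.01):
    """Name the residue(s) matching the gap between neighbouring rungs."""
    calls = []
    for low, high in zip(ions, ions[1:]):
        gap = high - low
        hits = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        calls.append("/".join(hits) if hits else "?")
    return calls


peptide = "SAMPDEK"  # hypothetical toy peptide
print(read_ladder(b_ions(peptide)))  # ['A', 'M', 'P', 'D', 'E']
print(read_ladder(y_ions(peptide)))  # ['E', 'D', 'P', 'M', 'A']
```

The b ladder reads the sequence forward and the y ladder reads it backward. Leucine and isoleucine have identical masses, so a gap of 113.084 Da comes back as `L/I`: the ladder alone cannot tell them apart.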

Of course, reality is rarely so simple. Which ladder is which and how can you tell them apart? Also, fragmentations are rarely ideal. Some amino acids, or strings of amino acids, fragment less well than others. Fragmentations can occur at other positions within the peptide, even cleaving off side chains. There are water and ammonia losses, potential amino acid modifications (such as acetylations and phosphorylations) and other complications that change the mass of the amino acid residues and make sequencing a challenge. Not least among the challenges: who is going to sit and interpret all of these tandem mass spectra, generated at a rate that can exceed 100,000 per hour (at the 40 per second noted above)? That's where protein sequence libraries enter the picture.

Recall the third assumption of shotgun proteomic peptide sequencing: a genome sequence is available for the organism(s) being studied. Using the protein translation of the genome, a search algorithm performs an in silico digest of the protein sequence entries. In the case of trypsin, the algorithm will create a list of all peptides with a C-terminal arginine or lysine and an N-terminus of the amino acid residue immediately following the preceding arginine or lysine residue. For a given tandem mass spectrum, the algorithm looks at the mass of the peptide parent ion which was acquired for fragmentation and, from the list of possible tryptic peptides, selects those with masses within a set tolerance of the measured parent ion mass. From this subset with similar parent ion masses, the algorithm generates theoretical fragment mass spectra for each peptide. Then, the algorithm looks for the best match between these theoretical spectra and the observed, experimental tandem mass spectrum. In general, the best match "wins." This approach was first put into practice in the SEQUEST algorithm, [4] developed by John Yates and Jimmy Eng (who later developed the Comet algorithm [5]).
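The whole search loop fits in a short Python sketch. To be clear about what is assumed: the scoring here is a plain count of matching peaks, not the cross-correlation score SEQUEST uses. The digest allows no missed cleavages, fragments are singly charged b and y ions only, and every name, tolerance and sequence below is made up for illustration.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276


def digest(protein):
    """In silico tryptic digest: cut after every K or R."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def fragments(peptide):
    """Theoretical singly charged b and y ion m/z values."""
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[:i]) + PROTON)
        ions.append(sum(RESIDUE_MASS[aa] for aa in peptide[i:]) + WATER + PROTON)
    return ions


def search(precursor_mass, observed, proteins, precursor_tol=0.02, fragment_tol=0.02):
    """Return (best_peptide, score); score = number of theoretical peaks observed."""
    best = (None, -1)
    for protein in proteins:
        for peptide in digest(protein):
            # Step 1: keep only peptides close to the measured parent mass.
            if abs(peptide_mass(peptide) - precursor_mass) > precursor_tol:
                continue
            # Step 2: count theoretical fragments found in the spectrum.
            score = sum(
                any(abs(theo - obs) <= fragment_tol for obs in observed)
                for theo in fragments(peptide)
            )
            if score > best[1]:
                best = (peptide, score)
    return best


# Hypothetical two-protein "genome"; SAMPEDK and SAMPDEK have the same mass.
proteins = ["AAKSAMPEDKCC", "MKSAMPDEKGASPVTR"]
# Simulated imperfect spectrum of SAMPDEK: two theoretical peaks missing.
observed = fragments("SAMPDEK")[2:]
print(search(peptide_mass("SAMPDEK"), observed, proteins))
# ('SAMPDEK', 10)
```

The parent mass alone cannot separate `SAMPDEK` from `SAMPEDK`, since they have the same composition. Only the fragment ions that span the swapped D and E differ, and those are what tip the score.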

This sequence library-based method effectively is shotgun proteomics. At least, until recently. Other shotgun-based methods, such as de novo sequencing and spectral-library-based matching, are not nearly as widespread. And spectral libraries are derived directly from sequence library-based data. But spectral libraries have enabled a new type of shotgun proteomics, based on data-independent acquisition, which looks likely to eventually overtake data-dependent acquisition for many applications.

  1. How many human proteoforms are there?

  2. Performance evaluation of the Q Exactive HF-X for shotgun proteomics

  3. Proposal for a common nomenclature for sequence ions in mass spectra of peptides

  4. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database

  5. Comet: an open-source MS/MS sequence database search tool