Protein folding pregame

It's official: the much-anticipated research begins next week!

In this post I'll describe what that first baby step will be: local structure prediction. What I call "local structure" is a meld of two things that have been the subject of intense study: secondary structure, and tertiary contacts. But first a bit of background for non-specialists.

Proteins are polymers of identical peptide units, each carrying a side chain unit that comes in one of 20 varieties; these units are called amino acid residues, or simply residues. The peptide polymer, minus the side chains, is called the "backbone". Predicting the shape of the backbone, or "fold", given the sequence of residues, is the folding problem. An electrical engineer would look at this as an example of digital-to-analog coding: a digital residue sequence gets mapped to an analog set of positions in space (the fold). The backbone shape is completely specified by the positions of the set of α-carbon atoms, of which there is one per peptide. From the α-carbon positions it's relatively easy to get the rest of the protein structure by molecular simulation. My representation of protein structures will therefore always be at this slightly abstracted level: a linked chain of α-carbons.
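To make the "linked chain of α-carbons" concrete, here is a minimal sketch of one way such a representation could be stored. The class name `CaTrace`, the helper `extended_chain`, and the 3.8 Å spacing between consecutive α-carbons (a typical textbook value, not a number taken from this post) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: typical spacing of consecutive alpha-carbons, in angstroms.
CA_CA_DISTANCE = 3.8


@dataclass
class CaTrace:
    """Hypothetical container: one residue letter and one 3D point per alpha-carbon."""
    sequence: str                              # the "digital" input
    coords: List[Tuple[float, float, float]]   # the "analog" output (the fold)

    def __post_init__(self):
        if len(self.sequence) != len(self.coords):
            raise ValueError("need exactly one alpha-carbon position per residue")

    def link_lengths(self):
        """Distances between consecutive alpha-carbons along the chain."""
        return [math.dist(a, b) for a, b in zip(self.coords, self.coords[1:])]


def extended_chain(sequence):
    """Placeholder 'fold' with all alpha-carbons on a straight line."""
    coords = [(i * CA_CA_DISTANCE, 0.0, 0.0) for i in range(len(sequence))]
    return CaTrace(sequence, coords)


trace = extended_chain("MDAKA")
print(trace.link_lengths())
```

A real fold would replace the straight line with predicted coordinates, but the sequence-in, positions-out shape of the data stays the same.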

Secondary structure refers to the prevalence of relatively few local motifs in the backbone. The two most common secondary structure motifs are the alpha-helix and beta-sheet. Repeated alpha-helix motifs make the backbone twist into a helix; repeats of the beta-sheet motif form straight chains that then form sheets when lined up side by side.

The next level of protein structure, or tertiary structure, is often focused on another local property: contacts the backbone forms with itself. "Contact" is literally the molecular adjacency of chemical groups, usually the residues. Since proteins in their folded state are very compact, a large fraction of the residues make contacts with other residues; those that do not are in contact with the surrounding solvent instead. At my level of abstraction, each α-carbon either makes contact with some other α-carbon or, in the case of a solvent contact, avoids other α-carbons.

As far as I know, these two manifestations of local structure have been studied as separate prediction problems. For example, given the residue subsequence MDAKA (M=methionine, D=aspartic acid, A=alanine, K=lysine), a correct prediction (because this occurs in a known protein) is "alpha-helix" for the local secondary structure and "solvent" for the contact of the central A residue. Besides integrating these two parts of local structure prediction, there are a number of other things I propose that are a departure from standard practice. These changes address the technique I plan to use for satisfying all the constraints encoded in the local structures.

Now is probably a good time to say something about the folding algorithm. Suppose I wish to fold the protein 2YGS -- a kind of Sith Lord protein that, according to PDB (Protein Data Bank), has a role in "programmed cell death". 2YGS has residue sequence:

MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIP

From my local structure predictions I will have constraints on how each short piece of the backbone wants to bend and twist, and how it likes to make contact with other parts of the backbone. Even when all these predictions are correct, satisfying the corresponding constraints is not a walk in the park!

Choices arise in a number of ways. First, to ensure the local structure constraints are correct, many of the "predictions" will not be a unique local structure but a set of candidates. My secondary structure and contact prediction success will not be scored by the fraction I get correct, as is the usual practice, but by how few candidates I can get away with so the correct one is always among my choices. Another choice is the position along the backbone of the contact residue. Leucine (L), for example, appears 11 times in 2YGS; if the contact prediction for one of my subsequences is L, there are 11 choices for that particular constraint. Finally, secondary structure is analog and suffers from imprecision, the most severe being when there is a switch between helix and sheet domains. All of this seems very discouraging until you recall the Lilo puzzle of the previous post. There, despite the many choices on a local level, only one placement of the tiles satisfied all the constraints (four if you count rotations of the whole, but that ambiguity applies to proteins too).
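The count of contact choices is easy to check directly from the sequence. The sketch below lists every backbone position (numbered from 1) where a given one-letter residue code occurs in the 2YGS sequence quoted above; the function name `contact_candidates` is just an illustrative choice.

```python
# Residue sequence of 2YGS, copied from the post.
SEQ_2YGS = "MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIP"


def contact_candidates(sequence, residue):
    """1-based positions along the backbone where `residue` occurs."""
    return [i for i, r in enumerate(sequence, start=1) if r == residue]


positions = contact_candidates(SEQ_2YGS, "L")
print(len(positions), positions)
# 11 [9, 10, 16, 35, 56, 61, 75, 76, 83, 86, 87]
```

Each predicted contact with residue code L therefore leaves 11 positions for the constraint satisfaction step to choose among.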

The algorithm I will use to satisfy all the local constraints is called the difference map. Here is a movie I prepared a few years ago that shows how it folds 2YGS:

The green balls are the α-carbons. You can see that the secondary structure is mostly helices. This was probably an easy instance of protein folding, and in any case it falls short of a true demonstration of my method. The shortcoming is the fact that I didn't allow for the possibility that the inventory of local structures in the current PDB is incomplete.
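For readers who have not met the difference map, here is a toy sketch of the iteration on a problem that has nothing to do with proteins: find a point that lies both on the unit circle (constraint A) and on the line y = 0.5 (constraint B). It assumes the simplest form of the update, with the parameter β set to 1, where x is replaced by x + P_A(2 P_B(x) - x) - P_B(x) and P_A, P_B are projections to the nearest point satisfying each constraint. The two constraint sets, the starting point, and all function names are illustrative assumptions; the folding constraints themselves are far more elaborate.

```python
import math

LINE_HEIGHT = 0.5  # toy constraint B: all points with y == 0.5


def project_circle(p):
    """Nearest point on the unit circle (toy constraint A)."""
    x, y = p
    r = math.hypot(x, y)
    if r == 0.0:
        return (1.0, 0.0)  # every circle point is equally near; pick one
    return (x / r, y / r)


def project_line(p):
    """Nearest point on the line y == LINE_HEIGHT (toy constraint B)."""
    return (p[0], LINE_HEIGHT)


def difference_map_step(p):
    """One update x -> x + P_A(2 P_B(x) - x) - P_B(x), i.e. beta = 1."""
    pb = project_line(p)
    reflected = (2 * pb[0] - p[0], 2 * pb[1] - p[1])
    pa = project_circle(reflected)
    return (p[0] + pa[0] - pb[0], p[1] + pa[1] - pb[1])


def solve(start, iterations=200):
    p = start
    for _ in range(iterations):
        p = difference_map_step(p)
    # At a fixed point the two projections agree, so either one is the solution.
    return project_line(p)


solution = solve((0.3, 0.1))
print(solution)
```

The iterate itself is not required to satisfy either constraint along the way; only at a fixed point do the two projections coincide, and that common point satisfies both constraints at once.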

There are two ways in which the inventory of local structure motifs in the PDB is incomplete to some degree. First, it's very likely that the next protein we try to fold will have a residue subsequence (say of length 5 if that's how we define "local") that is nowhere to be found in the PDB. Second, even if we find examples of that subsequence in the PDB, it might be that in the new protein it adopts a different structure, or makes contact with a different residue.
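Both kinds of incompleteness show up in even the simplest lookup scheme. The sketch below slides a length-5 window along the 2YGS sequence and looks each subsequence up in a toy inventory. The inventory holds only the single MDAKA example quoted earlier in this post; a real one would be compiled from the PDB, and the names `KNOWN_LOCAL_STRUCTURES`, `windows` and `lookup` are hypothetical.

```python
# Residue sequence of 2YGS, copied from the post.
SEQ_2YGS = "MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIP"
WINDOW = 5  # assumed definition of "local"

# Toy inventory: subsequence -> set of (secondary structure, contact) candidates.
# Only the MDAKA entry comes from the post; everything else is deliberately absent.
KNOWN_LOCAL_STRUCTURES = {
    "MDAKA": {("alpha-helix", "solvent")},
}


def windows(sequence, size=WINDOW):
    """All contiguous subsequences of the given length, in order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def lookup(sequence, inventory):
    """Split windows into those with known candidates and those never seen."""
    found, missing = {}, []
    for w in windows(sequence):
        if w in inventory:
            found[w] = inventory[w]
        else:
            missing.append(w)
    return found, missing


found, missing = lookup(SEQ_2YGS, KNOWN_LOCAL_STRUCTURES)
print(len(windows(SEQ_2YGS)), "windows,", len(missing), "missing from the toy inventory")
```

Windows in `missing` are the first problem: there is nothing to look up. The second problem is hidden in the values of `found`: a candidate set built from known proteins may simply not contain the structure the subsequence adopts in a new one.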

So that's my first challenge: learning how to generalize from the local structure data in the current PDB so it can be applied to any protein that Nature has devised.