On the eve of writing code, I search (the) void for a name

After three months of build-up, culminating in a simple and apparently reliable scheme for representing local structure constraints, I find I’m dragging my feet more than usual in setting down the first line of code. It’s probably not the complexity of the task in front of me, but the fact that the stakes are higher. I have enough experience with the constraint satisfaction algorithm that, if the code fails to fold proteins, it will most likely be the local structure scheme that needs revising, or complete scrapping.

I’ve had a fairly clear picture of the folding algorithm all the time I was working out the structure scheme. No, my struggles have not been so much about that as about what the solver would be called. To my relief, the name presented itself this morning in the form of a musical phrase. The primary sequence of a protein is like a musical score, a linear sequence of residue “motifs” that, when properly folded, produce a masterpiece of biological music. And because my four-residue motifs are simpler even than Wagner’s, I will call them “litemotifs”. My (mostly blank) source code file at least now has a name: litemotif.c.


Another development this week concerns the database. Up to now I’ve just been retrieving small pieces of the PDB, and it looked like extracting every bit of local structure information from this resource would be a major undertaking. In this I was proved wrong by Alex Alemi.

Monday morning I invited a few people familiar with “big data”, Alex among them, to throw around ideas on constructing the litemotif library. At one extreme was the proposal to write software that would build just the small library relevant for the particular primary sequence under consideration. That would mean accessing the PDB every time one wished to fold a new protein. At the other extreme, one builds a single litemotif library that would serve any primary sequence, obviating future access to the PDB (aside from upgrades as the PDB grows). Alex quickly convinced us that the second option was feasible, proposing a combination of Python code to process PDB files and something called SQLite for managing the database. To prove his point, over the course of the day he wrote all the software after getting a crash course on local structures from me. His code finished running overnight, and Tuesday morning a 37 GB litemotif library was patiently waiting to serve up the secrets of protein structure!
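I don’t have Alex’s code to show, but the shape of the pipeline is easy to sketch. The fragment below is my own toy illustration, not his script; the file layout, the table schema, and the decision to keep only alpha-carbon coordinates are all assumptions I’m making for the sake of a short example. It scans a directory of PDB files, slides a four-residue window along each chain, and writes every window into an SQLite table.

```python
# Toy sketch of a litemotif-library builder (illustrative only).
# Assumptions: PDB files live in pdb/, the table schema is my own invention,
# and only alpha-carbon (CA) coordinates are kept.
import glob
import sqlite3

def ca_coordinates(pdb_path):
    """Yield (residue_name, x, y, z) for each CA atom, in file order."""
    with open(pdb_path) as f:
        for line in f:
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                yield (line[17:20].strip(),
                       float(line[30:38]),
                       float(line[38:46]),
                       float(line[46:54]))

def litemotifs(pdb_path, size=4):
    """Slide a window of `size` consecutive residues along the chain and
    yield one row per litemotif: (source, position, sequence, coordinates)."""
    cas = list(ca_coordinates(pdb_path))
    for i in range(len(cas) - size + 1):
        window = cas[i:i + size]
        sequence = "-".join(res for res, _, _, _ in window)
        coords = [c for _, x, y, z in window for c in (x, y, z)]
        yield (pdb_path, i, sequence, repr(coords))

conn = sqlite3.connect("litemotifs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS litemotif
                (source TEXT, position INTEGER, sequence TEXT, coords TEXT)""")
for path in glob.glob("pdb/*.pdb"):
    conn.executemany("INSERT INTO litemotif VALUES (?, ?, ?, ?)", litemotifs(path))
conn.commit()
conn.close()
```

The real pipeline has to cope with chain breaks, multiple chains, alternate conformations, and all the other quirks of actual PDB files, which this toy version cheerfully ignores.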


The folding algorithm will follow the standard divide-and-concur construction. This approach was described previously for the case of sudoku. There we have a 9x9x9 “building” of indicator variables, where a 1 on “floor” z, in “cubicle” [x,y], corresponds to the number z at position [x,y] on the sudoku grid. The constraints are simple: just a single 1 (the rest 0) in every vertical column and in every horizontal “corridor” (along x and along y, on every floor). There is a fourth single-1 constraint that applies to every one of the nine “rooms” of 3x3 cubicles on every floor.
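To make the indicator representation concrete, here is a small numpy sketch of my own (not code from litemotif.c) that builds the 9x9x9 building from an ordinary sudoku grid, with 0 marking a blank cell.

```python
import numpy as np

def grid_to_building(grid):
    """Convert a 9x9 sudoku grid (0 = blank) into the 9x9x9 indicator
    'building': building[x, y, z] = 1 means the digit z+1 sits at [x, y]."""
    building = np.zeros((9, 9, 9))
    for x in range(9):
        for y in range(9):
            if grid[x][y] != 0:
                building[x, y, grid[x][y] - 1] = 1.0
    return building

# In this notation, a solved puzzle satisfies the four single-1 constraints:
#   building.sum(axis=2) == 1   vertical columns: one digit per cubicle
#   building.sum(axis=0) == 1   x-corridors: each digit once along x, per floor
#   building.sum(axis=1) == 1   y-corridors: each digit once along y, per floor
#   and a sum of 1 over every 3x3 "room" of cubicles on every floor
```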

“Projecting” the sudoku building to any one of the four single-1 constraints is easy. On input, the 9x9x9 variables can be arbitrary real numbers. For the vertical column constraint, each of the 9x9 columns is examined and the largest number is located and replaced by 1; the other numbers in the column are set to 0. This is a “projection” because the replacement rule minimizes the distance between input and output, when these are interpreted as real-valued vectors.
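In numpy this projection takes only a few lines. The sketch below is my own illustration (not the implementation going into litemotif.c) and handles only the vertical-column constraint; the corridor and room projections follow the same pattern, with the maximum taken along x, along y, or within each 3x3 room.

```python
import numpy as np

def project_columns(building):
    """Project onto the vertical-column constraint: in each of the 9x9
    columns along z, place a 1 at the largest entry and 0 everywhere else."""
    out = np.zeros_like(building)
    z_max = np.argmax(building, axis=2)   # winning floor in each column
    x, y = np.indices((9, 9))             # cubicle coordinates
    out[x, y, z_max] = 1.0
    return out
```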

What makes sudoku hard is that the four types of projections are not compatible. A vertical column projection will in general have a different outcome than an x-corridor projection, and composing them creates tie-breaking issues. Divide and concur deals with this problem by replicating the variables, so there is one set per constraint. In the case of sudoku, the solver will work with four buildings of variables. There is a meta-projector of sorts, PD, where each of the four buildings is projected according to its constraint (vertical columns, x-corridors, etc.). This is the “divide” projector.
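In code, the divide projector is little more than bookkeeping once the individual constraint projections exist. Here is a sketch, again my own: the state is a list of four replica buildings, and each is handed to the projection for its own constraint. project_columns is the function above; the other three names in the usage comment are hypothetical stand-ins for the analogous routines.

```python
def project_divide(replicas, projections):
    """Divide projector P_D: project each replica building onto its own
    constraint (vertical columns, x-corridors, y-corridors, rooms)."""
    return [project(replica) for project, replica in zip(projections, replicas)]

# Hypothetical usage:
# projections = [project_columns, project_x_corridors,
#                project_y_corridors, project_rooms]
# replicas = project_divide(replicas, projections)
```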

Another meta-projector, PC, ensures that the variables in all four buildings agree. As a projector, PC replaces all four “replicas” of the building variable at [x,y,z] by the same value. The distance-minimizing value for the replacement is simply the average of the four replicas. The output of the “concur” projector is four identical buildings of variables. Now the values are not just 1 and 0, but real numbers that reflect “a tendency toward 1” or “a tendency toward 0”.
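The concur projector is simpler still; a sketch in the same style:

```python
import numpy as np

def project_concur(replicas):
    """Concur projector P_C: replace every replica by the average over all
    four, so the output is four identical, real-valued buildings."""
    mean = np.mean(replicas, axis=0)      # average over the replica index
    return [mean.copy() for _ in replicas]
```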

Although computing PC and PD is very easy, individually they do not solve the sudoku puzzle. The magic sauce is the iterative scheme for combining these projectors to find a set of variables that is unchanged both by PC and PD. Such a set of variables would agree (concur) on all four buildings, and satisfy the (divided) constraint on each building. An iteration scheme that works well for very diverse constraint satisfaction problems will be the subject of a future post. It is through this general method that I propose to apply to proteins the same brand of logic that solves sudoku puzzles.

I promised last week to comment in this post on the possible failure mechanisms that threaten my folding algorithm (once it is written). This will have to wait until next week, when I explain the basics of the divide-and-concur implementation.