Archive for March, 2011

James Arvo wrote a note on sampling from the Haar measure on SO(3) using the following prescription.  Choose x1, x2, and x3 independently from uniform[0,1] and let R = [ cos(2 pi x1) sin(2 pi x1) 0; -sin(2 pi x1) cos(2 pi x1) 0; 0 0 1], let v = [ cos(2 pi x2) sqrt(x3); sin(2 pi x2) sqrt(x3); sqrt(1 - x3)], and let H = I - 2 v v^T.  Then M = -HR is distributed according to the Haar measure on SO(3) as a matrix.
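Arvo's prescription can be sketched directly in a few lines of NumPy; the function name `random_rotation` is my own choice, and the construction follows the formulas above.

```python
import numpy as np

def random_rotation(rng=None):
    """Sample a rotation matrix from the Haar measure on SO(3)
    using Arvo's prescription described above."""
    if rng is None:
        rng = np.random.default_rng()
    x1, x2, x3 = rng.uniform(0.0, 1.0, size=3)
    # R: rotation about the z-axis by angle 2*pi*x1.
    c, s = np.cos(2 * np.pi * x1), np.sin(2 * np.pi * x1)
    R = np.array([[  c,   s, 0.0],
                  [ -s,   c, 0.0],
                  [0.0, 0.0, 1.0]])
    # v: a point on the unit sphere; H = I - 2 v v^T is the
    # Householder reflection through the plane normal to v.
    v = np.array([np.cos(2 * np.pi * x2) * np.sqrt(x3),
                  np.sin(2 * np.pi * x2) * np.sqrt(x3),
                  np.sqrt(1.0 - x3)])
    H = np.eye(3) - 2.0 * np.outer(v, v)
    return -H @ R
```

Since H is a reflection (determinant -1), the sign flip in M = -HR restores determinant +1, so the result lands in SO(3) rather than O(3).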

Now suppose f(x) is a density with respect to the Haar measure on SO(3), bounded above by some constant M.  We can sample a candidate x from the uniform (Haar) distribution as above, choose u from uniform[0,1], and accept x as a sample from f if u < f(x)/M, rejecting it otherwise.  This is the standard rejection sampling method; the overall acceptance probability is 1/M, so a tight bound M is preferable.
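A minimal sketch of this rejection sampler, assuming the caller supplies the target density `f` and an upper bound `f_max` on it; the Haar-uniform proposal repeats Arvo's construction so the snippet stands alone.

```python
import numpy as np

def haar_rotation(rng):
    """Haar-uniform rotation on SO(3) via Arvo's construction."""
    x1, x2, x3 = rng.uniform(0.0, 1.0, size=3)
    c, s = np.cos(2 * np.pi * x1), np.sin(2 * np.pi * x1)
    R = np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])
    v = np.array([np.cos(2 * np.pi * x2) * np.sqrt(x3),
                  np.sin(2 * np.pi * x2) * np.sqrt(x3),
                  np.sqrt(1.0 - x3)])
    return -(np.eye(3) - 2.0 * np.outer(v, v)) @ R

def rejection_sample(f, f_max, rng=None):
    """One draw from a density f (w.r.t. Haar measure on SO(3)):
    propose a Haar-uniform x, accept with probability f(x)/f_max,
    where f(x) <= f_max for all x.  f and f_max are supplied by
    the caller."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        x = haar_rotation(rng)
        if rng.uniform() < f(x) / f_max:
            return x
```

As an illustrative target, f(R) = (3 + tr R)/3 is a valid density with respect to normalized Haar measure (the trace integrates to zero), bounded above by 2, so `rejection_sample(lambda R: (3 + np.trace(R)) / 3.0, 2.0)` draws from it.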

Given a sample x1, …, xN of data points on SO(3), we can estimate f(x) using a variety of density estimation techniques: for example, Pelletier’s kernel density estimation, which gives a smooth density estimate from the data, or a maximum entropy density.  The Pelletier estimator, evaluated at a given x, uses the Riemannian distance from x to each of the data points.
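A simplified sketch of this kind of estimator follows.  The Riemannian distance on SO(3) is the rotation angle of R1^T R2; the kernel and bandwidth below are illustrative choices, and the volume-density correction term of Pelletier's full estimator is omitted here for brevity, so this is only an approximation of his construction.

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Riemannian distance on SO(3): the rotation angle of R1^T R2,
    recovered from its trace."""
    c = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.arccos(np.clip(c, -1.0, 1.0))

def kde(x, data, h=0.5):
    """Kernel density estimate at x from data points on SO(3):
    average of a compactly supported (Epanechnikov-type) kernel
    of the geodesic distances, scaled by bandwidth h.
    Pelletier's volume-density correction is omitted."""
    d = np.array([geodesic_distance(x, Ri) for Ri in data]) / h
    k = np.where(d < 1.0, 0.75 * (1.0 - d**2), 0.0)
    return k.mean() / h**3   # 3 = dimension of SO(3)
```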

The data points of interest to us are the relative twists, pivoted on amino acids, that arise from the twist decomposition of approximately 30,000 protein structures.  The above considerations can be used to produce statistical models of twist sequences for proteins.

Read Full Post »

In a note in early February I posted some results showing that if we take folded protein shapes from the structure database, pick out the positions of the nitrogen atoms along the protein backbone, and decompose each shape into a sequence of SO(3) twists, ignoring scaling differences, then these twists are highly concentrated in the space of rotations SO(3).  This twist decomposition of proteins provides a natural arena for applying statistical methods that have previously been used successfully in statistical machine translation and speech recognition.

Two related methods used in statistical machine translation, for example, are maximum likelihood and maximum entropy.  The simple idea is as follows:  quite generally, we have a sequence of data points (x1, y1), …, (xN, yN), where the y are output variables and the x are “influence” variables.  In machine translation from French to English, for example, x could represent French words and y the corresponding English words.  In the case of the protein folding problem, y will represent twists in SO(3) and x the influencing factors, for example specific amino acids and previous twists.

We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f.  We call such a function a feature function.  We can summarize the training sample in terms of its empirical probability distribution EP, which is

EP(x, y) = (1/N) × (number of times (x, y) occurs in the sample).
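This is just a normalized count of pairs; a minimal sketch, where the representation of (x, y) pairs as tuples is an illustrative choice:

```python
from collections import Counter

def empirical_distribution(sample):
    """EP(x, y): the fraction of times each (x, y) pair occurs
    in the training sample, given as a list of pairs."""
    counts = Counter(sample)
    n = len(sample)
    return {pair: c / n for pair, c in counts.items()}
```

For example, `empirical_distribution([("le", "the"), ("le", "the"), ("chat", "cat")])` assigns probability 2/3 to ("le", "the") and 1/3 to ("chat", "cat").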

The entropy is a measure of the uniformity of a probability distribution, and maximizing the entropy

H(p) = - sum over x,y of EP(x) P(y | x) log P(y | x)

among conditional distributions consistent with the sample gives a way of achieving an “Ockham’s razor” probability density that fits the data.  This description is quite abstract; parametric families of probability densities admit specific techniques for maximizing entropy in practice, as described, for example, in “A Maximum Entropy Approach to Natural Language Processing,” Berger, Della Pietra and Della Pietra 1996.
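The conditional entropy H(p) = - sum_{x,y} EP(x) P(y|x) log P(y|x) is straightforward to evaluate once EP and the model P(y|x) are in hand; a minimal sketch, with dictionaries as an illustrative representation:

```python
import math

def conditional_entropy(ep_x, p_y_given_x):
    """H(p) = - sum_{x,y} EP(x) P(y|x) log P(y|x), the quantity
    maximized in the maximum-entropy approach.  ep_x maps each x
    to its empirical probability EP(x); p_y_given_x maps each x
    to a dict of conditional probabilities P(y|x)."""
    h = 0.0
    for x, px in ep_x.items():
        for y, pyx in p_y_given_x[x].items():
            if pyx > 0.0:   # convention: 0 log 0 = 0
                h -= px * pyx * math.log(pyx)
    return h
```

A uniform conditional distribution over two outputs gives the maximal value log 2, consistent with entropy measuring uniformity.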


Read Full Post »