
Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations








  1. Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations Fabian Schwarzer Itay Lotan

  2. Motivation
  • SRS (Stochastic Roadmap Simulation)
    • Sample conformations
    • Create edges between “neighboring” conformations
  • Ab-initio structure prediction
    • Generate a large decoy set
    • Cluster based on similarity
  When the number of conformations is large, finding neighboring (similar) conformations is costly.

  3. Similarity Measures
  • Given the backbone Cα atom positions of two conformations – how similar are they?
  • Hard to define when comparing two different proteins
  • Straightforward when comparing two conformations of the same protein

  4. Similarity Measures
  • We are interested in comparing conformations of the same protein
  • Hence there is a trivial correspondence between the two point sets
  • The two most common measures are:
    • cRMS deviation
    • dRMS deviation

  5. cRMS
  $\mathrm{cRMS}(P,Q) = \min_T \sqrt{\frac{1}{n}\sum_{i=1}^{n}\lVert p_i - T(q_i)\rVert^2}$, where T is the rigid-body transform that optimally aligns P and Q
  • cRMS is a metric, but the space is not Euclidean
  • There is a closed-form solution for T
  • Complexity is linear in the number of points (plus a 4×4 eigenvector computation)
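The closed-form alignment can also be computed with the SVD-based Kabsch method rather than the 4×4 quaternion eigenproblem mentioned on the slide; both give the same optimal T. A minimal numpy sketch (my own naming, not the authors' code):

```python
import numpy as np

def crms(P, Q):
    """cRMS between two n-by-3 C-alpha coordinate arrays of the same protein."""
    # Center both point sets; the optimal transform always aligns centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the 3x3 covariance matrix.
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T   # rotation taking q_i onto p_i
    # RMS of the residual deviations after optimal alignment.
    diff = Pc - Qc @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))
```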

  6. dRMS
  • A metric over a Euclidean space
  • Complexity is quadratic in the number of points (size of the protein)
  D is the internal-distance matrix, $D_{ij} = \lVert p_i - p_j\rVert$, and $\mathrm{dRMS}(P,Q) = \frac{1}{n}\lVert D_P - D_Q\rVert_F$
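A sketch of dRMS as defined above; the exact normalization convention is my assumption about a detail the slide's formula image did not survive to show:

```python
import numpy as np
from scipy.spatial.distance import pdist

def drms(P, Q):
    """dRMS between two n-by-3 coordinate arrays of the same protein.

    Compares the internal (pairwise) distance matrices; no alignment
    step is needed, but the cost is quadratic in n.
    """
    dP = pdist(P)   # condensed vector of ||p_i - p_j|| for all i < j
    dQ = pdist(Q)
    return np.sqrt(((dP - dQ) ** 2).mean())
```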

  7. k Nearest Neighbors
  • Find the k nearest neighbors of every conformation in the set
  • Currently the fastest algorithm in practice for high dimensionality is brute force:
    For each conformation q in the set:
      Compute the distance from q to all other conformations
      Keep the k nearest conformations
  • Complexity is O(n² log k)
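The brute-force baseline as a sketch; `dist` stands for either measure (the crms or drms functions above), and argpartition plays the role of the bounded size-k heap behind the O(n² log k) figure:

```python
import numpy as np

def brute_force_knn(dist, X, k):
    """For each conformation in X, indices of its k nearest others.

    dist : callable on two conformations (e.g. crms or drms above)
    X    : sequence of n conformations
    Costs O(n^2) distance evaluations overall.
    """
    n = len(X)
    nn = np.empty((n, k), dtype=int)
    for i in range(n):
        d = np.array([dist(X[i], X[j]) if j != i else np.inf
                      for j in range(n)])
        idx = np.argpartition(d, k)[:k]   # the k smallest, unordered
        nn[i] = idx[np.argsort(d[idx])]   # sort those k by distance
    return nn
```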

  8. k Nearest Neighbors
  • The literature has a number of efficient nearest-neighbor algorithms; kd-trees are the most prevalent
  • We cannot use these algorithms directly:
    • cRMS does not live in a Euclidean space, which kd-trees require
    • dRMS is Euclidean, but its dimensionality is too high for kd-trees to be efficient
  We reduce the dimensionality of dRMS to make kd-trees applicable.

  9. Uniform Simplification
  • Cut the sequence into m equal subsequences
  • Average the coordinates of the Cα atoms in each subsequence
  • Use the averaged coordinates a_i when computing cRMS and dRMS
  [Figure: a backbone chain with its m averaged points a_0 … a_m]
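A sketch of the averaging step (np.array_split tolerates chain lengths not divisible by m, with pieces differing by one residue; how the authors handled remainders is not stated on the slide):

```python
import numpy as np

def simplify(P, m):
    """Average C-alpha coordinates within m roughly equal subsequences.

    P : n-by-3 array of C-alpha positions along the chain
    Returns an m-by-3 array of averaged points a_0 .. a_{m-1}.
    """
    pieces = np.array_split(P, m)   # m contiguous subsequences
    return np.array([piece.mean(axis=0) for piece in pieces])
```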

  10. Uniform Simplification – Results
  • There is a high correlation between the full and the averaged representations under both cRMS and dRMS:
    • Proteins with 60–75 AA: r > 0.95 for m > 12
    • Protein with 374 AA: r > 0.95 for m > 16
  Even with m = 12, the internal-distance matrix used by dRMS has 66 distinct entries (12·11/2), still too high-dimensional for a kd-tree. Further reduction is needed.
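The correlation test as I read this slide: Pearson r between full dRMS and simplified dRMS over random conformation pairs, reusing the drms and simplify sketches above (the pair-sampling scheme and sample size are my assumptions):

```python
import numpy as np

def averaging_correlation(confs, m, n_pairs=10000, seed=0):
    """Pearson r between full dRMS and dRMS on the m-point
    simplified representation, over random conformation pairs."""
    rng = np.random.default_rng(seed)
    full, reduced = [], []
    for _ in range(n_pairs):
        i, j = rng.choice(len(confs), size=2, replace=False)
        full.append(drms(confs[i], confs[j]))
        reduced.append(drms(simplify(confs[i], m), simplify(confs[j], m)))
    return np.corrcoef(full, reduced)[0, 1]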

  11. Proteins
  • 4PTI (58 AA)
  • 1CTF (68 AA)
  • 1R69 (63 AA)
  • 1HTB (374 AA)

  12. Further Reduction using SVD
  • We apply SVD to the reduced distance matrices, stacked as vectors
  • We project the reduced matrices onto the most important singular vectors to further reduce the dimensionality
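A sketch of the projection, again reusing the simplify sketch; centering before the SVD and the choice of d = 20 (taken from the next slide) are assumptions about details the slide leaves out:

```python
import numpy as np
from scipy.spatial.distance import pdist

def svd_reduce(confs, m=12, d=20):
    """Project each conformation's simplified internal-distance vector
    onto the top d right singular vectors of the stacked data matrix."""
    # One row per conformation: the m(m-1)/2 averaged internal distances
    # (66 entries for m = 12, matching slide 10).
    A = np.array([pdist(simplify(P, m)) for P in confs])
    A -= A.mean(axis=0)                       # center before the SVD
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:d].T                       # n-by-d reduced coordinates
```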

  13. Further Reduction – Results
  • Averaging before building the internal-distance vector is what makes the SVD feasible
  • For proteins with 60–75 AA, dRMS using only 20 parameters was highly correlated (r > 0.90) with dRMS on the full representation
  • 20 dimensions is not too many for kd-trees

  14. Finding k Nearest Neighbors
  • We tested the actual ability of the reduced representation to find NNs
  • On decoy sets, 80 of the 100 true NNs (under dRMS) were found using the reduced representation
  • Results are better (90 of 100) when the data set contains uniformly sampled conformations
  • The maximal relative error was 10%–20% (0.5 Å – 1.5 Å)
  • The average relative error was < 5%

  15. Using kd-trees
  • We used the ANN implementation (the UMD kd-tree library)
  • The data set contained 100,000 conformations
  • We want to find the 100 NNs of each conformation
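The slides used the ANN C++ library from UMD; as a stand-in, the same query pattern with scipy's cKDTree looks like this (query k+1 neighbors and drop the self-match, since every point is its own nearest neighbor):

```python
from scipy.spatial import cKDTree

def knn_with_kdtree(Y, k=100):
    """Y : n-by-d array of reduced dRMS coordinates (e.g. d = 20).

    Returns, for each row, the indices of its k nearest neighbors.
    Stand-in for the ANN (UMD) kd-tree library used on the slide.
    """
    tree = cKDTree(Y)
    _, idx = tree.query(Y, k=k + 1)   # k+1 because each point finds itself
    return idx[:, 1:]                 # drop the self-match in column 0
```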

  16. Why Does Averaging Work?
  • The mean distance of the i-th point from the origin is O(n^0.5), and its standard deviation is also O(n^0.5)
  • There is a very high correlation between dRMS using the full distance vector and dRMS using only the distances between “highly” separated points
  • The distortion added by averaging has mean 0 and standard deviation O(n^0.5)
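A quick numerical check of the square-root scaling claim on an idealized random-walk chain (the chain model is my assumption; real backbones only resemble it statistically):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_chain(n, rng):
    """Idealized backbone: n unit-length steps in random directions."""
    steps = rng.normal(size=(n, 3))
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)
    return np.cumsum(steps, axis=0)

# Mean distance between points separated by s residues grows like sqrt(s),
# so widely separated points have large, hard-to-distort distances.
for sep in (4, 16, 64):
    dists = [np.linalg.norm(c[sep] - c[0])
             for c in (random_chain(sep + 1, rng) for _ in range(2000))]
    print(f"|i-j| = {sep:3d}: mean distance {np.mean(dists):.2f}"
          f"  (sqrt(|i-j|) = {sep ** 0.5:.2f})")
```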

  17. Conjecture
  The important differences between two conformations lie in the distances between “highly” separated points. These distances are large and are therefore only slightly distorted by averaging.
