Sequence Clustering Algorithm for Similarity Analysis

Overview • Describe problem • Old algorithm and results • Describe new algorithm • Nuts and bolts • examples

Overview of task • Basic task is to cluster a set of sequences by sequence similarity. • Run blast matrix on all sequences. • Parse results identifying and removing “repeats” and flagging potential chimeras • Output to a file a summary of results containing (identifier:pValue:match length: %identity:consistentEnds) >411435: (86852794:1.0e-313:643:99:1, 91812944:2.2e-161:324:100:0) • Build clusters from file.

Old Algorithm Connected components analysis…all things put into cluster that are connected… threshhold for making “genes” is 150 bp length, 97% identity. -- A -> B and B -> C then A, B, and C in cluster -- Break mega clusters by increased similarity requirements. 150 bp, 97% identity Summary of distribution 37809: 1 1086: 1 741: 1 730: 1 595: 1 7: 882 6: 1001 5: 1251 4: 1782 3: 3101 2: 10020 1: 563655 Mega - 200bp, 98% Summary of distribution 2761: 1 719: 1 553: 1 480: 1 398: 1 7: 23 6: 22 5: 29 4: 21 3: 65 2: 226 1: 11799 New 150bp, 97% Summary of distribution 1073: 1 694: 1 425: 1 397: 1 352: 1 7: 1321 6: 1508 5: 1379 4: 1969 3: 3101 2: 10020 1: 563655

Cluster_82 (212 sequences): 33 assemblies: No NR protein Similarities 31 assemblies: (AB022209) ribonucleoprotein F [Rattus norvegicus] 22 assemblies: CTLA-2-BETA PROTEIN PRECURSOR 13 assemblies: 60S RIBOSOMAL PROTEIN L8 11 assemblies: 40S RIBOSOMAL PROTEIN S25 8 assemblies: (XM_038141) hypothetical protein XP_038141 [Homo sapiens] 8 assemblies: (D13628) KIAA0003 [Homo sapiens] 7 assemblies: ORPHAN NUCLEAR RECEPTOR NURR1 (NUR-RELATED FACTOR 1) 6 assemblies: (BC000047) Similar to ribosomal protein L8 [Homo sapiens] 6 assemblies: nuclear orphan receptor HZF-3 - rat 6 assemblies: (BC012197) Similar to ribosomal protein L8 [Homo sapiens] 6 assemblies: ORPHAN NUCLEAR RECEPTOR NURR1 (NUR-RELATED FACTOR 1) (REGENERATING LIVER NUCLEAR RECEPTOR 1) (RNR-1) (SL-322) (NUCLEAR ORPHAN RECEPTOR HZF-3) 5 assemblies: DUAL SPECIFICITY PROTEIN PHOSPHATASE 2 (DUAL SPECIFICITY PROTEIN PHOSPHATASE PAC-1) 5 assemblies: ANGIOPOIETIN-1 PRECURSOR (ANG-1) 5 assemblies: UBIQUITIN-CONJUGATING ENZYME E2-21 KD UBCH6 (UBIQUITIN-PROTEIN LIGASE) (UBIQUITIN CARRIER PROTEIN) 4 assemblies: (U72345) orphan nuclear hormone receptor [Rattus norvegicus] 4 assemblies: HETEROGENEOUS NUCLEAR RIBONUCLEOPROTEIN F (HNRNP F) 3 assemblies: (AJ278700) nuclear receptor related 1 [Oryzias latipes] 3 assemblies: (AK056456) unnamed protein product [Homo sapiens] 2 assemblies: gene TINUR protein - human 2 assemblies: hypothetical protein F13M23.40 - Arabidopsis thaliana 2 assemblies: PROTEIN PHOSPHATASE 2C ALPHA ISOFORM (PP2C-ALPHA) (IA) (PROTEIN PHOSPHATASE 1A) 2 assemblies: (XM_050500) hypothetical protein XP_050500 [Homo sapiens] 2 assemblies: (XM_035338) similar to ribosomal protein L8 (H. sapiens) [Homo sapiens] 2 assemblies: (BC016736) heterogeneous nuclear ribonucleoprotein F [Homo sapiens] 2 assemblies: (AK017138) putative [Mus musculus] 2 assemblies: thyroid/steroid receptor homolog RNR-1 - rat 1 assemblies: HOMEOBOX PROTEIN NKX-2.3 1 assemblies: (AF259673) protein phosphatase 2C alpha 1b [Mus musculus] 1 assemblies: ORPHAN NUCLEAR RECEPTOR NURR1 (IMMEDIATE-EARLY RESPONSE PROTEIN NOT) (TRANSCRIPTIONALLY INDUCIBLE NUCLEAR RECEPTOR) 1 assemblies: Mus musculus magnesium dependent protein phosphatase A1 mRNA 1 assemblies: (AK012591) putative [Mus musculus] 1 assemblies: (AF259672) protein phosphatase 2C alpha 3 [Mus musculus] 1 assemblies: (AF202036) homeobox transcription factor NKX2-3 [Mus musculus] 1 assemblies: (AJ414568) ribosomal protein L8 [Paracentrotus lividus] 1 assemblies: (NM_010090) dual specificity phosphatase 2 [Mus musculus] 1 assemblies: (D50080) lamin B1 [Mus musculus] The example that started it all…thanks to Crabtree!

New Algorithm • Graph based…represent sequences as nodes and edges as similarities that meet the cutoff (for example 150 bp and 97% identity) • Build cliques not trying to make best cliques (NP complete) but rather avoiding worst….look ahead 1 step. • Clique is set of nodes, all of which have edges to all others. • Merge cliques starting with smallest and working up based on linking edges again avoiding worst choice when have multiple choices. • Merged cliques no longer have edges between all members. • Merged cliques represent the clusters I am generating.

Example graphs…

Making cliq ues • For each node in the graph, put that node into the clique that is (largest|smallest) into which it can fit. • (draw example…) • Try to avoid making bad choices when have multiple options… • “grow cliques” when finished…go through cliques and merge cliques if can create a valid clique that is larger. • When making cliques trying to create the largest, then am never able to grow cliques, when making cliques from smallest, then can grow cliques. Final distribution is similar but not identical. • Have not focused on this part of the problem…perhaps that has been a mistake…ideas anyone??

Merging cliques…core of the problem. • If we require only a single edge to merge cliques then have connected components. • How to determine the number of edges to require? • I have opted for using what I call a logBase (lb). In order to merge cliques, the following must be met: • Number of links >= int(loglb(minCliqueSize)) + 1 • Number of links = minimum number of nodes in clique with edges to other clique. (draw this..next slide) • minCliqueSize = number of nodes in the smallest clique • lb = any number > 1 • If there are multiple possible merges of same size, try to pick the best one (at least NOT the worst).

Merging Cliques lb=1.5 1: 1 2: 2 3: 3 4: 4 5: 4 6: 5 7: 5 8: 6 9: 6 10: 6 20: 8 50: 10 100: 12 lb=2 1: 1 2: 2 3: 2 4: 3 5: 3 6: 3 7: 3 8: 4 9: 4 10: 4 20: 5 50: 6 100: 7 lb=3 1: 1 2: 1 3: 2 4: 2 5: 2 6: 2 7: 2 8: 2 9: 3 10: 3 20: 3 50: 4 100: 5 lb=6 1: 1 2: 1 3: 1 4: 1 5: 1 6: 2 7: 2 8: 2 9: 2 10: 2 20: 2 50: 3 100: 3 minCl=50 lb: val 2: 6 3: 4 4: 3 5: 3 6: 3 7: 3 8: 2 9: 2 10: 2

More specifics • First make cliques and then “grow” them as described. • Currently sorting graph small to large and adding to cliques small to large • Merge cliques always small to large and cliques may participate in only one merge event / pass. Keep iterating till have 0 merge events in pass. • First merge using logbase 1.5 to do very stringent merge. • Then merge at user specified logbase. • Optionally can set threshhold cutoffs for accepting cliques, keep things smaller than this cutoff and redo analysis with more stringent logbase.

Command parameters USAGE: buildBlastClusters.pl --lengthCutoff=i ##length cutoff for similarities to create an edge in graph --percentCutoff=i ##percent cutoff for similarities to create an edge in graph --consistentEnds! ##ends must be consistent if true to create an edge in graph --chimeraFile=<filename> ##file containing chimeras identified by blast matrix...will be removed from graph --ignoreFile <filename of sequences to ignore> ##file of identifiers to NOT put into graph --verbose! ##verbose output --files '<matrix, files>' ##matrix files from blast matrix (can be gzipped or not) --debug! ##lots of debugging output --logBase=f ##logBase to use for determining linkages (NOTE: logBase 1.5 is always used initially to merge small clusters prior to using this logBase value in second pass --logBaseMax=f ##like logBase but increments up by ints to the logBaseMax from logBase 2 (ie, if 4 then does 3 then 4). --iterateCliqueSize=i ##cutoff size of cliques to accept in first iteration. Nodes from these cliques will be removed and the merging re-run using iterateLogBase as logBase value. --iterateLogBase=f ## value for logBase to use for large clusters...should be less that logBase! --iterateLogBaseMax=f ##like logBaseMax but for iteration --iterateDescending! ##if true then iterates the iteration stating with logBase - 1 down to --minIterateLogBase --minIterateLogBase=f [2] ##if --iterateDescending, determines minimum logBase to use --useAllLinks! ##use total links between cliques for determining best clique to merge into --useCloneIds=i ##uses clone_ids when clusters don't meet logBase but have links --sort! ##sort output by cluster size (note if iterating then will get two sorts..first smaller than iterateCliqueSize and then the iterated cliques. --help ##prints this usage statement

Sequence Clustering Algorithm for Similarity Analysis

Sequence Clustering Algorithm for Similarity Analysis

Presentation Transcript

Overview

Overview

Overview

Overview

Overview

Overview

Overview

Overview

Overview

OVERVIEW

Overview

Overview

Overview

Overview

Overview

OVERVIEW

Overview

Overview

OVERVIEW

Overview

Overview

Overview