Multiple Indexing Sequence Alignment for Group Feature Extraction

Group Feature Extraction Based on Multiple Indexing Sequence Alignment (Chinese title: Multiple Indexing Sequence Alignment Applied to Group Feature Extraction) • Dr. Tun-Wen Pai, Dept. of Computer Science and Engineering, National Taiwan Ocean University • 2006.10.30

Central idea: finding short approximate patterns • Motivation: finding ordered combinatorial features • Objectives: • constructing evolutionary relationships • providing key features for structural alignment

System Architecture

Motif finding • short consensus motifs with tolerable characteristics • variable-site tolerance: designated sites in a pattern may vary • substitutable tolerance: residues with similar chemical properties may substitute for one another in a pattern

Variable-site tolerance • exploiting the uniqueness and efficient lookup of hashing techniques • each original pattern maps to a unique numeric value • patterns are compared through a hash table structure
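
The hashing idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the MISA implementation: the base-21 encoding, the choice to mask only interior sites, and the names `encode`, `masked_variants`, `build_table`, and `shared_patterns` are all assumptions made for the example.

```python
from itertools import combinations

# Assumed alphabet: a wildcard for a variable site plus the 20 standard residues.
ALPHABET = "*ACDEFGHIKLMNPQRSTVWY"

def encode(pattern):
    """Map a pattern to an integer that is unique among patterns of the same length."""
    value = 0
    for ch in pattern:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value

def masked_variants(window, max_wild):
    """Yield the window with up to max_wild interior sites replaced by '*'."""
    interior = range(1, len(window) - 1)
    for n in range(max_wild + 1):
        for sites in combinations(interior, n):
            yield "".join("*" if i in sites else ch for i, ch in enumerate(window))

def build_table(sequences, k, max_wild):
    """Hash table: encoded pattern -> (pattern text, ids of sequences containing it)."""
    table = {}
    for seq_id, seq in sequences.items():
        for start in range(len(seq) - k + 1):
            for pattern in masked_variants(seq[start:start + k], max_wild):
                _, owners = table.setdefault(encode(pattern), (pattern, set()))
                owners.add(seq_id)
    return table

def shared_patterns(table, n_sequences):
    """Patterns that occur in every one of the n_sequences input sequences."""
    return sorted(p for p, owners in table.values() if len(owners) == n_sequences)
```

For instance, indexing the two motif elements VPVHFD and VPVHLD with `k=6` and `max_wild=1` leaves `VPVH*D` as the only pattern shared by both, because the two strings differ at a single interior site.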

Substitutable tolerance • depends on the chemical properties of residues • substitution matrix: BLOSUM62 • bitwise clustering avoids misjudging two dissimilar residues as substitutable
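
One way to picture bitwise clustering is to give every residue a bitmask of the groups it belongs to and to combine masks with AND. The sketch below is illustrative only: the residue groups in `GROUPS` are a hypothetical property-based grouping chosen for the example, not the clusters used in the talk and not derived from BLOSUM62 here.

```python
from functools import reduce

# Hypothetical residue groups (an assumption for illustration only).
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KRH", "DE", "G", "P"]

# Each residue gets one bit per group it belongs to; H sits in two groups.
MASK = {}
for bit, group in enumerate(GROUPS):
    for residue in group:
        MASK[residue] = MASK.get(residue, 0) | (1 << bit)

def substitutable(a, b):
    """Two residues may substitute for each other if they share at least one group."""
    return (MASK[a] & MASK[b]) != 0

def site_is_tolerable(residues):
    """A site is tolerable only if ALL residues seen there share a common group."""
    return reduce(lambda x, y: x & y, (MASK[r] for r in residues)) != 0
```

With these assumed groups, F and H are substitutable and so are H and K, yet `site_is_tolerable("FHK")` is false: ANDing the masks keeps F and K from being treated as similar merely because each resembles H.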

Hierarchical clustering • reveals phylogenetic relationships • the more consensus motifs two sequences share, the more similar they are • a scoring matrix records the pairwise similarities
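
A small sketch of this step, under stated assumptions: similarity is taken to be the raw count of shared motifs, and clusters are merged by average linkage. The slides do not specify either choice, and the function names are hypothetical.

```python
from itertools import combinations

def similarity_matrix(motifs_by_seq):
    """Pairwise score = number of consensus motifs two sequences share."""
    ids = sorted(motifs_by_seq)
    return {(a, b): len(motifs_by_seq[a] & motifs_by_seq[b])
            for a, b in combinations(ids, 2)}

def hierarchical_clusters(motifs_by_seq):
    """Agglomerative clustering; returns the merged clusters in merge order."""
    score = similarity_matrix(motifs_by_seq)

    def average_link(c1, c2):
        total = sum(score[tuple(sorted((a, b)))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in sorted(motifs_by_seq)]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the highest average similarity.
        left, right = max(combinations(clusters, 2),
                          key=lambda pair: average_link(*pair))
        clusters.remove(left)
        clusters.remove(right)
        merged = tuple(sorted(left + right))
        clusters.append(merged)
        merges.append(merged)
    return merges
```

The merge order can be read as a tree: early merges join sequences that share many motifs, later merges join more distant groups.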

Exclusive Group Feature Extraction • removing common motifs that also occur in other subgroups • CP: combinatorial patterns • ECP: exclusive combinatorial patterns
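
Going from CP to ECP amounts to a set difference per subgroup. The sketch below assumes patterns are compared as exact strings; the function name and the input layout (a dict from subgroup name to its set of patterns) are hypothetical.

```python
def exclusive_patterns(cp_by_group):
    """For each subgroup, keep only the patterns that no other subgroup contains."""
    ecp_by_group = {}
    for group, patterns in cp_by_group.items():
        others = set().union(*(p for g, p in cp_by_group.items() if g != group))
        ecp_by_group[group] = patterns - others
    return ecp_by_group
```

A pattern present in two or more subgroups is dropped from all of them, so what remains characterizes exactly one subgroup.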

Background Model Analysis • verifying conspicuousness • hit ratio close to 0: the feature is unique • relatively large hit ratio: the feature is insignificant
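
The slides do not define the hit ratio, so the following sketch assumes one plausible reading: the fraction of background sequences in which the pattern occurs, with `*` standing for any single residue. Both the definition and the names are assumptions.

```python
import re

def pattern_to_regex(pattern):
    """Translate a pattern such as 'VPVH*D' into a regex; '*' matches one residue."""
    return re.compile("".join("." if ch == "*" else re.escape(ch) for ch in pattern))

def hit_ratio(pattern, background):
    """Fraction of background sequences that contain the pattern at least once."""
    regex = pattern_to_regex(pattern)
    hits = sum(1 for seq in background if regex.search(seq))
    return hits / len(background)
```

Under this reading, a pattern that almost never appears in the background set scores near 0 and is treated as distinctive, while a pattern found in a large share of unrelated sequences carries little information.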

The combinatorial features of RNase A-like superfamily extracted by MISA

The combinatorial features of RNase A-like superfamily extracted by MISA (cont.) • The known H-K-H active sites are identified exactly

The combinatorial features of RNase A-like superfamily extracted by ClustalW • The first H was misaligned

The combinatorial features of RNase A-like superfamily extracted by ClustalW (cont.)

The combinatorial features of RNase A-like superfamily extracted by MEME

The combinatorial features of RNase A-like superfamily extracted by MEME (cont.) • The first ‘H’ was not successfully detected

The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler
Seq, Site  Motif type  Left end  Motif element (with flanks)  Right end  Probability  Strand  Sequence description
1, 1       1           65        qekvt CKNGQ gncyk            69         1.00         F       1E21:A
1, 2       0           107       kerhi IVACE gspyv            111        1.00         F       1E21:A
1, 3       2           116       egspy VPVHFD asved           121        1.00         F       1E21:A
2, 1       1           38        nyqrr CKNQN tfllt            42         1.00         F       1GQV:A
2, 2       0           109       anmfy IVACD nrdqr            113        1.00         F       1GQV:A
2, 3       2           127       pqypv VPVHLD rii             132        1.00         F       1GQV:A
3, 1       1           37        nyrwr CKNQN tflrt            41         1.00         F       1DYT:A
3, 2       0           108       grrfy VVACD nrdpr            112        1.00         F       1DYT:A
3, 3       2           125       prypv VPVHLD tti             130        1.00         F       1DYT:A
4, 1       1           65        ttniq CKNGK mnche            69         1.00         F       1RNF:A
4, 2       0           105       strrv VIACE gnpqv            109        1.00         F       1RNF:A
4, 3       2           114       egnpq VPVHFD g               119        1.00         F       1RNF:A
5, 1       1           59        kaice NKNGN phren            63         1.00         F       1B1I:A
5, 2       0           104       gfrnv VVACE nglpv            108        1.00         F       1B1I:A
5, 3       2           111       aceng LPVHLD qsifr           116        1.00         F       1B1I:A
15 motifs • Strand: forward motif (F) or reverse complement (R) • Sequence description: taken from the FASTA input

The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler (cont.) • The first ‘H’ was not successfully detected • The motif colored in red is wrong

Comparison of Average RMSD and Aligned Residues (using a straightforward structure alignment) • The lowest average RMSD • The highest average number of aligned residues
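
For reference, RMSD over a set of aligned residue pairs can be computed as below. This is a generic sketch that assumes the two structures are already superposed and that each residue is represented by one 3D coordinate (for example its C-alpha atom); the slides do not say how the alignment or superposition was done.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

A lower RMSD over a larger number of aligned residues indicates that the extracted features anchor the structural alignment more consistently.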

MISA for primate map1b upstream sequences Ref: D. Liu and I. Fischer, “Structural analysis of the proximal region of the microtubule-associated protein 1B promoter”, J Neurochem, 1997, 69: pp. 910-919

MISA for primate hspa2

Hierarchical clustering for P450 family 1 • The family can be clustered into three subfamilies

Combinatorial features for subfamily 1A

Combinatorial features for subfamily 1B

Combinatorial features for subfamily 1C

Exclusive group features for P450 family 1
• cytochrome P450 subfamily 1A: ^ E*L*A ^ *PK*L* ^ *W*ARR*LA* ^ L**FS ^ *SC*LEEH*S*E ^ G*F*P ^ *V*SV*NVI ^ *DF*P*LR*LP* ^ **EHY**F ^ **DIT**L ^ **ELD** ^ R*P*LS
• cytochrome P450 subfamily 1B: ^ F*R*A ^ WK**R ^ R*F*T ^ **RYP**Q*R*Q ^ DQ**LP ^ G**NK*L* ^ **HQC** ^ **LLD**
• cytochrome P450 subfamily 1C: ^ SI**EWSG**QPAL*A*F ^ **EAC*W* ^ F**YSKQW**HRK*AQS**RAFS*AN*QT* ^ EA**LV**FL ^ F*P*HE*T ^ N**FF**V**KV**HR ^ W**LL ^ *AK*RG*
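
To test a new sequence against such an ordered list of motifs, one could scan for each motif in turn, starting where the previous match ended. The sketch assumes that `*` denotes a single variable site and that `^` merely separates consecutive motifs with an arbitrary gap; the slides do not spell out this notation, and the function name is hypothetical.

```python
import re

def matches_in_order(sequence, motifs):
    """Return (start, end) spans if all motifs occur in the given order, else None."""
    position = 0
    spans = []
    for motif in motifs:
        regex = re.compile("".join("." if ch == "*" else re.escape(ch) for ch in motif))
        match = regex.search(sequence, position)
        if match is None:
            return None
        spans.append((match.start(), match.end()))
        position = match.end()  # the next motif must start after this one
    return spans
```

For example, the made-up sequence `AAFKRCAWKLLR` contains `F*R*A` followed by `WK**R`, but not the two motifs in the reverse order.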
