This research extracts combinatorial features from biological sequences using multiple indexing sequence alignment (MISA). The objective is to identify short approximate patterns and to build evolutionary relationships from them. Key elements include constructing motifs that tolerate variable sites and chemically similar substitutions, using hashing for efficient pattern searching, and grouping substitutable residues with a substitution matrix such as Blosum62. The motifs shared between sequences are then used to reveal phylogenetic relationships and to supply key features for structural alignment.
Group Feature Extraction Based on Multiple Indexing Sequence Alignment (MISA) • Dr. Tun-Wen Pai, Dept. of Computer Science and Engineering, National Taiwan Ocean University • 2006.10.30
Central idea: finding short approximate patterns • Motivation: finding ordered combinatorial features • Objectives: constructing evolutionary relationships; providing key features for structural alignment
Motif finding • short consensus motifs that allow two kinds of tolerance • variable-site tolerance: the tolerated sites in a pattern may hold any residue • substitutable tolerance: a residue in a pattern may be replaced by a residue with similar chemical properties
Variable-site tolerance • exploits the uniqueness and fast lookup of hashing techniques • each original pattern maps to a unique numeric value • patterns are compared through a hash table structure
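The hashing idea above can be sketched as follows. This is a minimal illustration, not MISA's actual implementation: it assumes fixed-length windows, marks each tolerated site with `*`, and lets a Python dict stand in for the hash table. The names `masked_variants`, `index_patterns`, and `consensus_patterns`, and the parameters `length` and `max_wild`, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def masked_variants(window, max_wild):
    """Yield the window plus every copy with up to max_wild sites set to '*'."""
    for k in range(max_wild + 1):
        for sites in combinations(range(len(window)), k):
            chars = list(window)
            for i in sites:
                chars[i] = "*"
            yield "".join(chars)


def index_patterns(sequences, length=5, max_wild=1):
    """Build a hash table: masked pattern -> ids of the sequences containing it."""
    table = defaultdict(set)
    for seq_id, seq in sequences.items():
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            for pattern in masked_variants(window, max_wild):
                table[pattern].add(seq_id)
    return table


def consensus_patterns(table, n_sequences):
    """Patterns that occur in every sequence (assumed consensus criterion)."""
    return sorted(p for p, ids in table.items() if len(ids) == n_sequences)


# Two five-residue motif elements that differ at their last two sites.
table = index_patterns({"a": "CKNGQ", "b": "CKNQN"}, length=5, max_wild=2)
print(consensus_patterns(table, 2))
```

Because every masked variant is a dictionary key, finding the patterns shared by all sequences is a single pass over the table rather than a pairwise comparison of windows. The number of variants grows quickly with `max_wild`, so a sketch like this is only practical for short patterns.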
Substitutable tolerance • depends on the chemical properties of residues • substitution matrix: Blosum62 • bitwise clustering: avoids treating two dissimilar residues as substitutable
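One way to read "bitwise clustering" is to give each residue a bitmask of the groups it belongs to and to call two residues substitutable only when their masks share a bit. The sketch below assumes that reading. The residue groups are illustrative placeholders chosen by chemical property; they are not the groups used in the talk, which derives them from Blosum62.

```python
# Illustrative residue groups (an assumption, not the talk's actual clusters).
GROUPS = [
    "ILVM",  # aliphatic / hydrophobic
    "FWY",   # aromatic
    "KRH",   # positively charged
    "DE",    # negatively charged
    "STNQ",  # polar, uncharged
    "AG",    # small
    "C",     # cysteine
    "P",     # proline
]

# One bit per group; a residue's mask has the bit of every group it is in.
MASK = {}
for bit, group in enumerate(GROUPS):
    for residue in group:
        MASK[residue] = MASK.get(residue, 0) | (1 << bit)


def substitutable(a, b):
    """True when two residues share at least one group (bitwise AND is non-zero)."""
    return (MASK[a] & MASK[b]) != 0


def patterns_match(p, q):
    """True when two equal-length patterns agree site by site up to substitution."""
    return len(p) == len(q) and all(substitutable(a, b) for a, b in zip(p, q))


print(patterns_match("IVACE", "VVACD"))
print(patterns_match("VPVHFD", "VPVHLD"))
```

Under these placeholder groups, `IVACE` and `VVACD` match because I/V and E/D each share a group, while `VPVHFD` and `VPVHLD` do not because F and L fall in different groups. A different grouping would change that outcome, which is why the choice of clusters matters.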
Hierarchical clustering • reveals phylogenetic relationships • the more consensus motifs two sequences share, the more similar they are • a scoring matrix holds the pairwise similarities
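The clustering step can be sketched as below. The slide only says that sharing more consensus motifs means greater similarity. Normalising the shared count as a Jaccard index and merging clusters by average linkage are assumptions made here for illustration, and the sequence names and motif labels in the example are made up.

```python
from itertools import combinations


def similarity(motifs_a, motifs_b):
    """Jaccard similarity of two motif sets (assumed normalisation)."""
    union = motifs_a | motifs_b
    return len(motifs_a & motifs_b) / len(union) if union else 0.0


def cluster(motif_sets):
    """Average-linkage agglomerative clustering; returns a nested tuple of names."""
    # Each cluster is (tree, member names); start with one cluster per sequence.
    clusters = [(name, [name]) for name in sorted(motif_sets)]

    def avg_sim(pair):
        (_, members_a), (_, members_b) = pair
        total = sum(similarity(motif_sets[x], motif_sets[y])
                    for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Merge the two clusters with the highest average pairwise similarity.
        a, b = max(combinations(clusters, 2), key=avg_sim)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical motif sets for three sequences.
example = {
    "s1": {"m1", "m2", "m3"},
    "s2": {"m1", "m2", "m4"},
    "s3": {"m5", "m6"},
}
print(cluster(example))
```

The nested tuple returned by `cluster` is a simple stand-in for a dendrogram: sequences that share the most motifs are joined first.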
Exclusive Group Feature Extraction • removing motifs that also occur in other subgroups • CP: combinatorial patterns • ECP: exclusive combinatorial patterns
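Read as a set operation, the step from CP to ECP is a set difference: a subgroup keeps only the patterns that no other subgroup contains. A minimal sketch under that reading, with a hypothetical function name and made-up patterns:

```python
def exclusive_patterns(cp_by_group):
    """Map each group to its ECP: the CPs that occur in no other group."""
    ecp = {}
    for group, patterns in cp_by_group.items():
        # Union of the combinatorial patterns of every other group.
        others = set().union(
            *(p for g, p in cp_by_group.items() if g != group)
        )
        ecp[group] = patterns - others
    return ecp


# Hypothetical combinatorial patterns for two subgroups.
cp = {
    "g1": {"A*C", "D*E"},
    "g2": {"D*E", "G*H"},
}
print(exclusive_patterns(cp))
```

Here `D*E` is dropped from both subgroups because it occurs in each, leaving `A*C` as exclusive to `g1` and `G*H` as exclusive to `g2`.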
Background Model Analysis • verifying how conspicuous each pattern is • hit ratio close to 0: the pattern is unique • relatively large hit ratio: the pattern is insignificant
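A hit ratio of this kind can be sketched as the fraction of background sequences in which a pattern occurs. The slide does not say how the background model is built. Using shuffled copies of the input sequences is an assumption made here, as is treating `*` as "any single residue". All function names are hypothetical.

```python
import random
import re


def pattern_to_regex(pattern):
    """Compile a pattern, treating '*' as any single residue (assumed meaning)."""
    return re.compile(
        "".join("." if c == "*" else re.escape(c) for c in pattern)
    )


def hit_ratio(pattern, background):
    """Fraction of background sequences that contain the pattern."""
    regex = pattern_to_regex(pattern)
    hits = sum(1 for seq in background if regex.search(seq))
    return hits / len(background)


def shuffled_background(sequences, copies=100, seed=0):
    """Assumed background model: residue-shuffled copies of each sequence."""
    rng = random.Random(seed)
    background = []
    for seq in sequences:
        for _ in range(copies):
            chars = list(seq)
            rng.shuffle(chars)
            background.append("".join(chars))
    return background


# One of the two toy sequences contains H, any residue, then K.
print(hit_ratio("H*K", ["AHGKA", "AAAA"]))
```

A pattern whose hit ratio against the background stays near 0 is unlikely to arise by chance from composition alone, which matches the slide's "unique" case. A pattern that hits a large share of the background carries little information.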
The combinatorial features of RNase A-like superfamily extracted by MISA
The combinatorial features of RNase A-like superfamily extracted by MISA(cont.) • The known H-K-H active sites are identified exactly
The combinatorial features of RNase A-like superfamily extracted by ClustalW • The first H was misaligned
The combinatorial features of RNase A-like superfamily extracted by ClustalW(cont.)
The combinatorial features of RNase A-like superfamily extracted by MEME
The combinatorial features of RNase A-like superfamily extracted by MEME(cont.) • The first ‘H’ was not successfully detected
The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler

1, 1, 1   65 qekvt CKNGQ  gncyk  69 1.00 F 1E21:A
1, 2, 0  107 kerhi IVACE  gspyv 111 1.00 F 1E21:A
1, 3, 2  116 egspy VPVHFD asved 121 1.00 F 1E21:A
2, 1, 1   38 nyqrr CKNQN  tfllt  42 1.00 F 1GQV:A
2, 2, 0  109 anmfy IVACD  nrdqr 113 1.00 F 1GQV:A
2, 3, 2  127 pqypv VPVHLD rii   132 1.00 F 1GQV:A
3, 1, 1   37 nyrwr CKNQN  tflrt  41 1.00 F 1DYT:A
3, 2, 0  108 grrfy VVACD  nrdpr 112 1.00 F 1DYT:A
3, 3, 2  125 prypv VPVHLD tti   130 1.00 F 1DYT:A
4, 1, 1   65 ttniq CKNGK  mnche  69 1.00 F 1RNF:A
4, 2, 0  105 strrv VIACE  gnpqv 109 1.00 F 1RNF:A
4, 3, 2  114 egnpq VPVHFD g     119 1.00 F 1RNF:A
5, 1, 1   59 kaice NKNGN  phren  63 1.00 F 1B1I:A
5, 2, 0  104 gfrnv VVACE  nglpv 108 1.00 F 1B1I:A
5, 3, 2  111 aceng LPVHLD qsifr 116 1.00 F 1B1I:A

15 motifs

Column 1 : Sequence Number, Site Number
Column 2 : Motif type
Column 3 : Left End Location
Column 4 : Motif Element
Column 5 : Right End Location
Column 6 : Probability of Element
Column 7 : Forward Motif (F) or Reverse Complement (R)
Column 8 : Sequence Description from FASTA input
The combinatorial features of RNase A-like superfamily extracted by Gibbs Sampler (cont.) • The first ‘H’ was not successfully detected • The motif colored in red is wrong
Comparison of Average RMSD and Number of Aligned Residues (using a straightforward structure alignment) • The lowest average RMSD • The highest average number of aligned residues
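For reference, RMSD over n aligned residue pairs is the square root of the mean squared distance between paired coordinates. The sketch below computes it for coordinates that are assumed to be already superposed. The slide does not state how the structures were superposed, so that step is left out, and the function name and example coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of already-superposed (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between one pair of aligned residues.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two aligned pairs: one identical, one displaced by 2 along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

A lower average RMSD alone can be misleading if few residues are aligned, which is why the slide reports it together with the number of aligned residues.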
MISA for primate map1b upstream sequences Ref: D. Liu and I. Fischer, “Structural analysis of the proximal region of the microtubule-associated protein 1B promoter”, J Neurochem, 1997, 69: pp. 910-919
Hierarchical clustering for p450 family 1 • The family can be clustered into three subfamilies
Exclusive group features for p450 family 1
• cytochrome P450 subfamily 1A: ^ E*L*A ^ *PK*L* ^ *W*ARR*LA* ^ L**FS ^ *SC*LEEH*S*E ^ G*F*P ^ *V*SV*NVI ^ *DF*P*LR*LP* ^ **EHY**F ^ **DIT**L ^ **ELD** ^ R*P*LS
• cytochrome P450 subfamily 1B: ^ F*R*A ^ WK**R ^ R*F*T ^ **RYP**Q*R*Q ^ DQ**LP ^ G**NK*L* ^ **HQC** ^ **LLD**
• cytochrome P450 subfamily 1C: ^ SI**EWSG**QPAL*A*F ^ **EAC*W* ^ F**YSKQW**HRK*AQS**RAFS*AN*QT* ^ EA**LV**FL ^ F*P*HE*T ^ N**FF**V**KV**HR ^ W**LL ^ *AK*RG*