1 / 64

Genomic Sequence alignments and its application

Genomic Sequence alignments and its application. 조환규 교수 부산대학교 공과대학 정보 컴퓨터 공학부 Hgcho@pusan.ac.kr. Biology and Informatics. Mathematics : Physics = X : Biology X = ? Bioinformatics Understanding Biological System with Informatics Molecular Biology Computational Biology Genomics

Télécharger la présentation

Genomic Sequence alignments and its application

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Genomic Sequence alignments and its application 조환규 교수 부산대학교 공과대학 정보 컴퓨터 공학부 Hgcho@pusan.ac.kr

  2. Biology and Informatics • Mathematics : Physics = X : Biology • X = ? • Bioinformatics • Understanding Biological System with Informatics • Molecular Biology • Computational Biology • Genomics • Proteomics • And several –omics

  3. Main features of This Talk • 이미 잘 정리된 Computing methodology를 어떻게 Bioinformatics에서 활용하는가 ? • Bioinformatics에서 잘 정리된 방법론은 CS쪽에서 어떻게 활용하는가 ? • Case Study • Genomic sequencing alignment와 program-copy detection과의 연관

  4. Computing Space Transform • Normal Space Jewelry Space PLUS MINUS

  5. Computing Space Transform(2) • Normal Space log Space log X , Y log X , log y multiply exponent log x + log y X * Y

  6. Computing Space Transform(3) • Program Space Genome Seq Space Program b Program a Protein a Protein b Basic keyword CLUSTAL-W( a, b ) Pairwise Alignment Similarity( , ) Similarity( , )

  7. Namely “우물론” • 한 우물을 팔 것인가 ? • 그러나 만일 끝끝내 물이 안 나올 경우라면 • 여러 우물을 돌아가며 조금씩 팔 것인가 ? • 이것도 저것도 아니라면 ? • 그렇다면 어떻게 팔 것인가 ? • Avoid “reinventing wheel”

  8. Genomic Sequence Alignments • Genomic Sequences, linear • DNA, RNA, Protein(amino acids) • Why linear ? • Goal of Molecular Biology or Life Science • Characterizing functions of genes • Understanding internal gene interactions • Understanding internal & external interactions • Drug targeting(protein interaction) • Why Alignment ?

  9. Human Binome BANK in 3280 • Year 3280, dooms day • So many binary data files • Chips( cell-phone, cooker, TV… et al) • Computer disks • Figure out the contents of followings • 010111010101000100101000000001010101010….? • 100000001010111111111001010101010010101…? • 101011111111111110001010010100101010100…?

  10. Human-Binome-Project • They decided to establish Binary BANK. • Some are partially annotated • Object code, text data , garbage • Protein , Gene , Junk-DNA • HUMAN-Binome PROJECT….. Starts… • 목적: 각종 binary sequence의 기능을 탐색 • Mini-binary project ~~ E.Coli, C.Elegans • Cell-phone, calculator, PDA… • Full sequencing of 300 Giga bytes DISK • HUMAN BINOME BANK(HBB)!

  11. For an Unknown Seq. X • Sequence X from a hardware • Several error bits included • Fragment sequencing • Find a similar pattern in HBB • Function ? • Region ? • Size ? • Write a BIG paper…. • Here it comes………

  12. 반도지역 SM-6-5에서 흔히 발견되는Seq-45-6-X 종의 기능에 대한 고찰 팔봉이, 2급 원숭이 우주대학, 분자고생물학과 이번에 SM-6-5지역에서 빈번하게 발견된 유전종들의 기능은 …. 아마도 이 부족들이 집중적으로 사용한 언어를 표기하기 위한 도구의 일부로 추정된다. 또한 이들은 매우 다양한 표기형식을 가지고 있는 것으로 추정되는데, 그 이유는 각 단어들이 나타나는 빈도에서 추정되는 엔트로피가 PR-99지역에 나타나는 약 26개보다 월등히 많은 곳으로 추정되므로 ……………………. Journal of Science & Sciences No.187 (Vol. 213), pp.232-265, 3288

  13. Dynamic Programming • A Basic Methodology • For all kinds of alignment • Solution from all sub-partial solution • 준비물 • Objective function • Dynamic programming formula(recursion) • F(n) = F(n-1) + F(n-2), • Base condition, F(0)=F(1)=1 • Table, multi-dim. array structure

  14. Global Alignment(1) • Basic scoring: • Match: 1, Mismatch: -1, Space: -2 • How? • To find the alignment of two sequences of maximal score • Sequence alignment problem corresponds to the longest path problem form the source to the sink in this directed acyclic graph.

  15. 1 -1 -3 -5 -7 -9 -11 -13 -1 2 0 -2 -4 -6 -8 -10 -3 0 1 -1 -1 -3 -5 -7 -5 -2 -1 0 0 -2 -2 -4 -7 -4 -3 -2 -1 1 -1 -1 C A C A G T G T C A - - G - G T Global Alignment(2) • CACAGTGT 와 CAGGT C A C A G T G T 0 -2 -4 -6 -8 -10 -12 -14 -16 C -2 A -4 G -6 G -8 T -10

  16. Local Alignment(1) • An alignment between a substring of s and a substring of t • Each entry of (I,j) will hold the highest score of an alignment between a suffix of s[1…i] and a suffix of t[1…j] 예) AGGTATTGA -CCTATGGC

  17. 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 2 0 0 2 0 0 0 1 0 3 2 0 0 1 1 0 0 1 2 1 0 0 0 0 0 0 0 1 A G G T A T T A - - C T A T G C Local Alignment(2) • AGGTATTG 와 CTATGC A G G T A T T A 0 0 0 0 0 0 0 0 0 C 0 T 0 A 0 T 0 G 0 C 0

  18. Semi-global Alignment(1) • Given two sequences, check if one of them has a substring similar to the other entire sequence. • How? • Find alignments ignoring the beginning and end spaces of the sequences • Global alignment 와 비교 • CAGCA -CTTGGATTCTCGG <-semi-global • - - -CAGCGTGG- - - - - - - (score: -19) • CAGCACTTGGATTCTCGG <-global • CAGC- - - - -G -T- - - -GG (score: -12)

  19. 1 -1 -3 -5 -7 -9 -11 -13 -1 2 0 -2 -4 -6 -8 -10 -3 0 1 -1 -1 -3 -5 -7 -5 -2 -1 0 0 -2 -2 -4 -7 -4 -3 -2 -1 1 -1 -1 C A C A G T G T - - - C A G G T - - Semi-global Alignment(2) • CACAGTGT 와 CAGGT C A C A G T G T 0 -2 -4 -6 -8 -10 -12 -14 -16 C -2 A -4 G -6 G -8 T -10

  20. General Gap Penalty • Definition • Gap: consecutive number k > 1 of spaces • When mutations are involved, the occurrence of a gap with k spaces is more probable than the occurrence of k isolated spaces • w(k) : penalty associated with a gap with k spaces

  21. Affine Gap Penalty Function • Penalty for consecutive spaces <= isolated spaces • Sub-additive function • w(k1 +k2+…_kn) <= w(k1) + w(k2) +…+w(kn) • Three arrays for dynamic programming • a[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in s[i] matched with t[j] • b[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in a space matched with t[j] • c[i,j] = maximum score of an alignment between s[1…i] and t[1…i] that ends in s[i] matched with a space

  22. Heuristic Alignment • Main difficulties • Search space, O(n^2) space or O(n^2 log n)time • Optimality or Biologically-good Distance metric • Multiple alignment • Local search • Diagonal region searching • Visualization., e.g., Dotlet • BLAST approach for long sequence • Small word matching • And Extending from a highly matched region

  23. Multiple Alignment • ------AG---T----CGCTGC---- • ------AGCGAT--CGCGCTGC--- • ---TCGAGGCAA--GCTGCTGC----- • ------GGCGAT----CGCTGC----- • Problem hardness: • Optimal alignment : NP-hard • Almost kinds of object functions • What if more than 1000 sequence ? • SPACE COMPELXITY IN PRACTICE • Pairwise alignment • Star Alignment • Tree alignment

  24. Why multiple alignment ? • Finding Conserved regions • Computer virus phylogeny constructing • 300 sp./year • More than 10000 sp. : N • Number of files in a system : M > 100000 • Detecting a CV takes O( N*M) checks! • tuberculosis • 8 종, a conserved region, and polymorphic sites • 김철민 교수님(부산의대) – 진단용 칩 제작 • Phylogeny construct

  25. Phylogeny 1 : hard Version

  26. Phylogeny 2 : Probable version

  27. Constructing Phylogenetic Tree • Distance matrix • Optimal Tree ? • Degree constraint, Steiner points, Quartet method B D C E A

  28. PART 2: Application Detecting Source Code Plagiarism

  29. Plagiarism, Plagiarism, Plagiarism • Linear Structure • Genomic sequences • Plain articles • Programs • Human behaviors on the time-line • Time-series data sets • Student Reports Plagiarism • Assignment Program copying • Where is the original version of this one ? • Web searching redundancy elimination

  30. 지문법 기반 표절 검사 시스템

  31. 구조 기반 표절 검사 시스템

  32. Fingerprinting Method • Keyword frequency similarity • 특정한 단어의 사용횟수 • Fingerprinting object • Fixed size fingerprint • Easy to making Database • Quick searching • High false positive rate • Example, fingerprint vector A c x t u x g r N …..

  33. Attacking • Inserting redundant words • Shuffling • Cons and Pros • Easy to use in document application • Hard to use in program file • Recent trends • Structure-oriented similarity measure • Greedy-Block-Removing methods… • Is this a basic concept of local alignment ? • Sample-Report-Server Building

  34. Undergraduate Assignment • Programming Assignment cheating: • 이론적으로는 그 구별방법이 없다. • Assignment cheating은 비용이 크다 • Password breaking by Mafia • Assignment의 출력은 동일하다. • Correct program들끼리만 비교 • 과제에 주어진 시간은 비교적 짧다(3-4일 정도). • 수강생의 수는 적절하다(300 명 이하). • 프로그래밍 언어는 모두 동일하다.

  35. Program Cheating Techniques • Complete Copying • Variable exchange • Garbage code insertion • Function transpose • Code rewriting(partially) • Library code replacing • Merging different codes • Function resolving • Function rewriting

  36. Computing Space Transform • Program Space Genome Seq Space Program b Program a Protein a Protein b Basic keyword CLUSTAL-W( a, b ) Pairwise Alignment Similarity( , ) Similarity( , )

  37. PROGRAM to PROTEIN • Program Language • Keyword = { int, float, class….. } • Block Structure = “}”, “{“ • Program Chromosome • Location independent code, JAVA class, C files • Non-Coding region • /* this is a sample non-coding region */ • Promoter • Variable declaration, class definition • DNA = keyword sequence

  38. Extracting Program DNA • Syntactic level • Semantic level • Program Flow-graph Syntactic running syntax Real running

  39. Flow-Graph Linearization $ A B S R S B W Q % $ A B S R R R R S B W Q % $ A B W W W W Q % $ A W W Q % A S B R W Q

  40. Example main( ) { int i, j , k ; ………… for( I = 1 . I <= 100 , i++) { ……… if ( ) x = y ; else ……. while( ccccc ) { } x = 23984 ; } // end of for ……….. } int for if = else while = AGTCGCTTCGAAGCAA

  41. Why Protein mapping ? • DNA sequence overlap • if = AA, then = AG, * = GA, return = GG • AAGGA = AG + GA or AA + GG + A • Ambiguity resolving • 20 Amino acid bases • About 20 keywords • 2-3 groups • polar, non-polar • hydrophobic, hydrophilic • Charged, uncharged • Small, large

  42. Amino Acids classification

  43. Keyword Mapping Strategy • Convertibility = { for , while } Easy • Convertibility = { for, then } Hard • Convertibility = { if, ‘=‘ } Impossible? • Procedure • Preprocessing • Chromosome arrangement • Keyword selection • Protein mapping

  44. Copy Detecting System • CDS components = [K, M, P, T, A, G ] • Keyword table 20 keyword • Matching Score matrix borrow from Protein(PAM) • Affine Gap Penalty • Threshold length • Alignment Set Scoring • maXimum Gap allowing

  45. Experiment Overview • Sample programs “data structure” • Students , 60 • Programming assignments, 12 set • 1 semester • On-line evaluation system = ESPA • Java-based on-line evaluation system • Due, 1 week • We do not monitor all programs

  46. Clustal-W (1)(www2.ebi.ac.uk/clusterw) • Input Fasta file

  47. Clustal-W(2) (www2.ebi.ac.uk/clusterw) • Output

  48. PhyloDraw (cho et al, Bioinformatics 2001.)

  49. Experiment Result 1-0 • 유사한 그룹이 있는11개의 프로그램

  50. Experiment Result 1-1 • Unit distance topological Representation

More Related