http://www.gfxtra.com/dl/texture+tree+pines. Some evolutionary tree reconstruction problems in computational biology. Chen Yen Hung Taipei Municipal University of Education. Outline. Introduction to Bioinformatics and Computational Biology
Some evolutionary tree reconstruction problems in computational biology • Chen Yen Hung, Taipei Municipal University of Education (Image: http://www.gfxtra.com/dl/texture+tree+pines)
Outline • Introduction to Bioinformatics and Computational Biology • Introduction to the evolutionary tree reconstruction problems • Tree Alignment Problems • Steiner Tree Problems (STP) • Full Sibling Reconstruction problems • Our algorithms for solving these problems • Conclusions (These slides are for use only in a talk at the Department of Computer Science and Engineering, Yuan Ze University.)
Three capabilities of computers • Computation • Algorithms (programs), parallel algorithms, … • Storage • Databases, files, memory, … • Communication • Networks, communication, mobile phones, …
Computational Biology and Bioinformatics (Images: http://commons.wikimedia.org/wiki/File:Rat_eating_or_praying%3F.jpg http://commons.wikimedia.org/wiki/File:Alien.png)
Computational Biology and Bioinformatics • Bioinformatics is the application of statistics, applied mathematics and computer science to solve biological problems. • Computational biology is an interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems. The main focus lies on developing mathematical modeling, computational simulation techniques and algorithm design and analysis. http://en.wikipedia.org/wiki/Bioinformatics http://en.wikipedia.org/wiki/Computational_biology
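DNA • Human cells have 23 pairs of chromosomes, which are in fact made of tightly coiled DNA. DNA has been shown to be the basic material of heredity; it is a long-chain molecule built from the four bases A, G, C and T. A gene is a DNA segment that stores the template for making a protein. • In 1953, Crick and Watson published their paper on the double-helix structure of DNA in Nature. Ref: "The Emerging Field of Bioinformatics" (新興的生物資訊學), 趙坤茂 (Kun-Mao Chao), Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Taiwan University, http://web1.nsc.gov.tw/ct.aspx?xItem=8270&ctNode=40&mp=1 http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/images/molecularmachine.jpg
The basic logic of bioinformatics • Sequence/Structure Similarity • Sequence/Structure Homology • Functional Conservation • Phylogenetic Analysis
Protein Structure • Primary structure • Secondary structure: Red: helix, Yellow: sheet, Other: coil • Tertiary structure: Viagra http://commons.wikimedia.org/wiki/File:Pancreatic_lipase%E2%80%93colipase_complex_with_inhibitor_1LPB.png
Protein Structure • The Proteomics Center used Amazon EC2 together with Amazon S3 to build a cloud computing service for proteomics analysis, named ViPDAC (http://proteomics.mcw.edu/vipdac). Through the ViPDAC web interface, users select parameters and upload their own spectra as .mgf files to Amazon S3; the computation then runs on a virtual cluster on Amazon EC2, the results are stored in Amazon S3, and the processed results can finally be downloaded to a personal computer for further analysis. http://www.bcrc.firdi.org.tw/detail_news.do?newsid=213931386
Drug Design • Devil Fruit • Rumble Ball http://www.iidmm.uct.ac.za/sturrock/research.htm http://commons.wikimedia.org/wiki/File:Aldose_reductase_1us0.png http://commons.wikimedia.org/wiki/File:Quinacrine_mustard_in_Trypanothione_reductase_active_site.png
What do Computer Scientists do? • Model important and interesting problems from biological research as computational problems (input/output) • Design algorithms to solve these problems
What do Computer Scientists do? • Bioinformatics has many hard problems: the data volumes are large and the problems themselves are quite hard (NP-hard), so computer programs and mathematical analysis are needed to help biologists handle the problems at hand. • When scientists meet an unfamiliar problem, they transform the original problem into a more familiar form, and then solve the original problem with tools from the field of the new problem. We see this often in bioinformatics research. Ref: "Finding High-Density Segments: Problem Reformulation in Bioinformatics" (高密度片段的尋找), 呂學一 (Hsueh-I Lu), Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Taiwan University, http://web1.nsc.gov.tw/ct.aspx?xItem=8273&ctNode=40&mp=1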
Sequencing Problem • Input : • Output : AGACTAGTCTGTATAGACTAGCCT • Reduction: Maximum independent set in the interval graph
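The slide names the target of the reduction but its input figure did not survive, so the following is only a minimal sketch of the target problem itself: a maximum independent set in an interval graph can be found greedily by always taking the interval that ends first. The function name and the example intervals are hypothetical, not taken from the talk.

```python
from typing import List, Tuple

def max_independent_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Greedy maximum independent set for closed intervals [start, end].

    Two intervals are adjacent in the interval graph when they overlap,
    so an independent set is a set of pairwise disjoint intervals.
    """
    chosen = []
    last_end = None
    # Scan by increasing right endpoint and keep every interval that
    # starts strictly after the last chosen one has ended.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start > last_end:
            chosen.append((start, end))
            last_end = end
    return chosen

# Hypothetical example: six intervals, four of which are pairwise disjoint.
example = [(1, 4), (3, 5), (0, 6), (5, 7), (8, 11), (12, 16)]
picked = max_independent_intervals(example)
```

Sorting dominates the running time, so the sketch runs in O(n log n) for n intervals.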
Intractable problems • We will learn how to mathematically characterize the difficulty of computational problems. • There is a class of problems that can be solved in a reasonable amount of time and another class that cannot. (What good is it for a problem to be solvable, if it cannot be solved in the lifetime of the universe?) • The field of cryptography, for example, relies on the fact that the computational problem of "breaking a code" is intractable.
CMI Millennium Prize • The Clay Mathematics Institute (Cambridge, MA, USA) offered US$1,000,000 for each of seven open problems on May 24, 2000 in Paris: • Birch and Swinnerton-Dyer Conjecture • Hodge Conjecture • Navier-Stokes Equations • P vs NP • Poincaré Conjecture • Riemann Hypothesis • Yang-Mills Theory http://www.csie.ntu.edu.tw/~hil/teach.html
Grigory Perelman (born 1966) • Russian mathematician • Solved the Poincaré Conjecture in 2003 • Fields medalist, 2006 • Declined to accept the award http://elsecretodezara.blogspot.com/2008/09/el-hombre-mas-inteligente-del-mundo.html http://www.csie.ntu.edu.tw/~hil/teach.html
How “hard” is NP-complete? • So far, scientists only know how to solve an NP-complete problem in $O(c^n)$ time for some constant $c$. (All currently known algorithms for NP-complete problems take exponential time.) http://www.iis.sinica.edu.tw/~hil/wput/approx-mcu-2004.ppt http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt
On the bright side… • If you can come up with an algorithm for such a problem that runs in $O(n^{1000000})$ time, then you will be awarded the Turing Award for sure, plus US$1,000,000 from CMI. http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt
NP-completeness = Dead end ? http://activerain.com/blogsview/108625/realtor-killed-in-the-basement http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt
NP-completeness = Dead end ? • Or maybe we can settle for good algorithms? 1. Heuristic algorithms 2. Approximation algorithms 3. Randomized algorithms http://www2.ee.ntu.edu.tw/~yen/courses/al-01/approximation.ppt
Approximation • An algorithm that returns an answer $C$ (the approximate solution) which is "close" to the optimal solution $C^*$ is called an approximation algorithm. • "Closeness" is usually measured by the ratio bound $\rho(n)$ the algorithm produces, • which is a function that satisfies, for any input size $n$, $\max\{C/C^*, C^*/C\} \le \rho(n)$. • Minimization problem: $C^* \le C \le \rho(n)\,C^*$, $\rho(n) > 1$ • Maximization problem: $C \le C^* \le \rho(n)\,C$, $\rho(n) > 1$ http://www2.ee.ntu.edu.tw/~yen/courses/al-01/approximation.ppt
Approximation Algorithm • Criterion 1: feasibility (the output must be a valid solution) • Always outputs a feasible solution. • Criterion 2: tractability (polynomial time) • Always runs in polynomial time. • Criterion 3: quality (with a guarantee) • The solution's quality is always provably not too far from that of an optimal solution. http://www.iis.sinica.edu.tw/~hil/random/ra20040225.ppt
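As a quick illustration of the ratio definition, here is a worked instance with made-up numbers (the values 10 and 15 are assumptions chosen only for the example, not results from the talk). For a minimization problem with optimum 10 and a returned answer of 15:

```latex
C^* = 10,\quad C = 15
\;\Longrightarrow\;
\max\left\{\frac{C}{C^*}, \frac{C^*}{C}\right\}
= \max\{1.5,\ 0.666\ldots\} = 1.5 ,
\qquad
C^* \le C \le 1.5\, C^* .
```

So this single run is consistent with any ratio bound of at least 1.5; a guarantee, of course, has to hold for every input and not just one.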
Evolutionary Tree Reconstruction in Biology http://commons.wikimedia.org/wiki/File:Human-evolution.jpg http://commons.wikimedia.org/wiki/File:Phylogenetic_tree_of_Theropods_respiratory_system_01.JPG
Evolutionary Tree Reconstruction in Biology • Goal: reconstruct the evolutionary tree, including the extinct species, from present-day species. • The tree structure can be given (from inference or previously known data) or unknown. • The tree structure can be rooted or unrooted. http://insystemicthinking.wordpress.com/2007/12/11/funny-you-dont-look-different/ http://www.niu.edu/pubaffairs/releases/2000/mar/primate/tree.html http://darwinshealthclass.blogspot.com/2011/02/my-evolution.html
Phylogenetic (Evolutionary) Tree Reconstruction Problem • Input: present-day species (DNA sequences) and, optionally, a tree structure • Output: an evolutionary tree • Distance between two species (sequences): evolutionary time or some distance metric such as Hamming distance or Levenshtein (Edit) distance • Goal: depends on the distance metric (Ex: MiniMax (Bottleneck), MiniSum, MiniSize)
Tree Alignment Problem (TAP) • Input: a set $W$ of $n$ sequences (strings) and a tree structure $T$ with $n$ leaves, each of which is labeled with a unique sequence in $W$ • Output: a sequence label for each internal vertex of $T$ • The distance on an edge of the tree is defined as the edit distance between the two sequences labeling the two ends of the edge. • Goal: find a tree alignment such that the sum of the edit distances of all its edges is minimized.
Levenshtein (Edit) distance • An alphabet $\Sigma$ is a non-empty set of symbols. Given an alphabet $\Sigma$, $\Sigma^*$ denotes the set of all finite-length sequences of symbols from $\Sigma$. Given two sequences $w$ and $w'$, we say $w$ rewrites into $w'$ in one step if one of the following correction rules holds: • (1) $w = axb$, $w' = ab$, with $a, b \in \Sigma^*$ and $x \in \Sigma$ (deletion of the single symbol $x$ from $w$) • (2) $w = ab$, $w' = axb$, with $a, b \in \Sigma^*$ and $x \in \Sigma$ (insertion of the single symbol $x$ into $w$) • (3) $w = axb$, $w' = ayb$, with $a, b \in \Sigma^*$, $x, y \in \Sigma$ and $x \ne y$ (single-symbol substitution)
Levenshtein (Edit) distance • The Levenshtein (Edit) distance between $w$ and $w'$, denoted $d(w, w')$, is the smallest number of steps needed to rewrite $w$ into $w'$. • Ex: $d(xyxx, xxy) = 2$: delete $y$ from $xyxx$ to get $xxx$, then substitute the last $x$ by $y$.
Bottleneck Tree Alignment Problem (BTAP) • Input: a set $W$ of $n$ sequences (strings) and a tree structure $T$ with $n$ leaves, each of which is labeled with a unique sequence in $W$ • Output: a sequence label for each internal vertex of $T$ • Goal: find a tree alignment such that the largest edit distance on any edge is minimized.
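The definition above can be computed with the standard dynamic program over prefixes. This is a minimal sketch; the function name `edit_distance` is an illustrative choice, and the check uses the slide's own example.

```python
def edit_distance(w: str, w2: str) -> int:
    """Levenshtein distance: fewest single-symbol deletions,
    insertions and substitutions that rewrite w into w2."""
    # prev[j] holds the distance between the current prefix of w and w2[:j].
    prev = list(range(len(w2) + 1))
    for i, a in enumerate(w, start=1):
        cur = [i]  # rewriting w[:i] into the empty string takes i deletions
        for j, b in enumerate(w2, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from w
                cur[j - 1] + 1,          # insert b into w
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = cur
    return prev[-1]

# The example from the slide: d(xyxx, xxy) = 2.
slide_example = edit_distance("xyxx", "xxy")
```

The table has (|w|+1)(|w'|+1) cells, which is where the $L^2$ factors for sequences of length at most $L$ come from.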
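To make the difference between the TAP and BTAP objectives concrete, the sketch below scores a fully labeled tree under both. The tiny star-shaped tree, the vertex names `u`, `w1`, `w2`, `w3` and the helper `alignment_costs` are assumptions made for the illustration; only the sequences TGC, ATGC and A come from the talk's example set $W$.

```python
def edit_distance(w: str, w2: str) -> int:
    """Levenshtein distance via the usual prefix dynamic program."""
    prev = list(range(len(w2) + 1))
    for i, a in enumerate(w, start=1):
        cur = [i]
        for j, b in enumerate(w2, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def alignment_costs(edges, label):
    """Return (sum of edge distances, largest edge distance) for a labeled tree.

    edges: list of (u, v) vertex pairs; label: vertex -> sequence.
    The first value is the TAP objective, the second the BTAP objective.
    """
    dists = [edit_distance(label[u], label[v]) for u, v in edges]
    return sum(dists), max(dists)

# Hypothetical tree: one internal vertex u joined to three leaves.
leaves = {"w1": "TGC", "w2": "ATGC", "w3": "A"}
edges = [("u", "w1"), ("u", "w2"), ("u", "w3")]

costs_tgc = alignment_costs(edges, {**leaves, "u": "TGC"})  # (4, 3)
costs_agc = alignment_costs(edges, {**leaves, "u": "AGC"})  # (4, 2)
```

Both labelings have the same total distance, but labeling `u` with AGC lowers the bottleneck from 3 to 2, so the two objectives can prefer different internal labels.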
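An Example of the Tree Alignment • $W$ = {TGC, ATGC, A, TGCG, AATT, TTATT} • Distance: Levenshtein (Edit) distance • [Figure: tree $T$ with leaves $w_1, \dots, w_6$ and internal vertices $w_a$, $w_b$, $w_c$, comparing a TAP labeling with a BTAP labeling. Internal labels shown: $w_a$ = AGC / TGC, $w_b$ = TATT / TATT, $w_c$ = TGT / AGT. Total distances shown: 11 and 10; bottleneck distances shown: 3 and 2.]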
Our results • The bottleneck tree alignment problem is NP-complete. • An $O(n^3 + n^2L^2)$-time algorithm for the bottleneck lifted tree alignment problem, where $L$ is the maximum length of the sequences in $W$. • An $O(nL^2)$-time algorithm for the bottleneck tree alignment problem when the distance function is an ultrametric.
Steiner Tree Problem in Graphs • Input: a graph $G = (V, E)$ with a length function $d: E \to \mathbb{R}^+$ and a set of terminals $R \subseteq V$ • Output: a tree $T$ of $G$ spanning all vertices in $R$ • Objective: minimize the length of $T$ (min-sum)
An Example of STP • [Figure: weighted graph $G = (V, E)$ on vertices a through i with terminal set $R$ = {a, b, g, f}; the legend distinguishes terminal vertices from Steiner vertices.]
An Example of STP (continued) • [Figure: the same weighted graph on vertices a through i.]
Other Applications • Multicasting http://www.cisco.com/en/US/tech/tk828/tech_brief09186a00800a4415.html • Finding an optimal solution of the STP is NP-hard (the decision version is NP-complete).
Previous Results for STP • Constant ratio approximation algorithms:
• 2 [Hwang, Richards & Winter, 1992]
• 11/6 by Zelikovsky [Algorithmica'93]
• 16/9 by Berman & Ramaiyer [Journal of Algorithms'94]
• 1.73 by Borchers & Du [SIAM Journal on Computing'97]
• 5/3 by Prömel & Steger [Journal of Algorithms'00]
• 1.64 by Karpinski & Zelikovsky [Journal of Combinatorial Optimization'97]
• 1.59 by Hougardy & Prömel [SODA'99]
• $1 + \frac{\ln 3}{2} \approx 1.55$ by Robins & Zelikovsky [SIAM DM'05]
• $\ln 4 + \varepsilon \approx 1.39$ by Byrka, Grandoni, Rothvoß & Sanità [STOC'10]
• Cannot be approximated within a ratio better than 1.006, by Thimm [TCS'03]
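The slide only cites the ratio-2 result, without describing an algorithm. A common way to reach ratio 2 is the minimum spanning tree of the terminals' metric closure, and the sketch below assumes that method (it is not claimed to be the cited authors' exact procedure). All function names and the example graph are hypothetical; the graph is assumed connected.

```python
import heapq
from itertools import combinations

def dijkstra(adj, src):
    """Shortest-path distances and predecessors from src."""
    dist, pred, heap = {src: 0}, {}, [(0, src)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if du + w < dist.get(v, float("inf")):
                dist[v], pred[v] = du + w, u
                heapq.heappush(heap, (du + w, v))
    return dist, pred

def kruskal(vertices, edges):
    """Minimum spanning forest of (weight, u, v) edges via union-find."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def steiner_2approx(adj, terminals):
    terminals = sorted(terminals)
    sp = {t: dijkstra(adj, t) for t in terminals}
    # 1. MST of the metric closure restricted to the terminals.
    closure = [(sp[s][0][t], s, t) for s, t in combinations(terminals, 2)]
    # 2. Replace each closure edge by a shortest path in the graph.
    sub = set()
    for _, s, t in kruskal(terminals, closure):
        v = t
        while v != s:
            u = sp[s][1][v]
            sub.add((adj[u][v],) + tuple(sorted((u, v))))
            v = u
    # 3. Break any cycles; the length can only go down.
    return kruskal({x for _, a, b in sub for x in (a, b)}, sub)

# Hypothetical graph: terminals a, b, c around one Steiner vertex s.
adj = {"a": {"s": 10, "b": 19, "c": 19}, "b": {"s": 10, "a": 19, "c": 19},
       "c": {"s": 10, "a": 19, "b": 19}, "s": {"a": 10, "b": 10, "c": 10}}
tree = steiner_2approx(adj, {"a", "b", "c"})
length = sum(w for w, _, _ in tree)
```

On this graph the sketch returns a tree of length 38, while the optimum uses the Steiner vertex `s` and has length 30, which stays within the factor-2 guarantee. Pruning of leftover non-terminal leaves is omitted for brevity.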
Terminal Steiner Tree Problem • Terminal Steiner tree: a Steiner tree with all terminals as its leaves • Terminal Steiner tree problem (TSTP): • Input: a complete graph $G = (V, E)$ with a length function $d: E \to \mathbb{R}^+$ and a set of terminals $R \subseteq V$ • Output: a terminal Steiner tree $T$ for $R$ in $G$ • Objective: minimize the length of $T$ (min-sum)
An Example of TSTP • [Figure: weighted complete graph $G = (V, E)$ on vertices a through i with terminal set $R$ = {a, b, g, f}; the legend distinguishes terminal vertices from Steiner vertices.]
An Example of TSTP • Finding an optimal solution is NP-hard (the decision version is NP-complete). • [Figure: the same weighted graph on vertices a through i.]
Previous Results for TSTP • Performance ratios of approximation algorithms: • $\rho + 2$ by Lin & Xue [IPL'02] for the case in which the length function is metric (i.e., satisfies the triangle inequality), where $\rho$ is the best-known performance ratio for the STP • 8/5 by Lu, Tang & Lee [TCS'03] for the special case in which the edge lengths are either 1 or 2 • Our result for TSTP (with metric length function): • $2\rho$-approximation algorithm [COCOON'03] • $\left(2\rho - \frac{\rho}{3\rho - 2}\right) \approx 2.515$-approximation algorithm by Martinez, Pina & Soares [TCS'07]
Bottleneck Steiner Tree Problem • Input: a graph $G = (V, E)$ with a length function $d: E \to \mathbb{R}$ and a set of terminals $R \subseteq V$ • Output: a tree $T$ of $G$ spanning all vertices in $R$ • Objective: minimize the length of the largest edge in $T$ (min-max) • Previous time complexity: • $O(|N|^2 + |E| \log\log |E|)$ time by Chiang, Sarrafzadeh & Wong [IEEE on CAD'90] • $O(|E|)$ time by Duin & Volgenant [EJOR'97]
Bottleneck Terminal Steiner Tree Problem • Bottleneck terminal Steiner tree problem (BTSTP): • Input: a complete graph $G = (V, E)$ with a length function $d: E \to \mathbb{R}^+$ and a set of terminals $R \subseteq V$ • Output: a terminal (full) Steiner tree $T$ • Objective: minimize the length of the largest edge in $T$ (min-max) • Our result for BTSTP: $O(|E| \log |E|)$ time
An Example of BTSTP • [Figure: three panels on a weighted graph with vertices labeled A and B: the input graph, an optimal solution for TSTP, and an optimal solution for BTSTP.]
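One simple way to see why the bottleneck version is polynomially solvable: the optimal bottleneck is the smallest threshold at which the terminals fall into one connected component when only edges up to that length are allowed. The sketch below assumes this threshold view and uses sorting plus union-find, so it runs in $O(|E| \log |E|)$ and is not the linear-time method cited on the slide. The function name and the example edges are hypothetical.

```python
def bottleneck_steiner_value(edges, terminals):
    """Smallest b such that all terminals are connected using only
    edges of length <= b. edges: iterable of (length, u, v)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    terminals = list(terminals)
    if len(terminals) <= 1:
        return 0  # nothing to connect (a convention chosen for this sketch)
    # Add edges from shortest to longest until the terminals meet.
    for length, u, v in sorted(edges):
        parent[find(u)] = find(v)
        root = find(terminals[0])
        if all(find(t) == root for t in terminals):
            return length
    raise ValueError("terminals are not connected in the graph")

# Hypothetical example: going through Steiner vertex s avoids the long edge a-b.
edges = [(1, "a", "s"), (1, "s", "b"), (5, "a", "b"), (2, "b", "c")]
value_ab = bottleneck_steiner_value(edges, ["a", "b"])
value_abc = bottleneck_steiner_value(edges, ["a", "b", "c"])
```

Any spanning tree of the terminals' component at that threshold is then an optimal bottleneck Steiner tree.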
Approximation Algorithm for TSTP • We assume that $G$ contains no edge between any two terminals. • We apply the best-known approximation algorithm for the STP and obtain a Steiner tree $S_{APX}$ for $R$ in $G$. • If all vertices of $R$ are leaves in $S_{APX}$, then $S_{APX}$ is a terminal Steiner tree of $G$; otherwise, we use the following Algorithm 1 to transform it into a terminal Steiner tree.
Some Notation for Algorithm 1 • $N_G(r)$: the set of the neighbors of $r \in R$ in $G$ • Note that the members of $N_G(r)$ are all Steiner vertices, because we assume that $G$ contains no edge between any two terminals. • $N_{GS}(r)$: the nearest neighbor of $r$ in $G$ • That is, $d(r, N_{GS}(r)) = \min\{d(r, v) \mid v \in N_G(r)\}$. • $D(N_{GS}(R))$: the sum of the lengths of the edges $(r, N_{GS}(r))$ over all $r \in R$ in $G$ • [Figure: the weighted example graph on vertices a through i.]
Some Notation for Algorithm 1 • $N(r)$: the set of the neighbors of $r \in R$ in $S_{APX}$ • Note that the members of $N(r)$ are all Steiner vertices, because we assume that $G$ contains no edge between any two terminals. • $N_1(r)$: the nearest neighbor of $r$ in $S_{APX}$ • That is, $d(r, N_1(r)) = \min\{d(r, v) \mid v \in N(r)\}$. • $N_2(r)$: the second nearest neighbor of $r$ in $S_{APX}$ • [Figure: an example terminal $r$ with incident edges of lengths 1, 2 and 3, marking $N(r)$ and $N_1(r)$.]
Algorithm 1 /* To transform $S_{APX}$ into a terminal Steiner tree */
For each $r \in R$ with $|N(r)| \ge 2$ in $S_{APX}$ do
• Remove all the edges in $star(r) \setminus \{(r, N_1(r))\}$ from $S_{APX}$
• $star(r)$: the subtree of $S_{APX}$ induced by $\{(r, v) \mid v \in N(r)\}$
• Find a minimum spanning tree $MST(N(r))$ of $G[N(r)]$ and add all the edges of $MST(N(r))$ into $S_{APX}$
• $G[N(r)]$: the subgraph of $G$ induced by $N(r)$
End For
[Figure: two terminals $r_1$ and $r_2$ with $star(r_1)$, $star(r_2)$, $N_1(r_1)$ and $N_1(r_2)$ marked.]
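The loop of Algorithm 1 can be sketched directly. This is a minimal illustration under stated assumptions: the tree is a collection of vertex pairs, `d[u][v]` is a symmetric length lookup, no tree edge joins two terminals, and the names `symmetric`, `mst_edges`, `algorithm1` and the five-vertex example are all hypothetical.

```python
def symmetric(lengths):
    """Build a symmetric lookup d[u][v] from {(u, v): length}."""
    d = {}
    for (u, v), w in lengths.items():
        d.setdefault(u, {})[v] = w
        d.setdefault(v, {})[u] = w
    return d

def mst_edges(vertices, d):
    """Prim's algorithm on the complete graph induced by `vertices`."""
    vertices = list(vertices)
    in_tree, edges = {vertices[0]}, []
    while len(in_tree) < len(vertices):
        _, u, v = min((d[u][v], u, v)
                      for u in in_tree for v in vertices if v not in in_tree)
        in_tree.add(v)
        edges.append((u, v))
    return edges

def algorithm1(tree_edges, terminals, d):
    """Turn a Steiner tree into a terminal Steiner tree (Algorithm 1)."""
    tree = {frozenset(e) for e in tree_edges}
    for r in sorted(terminals):
        # N(r), sorted so that nbrs[0] is the nearest neighbor N1(r).
        nbrs = sorted((v for e in tree if r in e for v in e if v != r),
                      key=lambda v: (d[r][v], v))
        if len(nbrs) < 2:
            continue  # r is already a leaf
        # Remove star(r) except the edge (r, N1(r)).
        for v in nbrs[1:]:
            tree.remove(frozenset((r, v)))
        # Reconnect the former neighbors with MST(N(r)).
        for u, v in mst_edges(nbrs, d):
            tree.add(frozenset((u, v)))
    return tree

# Hypothetical instance: terminals a, b, c and Steiner vertices s, t.
d = symmetric({("a", "s"): 1, ("a", "t"): 2, ("s", "t"): 2,
               ("b", "t"): 1, ("c", "s"): 1})
s_apx = [("a", "s"), ("a", "t"), ("t", "b"), ("s", "c")]  # a has degree 2
t_apx = algorithm1(s_apx, {"a", "b", "c"}, d)
```

In the example, terminal `a` loses its edge to `t` and the Steiner edge `s`-`t` is added instead, so every terminal ends up as a leaf while the edge count stays at four.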
Approximation Ratio 1
• Let $T_{opt}$ and $S_{opt}$ be the optimal terminal Steiner tree and the optimal Steiner tree in $G$, respectively.
• $len(S_{opt}) \le len(T_{opt})$, since $T_{opt}$ is also a Steiner tree.
• $len(S_{APX}) \le \rho \cdot len(S_{opt})$, since $S_{APX}$ is obtained by the currently best-known approximation algorithm for the STP, whose performance ratio is $\rho$.
• $len(MST(N(r))) \le 2\,len(star(r)) - d(r, N_1(r)) - d(r, N_2(r))$ by the triangle inequality.
• $len(T_{APX1}) = len(S_{APX}) + \sum_{r \in R} \big( len(MST(N(r))) - len(star(r)) + d(r, N_1(r)) \big) \le 2\rho \cdot len(S_{opt}) - D(N_{GS}(R))$
[Figure: two terminals $r_1$ and $r_2$ with $star(r_1)$, $N_1(r_1)$ and $N_1(r_2)$ marked.]
Approximation Ratio 2
• $N_G(r)$: the set of the neighbors of $r \in R$ in $G$
• Note that the members of $N_G(r)$ are all Steiner vertices, because we assume that $G$ contains no edge between any two terminals.
• $N_{GS}(r)$: the nearest neighbor of $r$ in $G$
• That is, $d(r, N_{GS}(r)) = \min\{d(r, v) \mid v \in N_G(r)\}$.
• $D(N_{GS}(R))$: the sum of the lengths of the edges $(r, N_{GS}(r))$ over all $r \in R$ in $G$
• Modified length function: $d'(u, v) = d(u, v) + 3\,d(u, N_{GS}(u))$ if $u \in R$ and $v \in N_G(u)$; $d'(u, v) = d(u, v)$ otherwise.
[Figure: the example graph on vertices a through i, with each edge incident to a terminal relabeled by its original length plus 3 times that terminal's nearest-neighbor distance (shown as "+3*1" or "+3*2").]
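The modified length function can be sketched as follows, reading the slide as "add three times the terminal's nearest-Steiner-neighbor distance to every edge incident to that terminal". That reading, the helper names and the three-vertex example are assumptions made for this illustration.

```python
def nearest_steiner_distance(d, r, terminals):
    """d(r, N_GS(r)): length of the shortest edge from terminal r
    to a Steiner (non-terminal) vertex."""
    return min(w for v, w in d[r].items() if v not in terminals)

def modified_lengths(d, terminals):
    """Return d' where every terminal-to-Steiner edge (r, v) is
    lengthened by 3 * d(r, N_GS(r)); other edges are unchanged."""
    d2 = {u: dict(nbrs) for u, nbrs in d.items()}  # copy, keep d intact
    for r in terminals:
        bump = 3 * nearest_steiner_distance(d, r, terminals)
        for v in d[r]:
            if v not in terminals:
                d2[r][v] += bump
                d2[v][r] += bump  # keep the lookup symmetric
    return d2

# Hypothetical instance: terminal a, Steiner vertices s and t.
d = {"a": {"s": 1, "t": 2}, "s": {"a": 1, "t": 2}, "t": {"a": 2, "s": 2}}
terminals = {"a"}
d_prime = modified_lengths(d, terminals)
D_NGS = sum(nearest_steiner_distance(d, r, terminals) for r in terminals)
```

Here the nearest Steiner neighbor of `a` is `s` at distance 1, so both edges at `a` grow by 3 while the Steiner edge `s`-`t` keeps its length.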
Algorithm 2 /* To transform $S_{APX}$ into a terminal Steiner tree */
For each $r \in R$ with $|N(r)| \ge 2$ in $S_{APX}$ do
• Remove all the edges in $star(r)$ from $S_{APX}$
• $star(r)$: the subtree of $S_{APX}$ induced by $\{(r, v) \mid v \in N(r)\}$
• Find a minimum spanning tree $MST(N(r))$ of $G[N(r)]$ and add all the edges of $MST(N(r))$ into $S_{APX}$
• $G[N(r)]$: the subgraph of $G$ induced by $N(r)$
• Add the edge $(r, N_{GS}(r))$
End For
[Figure: two terminals $r_1$ and $r_2$ with $star(r_1)$, $star(r_2)$, $N_{GS}(r_1)$ and $N_{GS}(r_2)$ marked.]