
Project In Bioinformatics (236524) - Projects Presentation






Presentation Transcript


  1. Project In Bioinformatics (236524) - Projects Presentation. Presented by: Ma’ayan Fishelson

  2. Proposed Projects • Performing haplotyping on the input data. • Creating a friendly user-interface for the statistical genetics program SimWalk2. • Performing approximate inference by using a heuristic which ignores extreme markers in the computation. • Performing approximate inference via Iterative Join-Graph Propagation.

  3. Project #1

  4. Haplotyping • Many applications require haplotype information. • Unfortunately, the human genome is diploid, so genotype information is collected rather than haplotype information. • Efficient and accurate computational methods for haplotype reconstruction from genotype data are therefore in high demand.

  5. [Figure: a small pedigree with unordered genotypes (e.g. A1/A2) and the inferred ordered haplotypes (e.g. A1 | A2) for each individual. Inferred haplotype information, AK1 locus on chromosome 9 (Lange, 1997).]

  6. Project #1 • Goal of project #1: to perform haplotyping on the input data, i.e. to infer the most likely haplotypes for the individuals in the input pedigrees, via the MPE (Most Probable Explanation) query.

  7. Haplotyping – General Definition • Input: a set of multilocus phenotypes for the individuals of a pedigree. • Note: some individuals may be untyped. • Output: the most likely configuration of haplotypes (ordered genotypes) for the pedigree, i.e. a configuration with maximum probability. • Note: there may be more than one configuration with maximum probability.

  8. Bayesian Network • X = {X1,…,Xn} is a set of random variables. • A BN is a pair (G,P): • G is a directed acyclic graph over nodes that represent the random variables X. • P = {Pi | 1 ≤ i ≤ n}. Pi, defined on Xi and its parents pa(Xi) in G, is the conditional probability table associated with node Xi: Pi = P(Xi | pa(Xi)). • The BN represents a probability distribution over X: P(X1,…,Xn) = Πi P(Xi | pa(Xi)).
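To make the pair (G,P) concrete, here is a minimal sketch (not Superlink's code) of a two-variable network stored as conditional probability tables, together with the joint probability it defines; the variable names and numbers are made up for illustration.

```python
# Minimal sketch (illustrative only): a Bayesian network as a dictionary of
# CPTs, each table keyed by (value of the variable, values of its parents).
cpts = {
    "A": {"parents": (), "table": {(0,): 0.6, (1,): 0.4}},
    "B": {"parents": ("A",), "table": {(0, 0): 0.9, (1, 0): 0.1,
                                       (0, 1): 0.2, (1, 1): 0.8}},
}

def joint(assignment):
    """P(x1,...,xn) = product over i of P(xi | pa(xi)) for a full assignment."""
    p = 1.0
    for var, cpt in cpts.items():
        key = (assignment[var],) + tuple(assignment[q] for q in cpt["parents"])
        p *= cpt["table"][key]
    return p

print(joint({"A": 1, "B": 0}))  # P(A=1) * P(B=0 | A=1) = 0.4 * 0.2 = 0.08
```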

  9. Bayesian Network – Example • [Figure: a Bayesian network over the variables A, B, C, D, E, F with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C), P(F|E), and its moralized graph.] • P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)

  10. [Figure: the Bayesian network built by SUPERLINK for two loci and three individuals a, b, c, containing the genetic-locus variables Ga[ℓ,p], Ga[ℓ,m], Gb[ℓ,p], Gb[ℓ,m], Gc[ℓ,p], Gc[ℓ,m], the selector variables Sc[ℓ,p], Sc[ℓ,m], and the phenotype variables Pa[ℓ], Pb[ℓ], Pc[ℓ], for each locus ℓ = 1, 2.]

  11. Variables of the Bayesian Network (Likelihood Computation with Value Abstraction, with N. Friedman, D. Geiger, and N. Lotner.) Three types of random variables: • Genetic Loci. The variables Gi[a, p] and Gi[a, m] represent the paternal and maternal alleles of individual i at locus a (orange nodes). • Phenotypes. The variable Pi[a] denotes the value of phenotype a for individual i (yellow nodes). • Selector variables. The variables Si[a, p] and Si[a, m] specify whether individual i inherited his alleles from the paternal or the maternal haplotype of his father and mother, respectively, at locus a (green nodes).
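As a toy illustration of these three variable types (not Superlink's actual data structures), the sketch below merely enumerates the variable names that would be created for a small pedigree; which individuals are non-founders, and therefore get selector variables, is passed in explicitly.

```python
# Toy sketch: list the three kinds of random variables for a pedigree over a
# few loci, following the naming of the slide (illustrative only).
def network_variables(individuals, nonfounders, num_loci):
    genetic, phenotype, selector = [], [], []
    for i in individuals:
        for locus in range(1, num_loci + 1):
            # Paternal and maternal allele variables Gi[locus,p] / Gi[locus,m].
            genetic += [f"G{i}[{locus},p]", f"G{i}[{locus},m]"]
            # Phenotype variable Pi[locus].
            phenotype.append(f"P{i}[{locus}]")
    for i in nonfounders:
        for locus in range(1, num_loci + 1):
            # Selector variables Si[locus,p] / Si[locus,m]: which parental
            # haplotype each allele was copied from.
            selector += [f"S{i}[{locus},p]", f"S{i}[{locus},m]"]
    return genetic, phenotype, selector

g, p, s = network_variables(["a", "b", "c"], ["c"], 2)
print(len(g), len(p), len(s))  # 12 6 4
```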

  12. Haplotyping – Specific to Superlink • Find the most likely assignment to all the genetic loci variables (orange nodes). • Method: use the MPE query, computed with a bucket elimination algorithm, which performs inference in a Bayesian network.

  13. Bucket Elimination Alg. for MPE Given an ordering of the variables X1,…,Xn: • Distribute P1,…,Pn into buckets B1,…,Bn: each Pi is placed in the bucket of the highest-ordered variable in its scope. • Backward part: process the buckets in reverse order, Bn → B1. • When processing bucket Bi, multiply all the probability tables in Bi and eliminate the bucket's variable Xi by keeping only the maximum joint-probability entry for each possible assignment to the other variables. • Store the value of Xi which maximizes the joint probability for each possible assignment to the other variables. • Place the resulting function in the bucket of the highest-ordered variable in its scope. • Forward part: process the buckets in the order B1 → Bn. • When processing bucket Bi (after choosing the partial assignment (x1,…,xi-1)), choose the value of Xi which was recorded in the backward phase together with this assignment.
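The following is a minimal, self-contained sketch of bucket elimination for MPE over discrete variables (illustrative only; Superlink's implementation is far more elaborate). Factors are explicit tables, evidence is applied up front, and the backward and forward parts mirror the two phases above.

```python
# Sketch of bucket elimination for MPE. A factor is (scope, table): scope is a
# tuple of variable names, table maps value tuples over the scope to probabilities.
from itertools import product

def restrict(factor, evidence):
    """Fix evidence variables inside a factor, dropping them from its scope."""
    scope, table = factor
    keep = [i for i, v in enumerate(scope) if v not in evidence]
    new_scope = tuple(scope[i] for i in keep)
    new_table = {}
    for key, p in table.items():
        if all(key[i] == evidence[v] for i, v in enumerate(scope) if v in evidence):
            new_table[tuple(key[i] for i in keep)] = p
    return new_scope, new_table

def multiply(factors, domains):
    """Pointwise product of factors over the union of their scopes."""
    scope = tuple(sorted({v for s, _ in factors for v in s}))
    table = {}
    for values in product(*(domains[v] for v in scope)):
        assign = dict(zip(scope, values))
        p = 1.0
        for s, t in factors:
            p *= t[tuple(assign[v] for v in s)]
        table[values] = p
    return scope, table

def max_out(factor, var):
    """Eliminate var by maximization, recording the argmax for each context."""
    scope, table = factor
    idx = scope.index(var)
    new_scope = scope[:idx] + scope[idx + 1:]
    new_table, argmax = {}, {}
    for key, p in table.items():
        context = key[:idx] + key[idx + 1:]
        if p > new_table.get(context, -1.0):
            new_table[context] = p
            argmax[context] = key[idx]
    return (new_scope, new_table), argmax

def mpe(factors, order, domains, evidence):
    factors = [restrict(f, evidence) for f in factors]
    recorded = []                      # one (var, scope, argmax) per backward step
    # Backward part: eliminate the variables from last to first in the order.
    for var in reversed(order):
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        if not bucket:
            continue                   # e.g. an evidence variable
        combined, argmax = max_out(multiply(bucket, domains), var)
        recorded.append((var, combined[0], argmax))
        factors.append(combined)
    # Forward part: read off the recorded maximizing values, first bucket first.
    assignment = dict(evidence)
    for var, scope, argmax in reversed(recorded):
        assignment[var] = argmax[tuple(assignment[v] for v in scope)]
    return assignment
```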

  14. Bayesian Network – Example Revisited • [Figure: the same network over A, B, C, D, E, F.] • P(A,B,C,D,E,F) = P(A)P(B|A)P(C|A)P(D|A,B)P(E|B,C)P(F|E)

  15. Example – MPE • Find x0 = (a,b,c,d,e,f) such that P(x0) = max over a,b,c,d,e of P(a,b,c,d,e,F=1). • Suppose the order A,C,B,E,D,F, and evidence that F=1. • The distribution into buckets is as follows: B1(A): P(A); B2(C): P(C|A); B3(B): P(B|A); B4(E): P(E|B,C); B5(D): P(D|A,B); B6(F): P(F|E).

  16. Example MPE (cont. 1) • To process B6(F): assign F=1, obtaining hF(E) = P(F=1|E). • Place hF(E) in bucket B4(E). • Record F=1. • The buckets are now: B1(A): P(A); B2(C): P(C|A); B3(B): P(B|A); B4(E): P(E|B,C), hF(E); B5(D): P(D|A,B); B6(F): F=1.

  17. Example MPE (cont. 2) • Process B5(D): compute hD(A,B) = max_d P(d|A,B). • Place hD(A,B) in bucket B3(B). • Record the maximizing values Dopt(a,b). • The buckets are now: B1(A): P(A); B2(C): P(C|A); B3(B): P(B|A), hD(A,B); B4(E): P(E|B,C), hF(E); B5(D): Dopt(a,b); B6(F): F=1.

  18. Example MPE (cont. 3) • Process B4(E): compute hE(B,C) = max_e P(e|B,C)·hF(e). • Place hE(B,C) in bucket B3(B). • Record Eopt(b,c). • The buckets are now: B1(A): P(A); B2(C): P(C|A); B3(B): P(B|A), hD(A,B), hE(B,C); B4(E): Eopt(b,c); B5(D): Dopt(a,b); B6(F): F=1.

  19. Example MPE (cont. 4) • Process B3(B): compute hB(A,C) = max_b P(b|A)·hD(A,b)·hE(b,C). • Place hB(A,C) in bucket B2(C). • Record Bopt(a,c). • The buckets are now: B1(A): P(A); B2(C): P(C|A), hB(A,C); B3(B): Bopt(a,c); B4(E): Eopt(b,c); B5(D): Dopt(a,b); B6(F): F=1.

  20. Example MPE (cont. 5) • Process B2(C): compute hC(A) = max_c P(c|A)·hB(A,c). • Place hC(A) in bucket B1(A). • Record Copt(a). • The buckets are now: B1(A): P(A), hC(A); B2(C): Copt(a); B3(B): Bopt(a,c); B4(E): Eopt(b,c); B5(D): Dopt(a,b); B6(F): F=1.

  21. Example MPE (cont. 6) • Compute the maximum value associated with A: max_a P(a)·hC(a). • Record the value of A which produced this maximum. • Then traverse the variables in the forward order (C, B, E, D), reading the recorded values Copt, Bopt, Eopt, Dopt, to determine the rest of the most probable assignment.
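As a hypothetical end-to-end run, the mpe() sketch given after slide 13 can be applied to the A–F network of this example, with the order A, C, B, E, D, F and evidence F=1; the CPT numbers below are made up, since the slides do not give any.

```python
# Hypothetical usage of the mpe() sketch above; the CPTs are filled with
# arbitrary numbers purely to make the example executable.
import random
from itertools import product

domains = {v: (0, 1) for v in "ABCDEF"}
parents = {"A": "", "B": "A", "C": "A", "D": "AB", "E": "BC", "F": "E"}

random.seed(0)
factors = []
for var, pa in parents.items():
    scope = tuple(pa) + (var,)
    table = {}
    for ctx in product(*(domains[p] for p in pa)):
        p0 = random.random()                 # arbitrary P(var = 0 | parents = ctx)
        table[ctx + (0,)] = p0
        table[ctx + (1,)] = 1.0 - p0
    factors.append((scope, table))

# The order A, C, B, E, D, F and the evidence F = 1, as in the slides.
print(mpe(factors, ["A", "C", "B", "E", "D", "F"], domains, {"F": 1}))
```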

  22. Project #2

  23. SimWalk2 • A statistical genetics computer application for haplotype, parametric linkage, non-parametric linkage (NPL), identity by descent (IBD) and mistyping analyses on any size of pedigree. • Performs approximate computations using Markov Chain Monte Carlo (MCMC) and simulated annealing algorithms.

  24. Project #2 • SimWalk2 requires four or five input files in order to run. These can be difficult for a non-expert to produce. • Goal of project #2: to create a friendly web-based user-interface for the program SimWalk2, using Java.

  25. Project #3

  26. Performing Approximate Inference • Algorithms for performing genetic linkage analysis are being improved constantly. • However: • Due to the enormous progress of the human genome project, data on many markers is available. • Markers are highly polymorphic. • Some disease models depend on multiple loci. • Sometimes a model is too large for performing exact inference.

  27. Project #3 • Currently Superlink performs exact inference. • Goal of project #3: to provide the means for performing approximate inference via a heuristic which ignores extreme markers when the computations are too demanding to be performed exactly.

  28. General Outline of Heuristic Algorithm • Begin with the total number of markers as specified in the input. • Determine an elimination order for the problem as is. • Check the complexity of the elimination order found: • If it is greater than some determined threshold, clip off one of the extreme markers and return to step 2. • Else, continue to compute the likelihood.
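A compact sketch of this loop is shown below; the helper routines (finding an elimination order, measuring its complexity, choosing which extreme marker to drop, and computing the likelihood) are hypothetical parameters, not existing Superlink functions.

```python
# Sketch of the heuristic outlined above (illustrative only). Markers are
# clipped one at a time until the elimination order found is cheap enough
# to evaluate exactly.
def approximate_likelihood(markers, threshold, find_elimination_order,
                           complexity, choose_extreme_marker, compute_likelihood):
    markers = list(markers)                    # step 1: all markers in the input
    while True:
        order = find_elimination_order(markers)          # step 2
        if complexity(order) <= threshold:               # step 3
            return compute_likelihood(markers, order)    # cheap enough: compute
        # Too complex: clip one extreme marker (which one is the open question
        # on the next slide) and try again on the smaller problem.
        markers.remove(choose_extreme_marker(markers))
```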

  29. Open Questions • Which marker to clip in each iteration? Some possible options are: • The marker farther from the disease locus. • The less informative marker. • A marker which is very close to its adjacent marker.

  30. Project #4

  31. Project #4 • Another project which deals with approximate inference in a different way… • Goal of project #4: to provide the means for performing approximate inference by using Iterative Join-Graph Propagation.

  32. Pearl’s Polytree Algorithm (BP – Belief Propagation) An exact inference algorithm for singly-connected networks. • Each node X computes BEL(x) = P(X=x|E), where E is the observed evidence, by combining messages from: • its children, and • its parents.

  33. Loopy Belief Propagation (Iterative-BP) • Uses Pearl’s polytree algorithm on a Bayesian network with loops. • Initialization: all messages are initialized to a vector of ones. • At each iteration: all nodes calculate their outgoing messages based on the incoming messages of their neighbors from the previous iteration. • Stopping condition: convergence of messages, i.e. none of the beliefs in successive iterations changed by more than a small threshold (e.g., 10^-4).
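A minimal sketch of loopy BP follows, written over a factor graph rather than in Pearl's parent/child message notation (an equivalent formulation); the all-ones initialization, the synchronous updates from the previous iteration's messages, and the belief-change stopping rule follow the slide, while the small usage example at the end uses made-up numbers.

```python
# Minimal sketch of loopy belief propagation on a factor graph (illustrative).
from itertools import product

def loopy_bp(factors, domains, max_iters=100, tol=1e-4):
    """factors: list of (scope, table); table maps value tuples over scope to values."""
    msgs = {}                            # keys: (factor index, var) and (var, factor index)
    for i, (scope, _) in enumerate(factors):
        for v in scope:
            msgs[(i, v)] = {x: 1.0 for x in domains[v]}
            msgs[(v, i)] = {x: 1.0 for x in domains[v]}

    def beliefs():
        out = {}
        for v in domains:
            b = {x: 1.0 for x in domains[v]}
            for i, (scope, _) in enumerate(factors):
                if v in scope:
                    for x in domains[v]:
                        b[x] *= msgs[(i, v)][x]
            z = sum(b.values()) or 1.0
            out[v] = {x: p / z for x, p in b.items()}
        return out

    old = beliefs()
    for _ in range(max_iters):
        new_msgs = {}
        for i, (scope, table) in enumerate(factors):
            for v in scope:
                # Variable -> factor: product of the *other* factors' messages to v.
                m = {x: 1.0 for x in domains[v]}
                for j, (sc2, _) in enumerate(factors):
                    if j != i and v in sc2:
                        for x in domains[v]:
                            m[x] *= msgs[(j, v)][x]
                new_msgs[(v, i)] = m
                # Factor -> variable: sum out the factor's other variables.
                others = [u for u in scope if u != v]
                m2 = {x: 0.0 for x in domains[v]}
                for x in domains[v]:
                    for rest in product(*(domains[u] for u in others)):
                        assign = dict(zip(others, rest))
                        assign[v] = x
                        val = table[tuple(assign[u] for u in scope)]
                        for u in others:
                            val *= msgs[(u, i)][assign[u]]
                        m2[x] += val
                new_msgs[(i, v)] = m2
        msgs = new_msgs
        new = beliefs()
        converged = all(abs(new[v][x] - old[v][x]) < tol
                        for v in domains for x in domains[v])
        old = new
        if converged:
            break
    return old

# Tiny usage on a two-variable chain A - B with made-up numbers (a tree, so the
# result is exact): P(B=0) should come out as 0.3*0.9 + 0.7*0.2 = 0.41.
doms = {"A": (0, 1), "B": (0, 1)}
fs = [(("A",), {(0,): 0.3, (1,): 0.7}),
      (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})]
print(loopy_bp(fs, doms)["B"])
```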

  34. Generalized Belief Propagation (GBP) • An extension of IBP towards being an anytime algorithm. • Can be significantly more accurate than ordinary IBP, at an adjustable increase in complexity. • Central idea: improve the approximation by clustering some of the network’s nodes into super-nodes and applying message passing between the super-nodes rather than between the original singleton nodes.

  35. Iterative Join-Graph Propagation (IJGP) • A special class of GBP (Generalized Belief Propagation) algorithms. • Pearl’s BP algorithm on trees was extended to a general propagation algorithm on trees of clusters – join-tree clustering (exact method). • IJGP extends this idea to a join-graph, by applying join-tree message passing over the join-graph, iteratively. • IJGP(i) – works on join-graphs having cluster size bounded by i variables. i allows the user to control the tradeoff between time and accuracy.

  36. Belief Network - BN A quadruple BN = <X, D, G, P>: • X = {X1,…,Xn} is a set of random variables. • D = {D1,…,Dn} is the set of corresponding domains. • G is a directed acyclic graph over X. • P = {p1,…,pn}, where pi = P(Xi | pai) (pai are the parents of Xi in G), is the set of probability tables.

  37. Join-Graph Decomposition A triple D = <JG, χ, ψ> for BN = <X, D, G, P>: • JG = (V, E) is a graph. • χ, ψ are functions which associate with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P, such that: • Each function pi ∈ P is associated with exactly one vertex v for which scope(pi) ⊆ χ(v). • (connectedness) For each variable Xi ∈ X, the set of vertices which are associated with it induces a connected sub-graph of JG.

  38. Join-Graph Decomposition: Example (in this case, a tree) • [Figure: a Bayesian network over A, B, C, D, E, F, G and a join-tree decomposition with four clusters.] • χ(1) = {A, B, C}, ψ(1) = {p(a), p(b|a), p(c|a,b)} • χ(2) = {B, C, D, F}, ψ(2) = {p(d|b), p(f|c,d)} • χ(3) = {B, E, F}, ψ(3) = {p(e|b, f)} • χ(4) = {E, F, G}, ψ(4) = {p(g|e, f)}
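The same decomposition can be written out as plain data, together with a check of the connectedness property from the previous slide (an illustrative sketch; the chain of arcs 1-2-3-4 is taken from the join-tree shown here and in slide 44).

```python
# The join-tree decomposition of the slide as plain data, plus a check that
# every variable's clusters induce a connected sub-graph (connectedness).
chi = {1: {"A", "B", "C"}, 2: {"B", "C", "D", "F"},
       3: {"B", "E", "F"}, 4: {"E", "F", "G"}}
psi = {1: ["p(a)", "p(b|a)", "p(c|a,b)"], 2: ["p(d|b)", "p(f|c,d)"],
       3: ["p(e|b,f)"], 4: ["p(g|e,f)"]}
edges = {(1, 2), (2, 3), (3, 4)}            # the tree 1 - 2 - 3 - 4

def connected(vertices, edges):
    """True if `vertices` induce a connected sub-graph."""
    if not vertices:
        return True
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for a, b in edges:
            if a == v and b in vertices:
                stack.append(b)
            if b == v and a in vertices:
                stack.append(a)
    return seen == set(vertices)

for x in sorted(set().union(*chi.values())):
    holders = {v for v, s in chi.items() if x in s}
    assert connected(holders, edges), f"{x} violates connectedness"
print("connectedness holds")
```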

  39. Arc-Labeled Join-Graph Decomposition • A quadruple D = <JG, χ, ψ, θ> for BN = <X, D, G, P>: • JG = (V, E) is a graph. • χ, ψ are functions which associate with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P. • θ associates with each edge (u,v) ∈ E the set θ((u,v)) ⊆ χ(u) ∩ χ(v), such that: • Each function pi ∈ P is associated with exactly one vertex v for which scope(pi) ⊆ χ(v). • (arc-connectedness) For each variable Xi, any 2 clusters containing Xi can be connected by a path whose every arc’s label contains Xi.

  40. Minimal Arc-Labeled Join-Graph Decomposition • An arc-labeled join graph is minimal if no variable can be deleted from any label while still satisfying the arc-connectedness property. • A minimal arc-labeled join-graph does not contain any cycle relative to any single variable.

  41. Definition - Eliminator • Given 2 adjacent vertices u and v of JG, the eliminator of u with respect to v includes all the variables that appear in u and don’t appear on the arc (u,v). elim(u,v) = χ(u) - θ((u,v)).

  42. Algorithm IJGP • Input: • An arc-labeled join-graph decomposition. • Evidence variables var(e). • Output: • An augmented graph whose nodes are clusters containing the original CPTs and the messages received from neighbors. • Approximations of P(Xi|e), for every Xi ∈ X.

  43. Algorithm IJGP – 1 iteration Apply message-passing in some topological order over the join graph, forward and back. When node u sends a message to a neighbor node v: • Compute individual functions: include in H(u,v) each function in cluster(u) whose scope doesn’t contain variables in elim(u,v). Denote by A the remaining functions. • Compute the combined function: h(u,v) = Σ over elim(u,v) of Π f∈A f. • Send all the functions to v: send h(u,v) and the individual functions H(u,v) to node v.
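A sketch of this message computation is given below (illustrative only; the function name and factor representation are mine, carried over from the earlier sketches). It takes the functions currently in cluster u, the cluster's variables χ(u), and the arc label θ((u,v)), and returns the individual functions H(u,v) together with the combined function h(u,v) summed over elim(u,v).

```python
# Compute the message sent from node u to a neighbor v in one IJGP step.
# Factors are (scope, table) pairs; chi_u and theta_uv are sequences of
# variable names, so elim(u,v) = chi(u) - theta(u,v).
from itertools import product

def send_message(cluster_u, chi_u, theta_uv, domains):
    elim = set(chi_u) - set(theta_uv)
    # Individual functions: those whose scope touches no eliminated variable.
    H = [f for f in cluster_u if not (set(f[0]) & elim)]
    A = [f for f in cluster_u if set(f[0]) & elim]
    # Combined function over the arc label, summing out elim(u,v).
    out_scope = tuple(theta_uv)
    table = {key: 0.0 for key in product(*(domains[w] for w in out_scope))}
    for full in product(*(domains[w] for w in chi_u)):
        assign = dict(zip(chi_u, full))
        val = 1.0
        for scope, t in A:
            val *= t[tuple(assign[w] for w in scope)]
        table[tuple(assign[w] for w in out_scope)] += val
    return H, (out_scope, table)
```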

  44. Execution of IJGP on a Join-Tree • [Figure: the join-tree of clusters ABC – BCDF – BEF – EFG (nodes 1–4), whose arcs are labeled BC, BF and EF; messages are passed forward and back along these arcs.]

  45. Algorithm IJGP – computing beliefs • Compute P(Xi, e) for every Xi ∈ X: • let u be a vertex in JG such that Xi ∈ χ(u), • compute: P(Xi, e) = Σ over χ(u) − {Xi} of Π f∈cluster(u) f, where cluster(u) includes all the functions in u, including messages sent from its neighbors.

  46. Bounded Join-Graphs • Join-graphs with cluster size bounded by i. • A partition-based approach to generate such decompositions: start from a given tree-decomposition and then partition the clusters until the decomposition has clusters bounded by i. • Goal: to control the complexity of IJGP. The time and space complexity of 1 iteration of IJGP(i) is exponential in i.

  47. Algorithm join-graph structuring(i) Output: a join-graph with cluster size bounded by i. • Apply procedure schematic mini-bucket(i). • Associate each resulting mini-bucket with a node in the join-graph. The variables of the node are those appearing in the mini-bucket. The original functions of the node are those in the mini-bucket. • Keep the arcs created by the procedure (out-edges) and label them by the regular separator. • Connect the mini-bucket clusters belonging to the same bucket in a chain by in-edges labeled by the single variable of the bucket.

  48. Procedure schematic mini-bucket(i) • Order the variables from X1 to Xn, and associate a bucket with each variable. • Place each CPT in the bucket of the highest-index variable in its scope. • For j=n to 1 do: • Partition the functions in bucket(Xj) into mini-buckets having at most i variables. • For each mini-bucket mb create a function (message) f with scope(f) = vars(mb) − {Xj}, and place it in the bucket of the highest-index variable in scope(f). mb needs to be connected with an arc to the bucket node of f (which will be created later).
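Below is a sketch of schematic mini-bucket(i) that tracks only variable scopes; the greedy partitioning rule is one of several valid choices (the slides do not fix one), so its output may differ in detail from the figure on the next slide. The usage at the end runs it on the CPT scopes of that example with i = 3.

```python
# Sketch of procedure schematic mini-bucket(i): only scopes are tracked, no
# numeric tables. Arcs are recorded at bucket granularity for simplicity.
def schematic_mini_bucket(order, cpt_scopes, i):
    pos = {x: k for k, x in enumerate(order)}       # position of each variable
    buckets = {x: [] for x in order}
    for scope in cpt_scopes:                        # place each CPT in the bucket
        buckets[max(scope, key=pos.get)].append(set(scope))  # of its highest variable
    mini_buckets, arcs = {}, []
    for x in reversed(order):                       # process buckets from Xn to X1
        mbs = []
        for scope in buckets[x]:
            for mb in mbs:                          # greedy: join an existing
                if len(mb | scope) <= i:            # mini-bucket if the bound holds
                    mb |= scope
                    break
            else:
                mbs.append(set(scope))
        mini_buckets[x] = mbs
        for mb in mbs:                              # message scope = vars(mb) - {x},
            msg = mb - {x}                          # placed in its highest bucket
            if msg:
                target = max(msg, key=pos.get)
                buckets[target].append(msg)
                arcs.append((x, target))
    return mini_buckets, arcs

# Hypothetical usage on the CPT scopes of the example on the next slide, i = 3.
order = ["A", "B", "C", "D", "F", "E", "G"]
scopes = [{"A"}, {"A", "B"}, {"A", "B", "C"}, {"B", "D"},
          {"C", "D", "F"}, {"B", "E", "F"}, {"E", "F", "G"}]
mbs, arcs = schematic_mini_bucket(order, scopes, 3)
print({x: [sorted(mb) for mb in m] for x, m in mbs.items()})
```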

  49. Build Join-Graph: Example • Bucket contents after applying schematic mini-bucket(3), processed from Xn = G down to X1 = A: G: (GFE); E: (EBF) (EF); F: (FCD) (BF); D: (DB) (CD); C: (CAB) (CB); B: (BA) (AB) (B); A: (A) (A), with the original CPTs P(G|F,E), P(E|B,F), P(F|C,D), P(D|B), P(C|A,B), P(B|A), P(A). • [Figure: (a) the mini-buckets and the arcs created by schematic mini-bucket(3); (b) the join-graph obtained after applying alg. join-graph structuring.]

  50. IJGP(i) - summary • As i is increased, the approximation becomes more accurate, but requires more time to process. • This yields the anytime behavior of the algorithm.
