Statistics for bioinformatics

Cuernavaca - Introduccion a bioinformatica Statistics for bioinformatics Jacques van HeldenJacques.van.Helden@ulb.ac.be

Statistics Applied to Bioinformatics Combinatorial analysis Jacques van HeldenJacques.van.Helden@ulb.ac.be

Problem - oligomers • How many oligomers contain exactly a single occurrence of each monomer, for oligonucleotides and oligopeptides, respectively ?

Permutations within a set - the factorial • How many distinct permutations can be made from a set of x elements ? • x = 2 2 • x = 3 3*2 = 6 • x = 4 4*3*2 = 24 • any x x*(x-1)...1 = x! • The factorialx!represents the number of possible permutations between x objects. • Solution to the problem of oligomers • There are 4!=24 distinct oligonucleotides with a single occurrence of each nucleotide (A, C, G, T) • There are 20!=2.4*1018 distinct oligopeptides with a single occurrence of each amino acid.

Problem - Selection of a subset of elements • A genome contains n=6000 genes. • We select a series of genes in the following way : • Once a gene has been selected once, it cannot be selected anymore (no replacement) • We are not interested in the order of the selection: if A and B were selected, we do not consider whether A came out in first or in second position. • How many possibilities do we have to select • 1 gene ? • 2 genes ? • 3 genes ? • x genes ?

Selection of a subset of elements • Number of possible outcomes • n size of the set • x size of the subset • Possible permutations among the elements of a subset • Number of distinct selections (orderless). • The coefficient Cxn represents the number of distinct choices of x elements among n. For this reason, it is called "Choose x among n". It is also called binomial coefficient (we will see later why).

Statistics Applied to Bioinformatics Elements of Probabilities Jacques van HeldenJacques.van.Helden@ulb.ac.be

Frequency of A Repetitions Frequential definition of probability • Trial • a random position is selected in the genome of Mycoplasma genitalium • Event • If the nucleotide at this position is a adenine (A), the trial is considered as a success • 1,000,000 successive trials were performed • The frequency of success is plotted as a function of the number of trials • The frequency progressively converges towards the value of 0.35 • The probability is the value that would be obtained with an infinite number of trials Selection of random nucleotides in a genome

Mutually exclusive events • E.g.: Calculating degenerate nucleotide probability • P(W) = P(A  T) = P(A) + P(T) • P(S) = P(C G) = P(C) + P(G) Where  is the logical OR  is the logical AND

Complementary events • E.g.: coding / non-coding sequences in Mycobacterium genitalium • P(coding) = 0.902 • P(coding non-coding) = 1 P(non-coding) = 1 - P(coding) = 0.098 • Example: Probability of the degenerate nucleotide N (in this example we combine the properties of complementarity and mutual exclusion) • P(N) = P(A) + P(T) + P(C) + P(G) = 1

Stochastic independence • Two events A and B are said stochastically independent when • P(A|B) = P(A|!B) = P(A) • P(B|A) = P(B|!A) = P(B) • For stochastically independent events, the joint probability is the product of probabilities • P(A B) = P(A)P(B) • E.g.: calculating oligonucleotide probability with a model of independent succession of nucleotides • P(GATAAG) = P(G) * P(A) * P(T) * P(A) * P(A) * P(G) • Note : this is not appropriate for biological sequences, where there are strong dependencies between neighbour nucleotides (see next slide)

Non necessarily independent events P(A  B) = P(A) + P(B) – P(A  B) P(A B C) = P(A) + P(B) + P(C) – P(A B) – P(A C) –P(B C)+ P(A B C)

Probabilities - Exercises • Assuming identically and independently distributed nucleotides, calculate the probability to observe an hexanucleotide containing exactly 4 As and 2 Cs, irrespective of the relative positions of these residues.

Probabilities - Exercises • Assuming a DNA sequence with independently and identicallydistributed nucleotides, calculate the probabilities of the followingoligonucleotides. • A • AA • AAAA • AAAAAA • CCCCCC • CACACA • CANNTG • CACGTK • Calculate the probabilities of the same oligonucleotides, assuming that nucleotides are independentlydistributed, but have the following prior probabilities • P(A) = 0.31; P(T) = 0.29 ; P(C) = 0.19; P(G) = 0.21

Statistics for bioinformatics Set comparisons Jacques van HeldenJacques.van.Helden@ulb.ac.be

Aspartate biosynthesis L-Aspartate ATP Aspartate kinase HOM3 2.7.2.4 ADP L-aspartyl-4-P NADPH Aspartate semialdehyde deshydrogenase HOM2 1.2.1.11 NADP+; Pi L-aspartic semialdehyde NADPH Homoserine deshydrogenase HOM6 1.1.1.3 NADP+ Threonine biosynthesis MET31 MET32 Met31pmet32p L-Homoserine AcetlyCoA Homoserine O-acetyltransferase MET2 2.3.1.31 CoA O-acetyl-homoserine Sulfur assimilation Sulfide O-acetylhomoserine (thiol)-lyase MET17 4.2.99.10 MET28 Homocysteine Cbf1p/Met4p/Met28p complex CBF1 Cysteine biosynthesis MET4 5-methyltetrahydropteroyltri-L-glutamate Methionine synthase (vit B12-independent) GCN4 Gcn4p MET6 2.1.1.14 5-tetrahydropteroyltri-L-glutamate L-Methionine MET30 Met30p S-adenosyl-methionine synthetase I SAM1 H20; ATP 2.5.1.6 S-adenosyl-methionine synthetase II Pi, PPi SAM2 S-Adenosyl-L-Methionine Methionine Biosynthesis in S.cerevisiae

The "bio-ontologies" • Answer to the problem of inconsistencies in database annotations • Controlled vocabulary • Hierarchical classification between the terms of the controlled vocabulary • E.g.: The Gene Ontology • molecular function ontology • process ontology • cellular component ontology

Gene ontology: processes

Gene ontology: molecular functions

Gene ontology: cellular components

Fragment of the GO annotations for the yeast Saccharomyces cerevisiae

Problem - selection within a set with classes • A given organism has 6,000 genes, among which 40 are involved in methionine metabolism. • A set of 10 genes are co-regulated in a microarray experiment. • Among them, 6 are related to methionine metabolism. • What would be the probability to observe such a correspondence by chance alone ? Methionine Co-regulated 34 6 4 Genome (6000)

Selection within a set with classes • Let us define • g = 6000 number of genes • m = 40 genes involved in methionine metabolism • n = 5960 genes not involved in methionine metabolism • k = 10 number of genes in the cluster • x = 6 number of methionine genes in the cluster • We calculate the number of possibilities for the following selections • C1: 10 distinct genes among 6,000 • C2: 6 distinct genes among the 40 involved in methionine • C3: 4 genes among the 5960 which are not involved in methionine • C4: 6 methionine and 4 non-methionine genes • Probability to have exactly 6 methionine genes within a selection of 10 • Probability to have at least 6 methionine genes within a selection of 10

The hypergeometric distribution • The hypergeometric distribution represents the probability to observe x successes in a sampling without replacement • m number of possible successes • n number of possible failures • k sample size • x number of successes in the sample • The shape of the distribution depends on the ratio between m and n • m << n i-shaped • m ~ n bell-shaped • m >> n j-shaped • The distribution is bounded on both sides (0  x  k) • Statistical parameters

Statistics Applied to Bioinformatics Multi-testing corrections Jacques van HeldenJacques.van.Helden@ulb.ac.be

Bonferoni rule • Multi-testing • Assessing the significance of each gene on a chip represents thousands of simultaneous tests. Let N be the number of genes. • The risk of error (P-value) associated to each gene will thus be challenged N times. • The significance thresholds used for single testing (0.01, 0.001) are thus likely to return many false positive. • Bonferoni rule • Adapt the threshold to the number of simultaneous tests.

E-value • An alternative but equivalent way to treat the problem of multi-testing is to calculate the expected value for each observation. • One can then choose the E-value according to the number of false positive considered as acceptable.

Family-wise Error Rate (FWER) • Another correction for multiple testing consists in estimating the probability to observe at least one false positive in the whole set of tests. This probability can be calculated quite easily from the P-value (Pval).

False Discovery Rate (FDR) • Yet another approach is to consider, for a given threshold on P-value, the False Discovery Rate, i.e. the proportion of false predictions within a set of predictions.

Summary - Multi-testing corrections • Bonferoni rule adapt significance threshold • E-value expected number of false positives • FWER Family-wise error rate: probability to observe at least one false positive • FDR False discovery rate: estimated rate of false positives among the predictions

Compare-classes result

Statistics Applied to Bioinformatics Application: comparing many gene sets with many gene sets Jacques van HeldenJacques.van.Helden@ulb.ac.be

RegulonDB factor -> gene network Factor-gene graph • RegulonDB (Oct. 2005 version) • The graph represents the relationships between factors and their target genes (factor -> gene graph) • 125 transcription factors • 467 target genes • 847 factor->gene interactions • 45 self-regulations • Note: CRP alone regulates 132 target genes.

Application : yeast protein complexes • High-throughput experiments led to the identification of thousands of protein complexes in the yeast Saccharomyces cerevisiae. • Note: These high-throughput experiments are likely to contain a given rate of noise • Question: • can we associate specific functions to these complexes ? • Approach: • Compare the gene composition of each complex with the functional classes associated in the Gene Ontology.

Yeast complexes versus GO

Questions • Can we detect pairs of transcription factors which co-regulate a significant number of genes ? • Are these factors (and their common target genes) invovled in particular biological functions ?

Regulons versus regulons ; gene factor GI alkA Ada 1788383 ada Ada 1788542 aidB Ada 1790630 adiA AdiY 1790558 agaZ AgaR 1789520 agaS AgaR 1789525 alsR AlsR 1790527 alsI AlsR 1790528 hyaA AppY 1787206 appC AppY 1787212 araB AraC 1786249 araC AraC 1786251 araJ AraC 1786595 araF AraC 1788211 araE AraC 1789207 caiT ArcA 1786224 lpdA ArcA 1786307 betI ArcA 1786505 betT ArcA 1786506 cyoA ArcA 1786635 dcuC ArcA 1786839 gltA ArcA 1786939 sdhC ArcA 1786940 sucA ArcA 1786945 cydA ArcA 1786953 moeA ArcA 1787049 focA ArcA 1787132 hyaA ArcA 1787206 ptsG ArcA 1787343 ndh ArcA 1787352 icdA ArcA 1787381 hemA ArcA 1787461 acnA ArcA 1787531 tpx ArcA 1787584 ...

Regulons versus functional classes ; gene factor GI alkA Ada 1788383 ada Ada 1788542 aidB Ada 1790630 adiA AdiY 1790558 agaZ AgaR 1789520 agaS AgaR 1789525 alsR AlsR 1790527 alsI AlsR 1790528 hyaA AppY 1787206 appC AppY 1787212 araB AraC 1786249 araC AraC 1786251 araJ AraC 1786595 araF AraC 1788211 araE AraC 1789207 caiT ArcA 1786224 lpdA ArcA 1786307 betI ArcA 1786505 betT ArcA 1786506 cyoA ArcA 1786635 dcuC ArcA 1786839 gltA ArcA 1786939 sdhC ArcA 1786940 sucA ArcA 1786945 cydA ArcA 1786953 moeA ArcA 1787049 focA ArcA 1787132 hyaA ArcA 1787206 ptsG ArcA 1787343 ndh ArcA 1787352 icdA ArcA 1787381 hemA ArcA 1787461 acnA ArcA 1787531 tpx ArcA 1787584 ...

Microarray groups (Gasch, 2000) versus Gene Ontology

Exercises • The nucleotides frequencies were computed for a given genomeF(A) = F(T) = 0.35 F(G) = F(C) = 0.15 • What is the probability to observe the following motif GATWNNHT at a given position of this genome ? • W means “A or T” • H means “not G” • N means “any nucleotide” • Justify the answer in term of combinations of events. • We searched the orthologs of two genes (A and B) in 80 bacterial genomes. • An ortholog was found for gene A in 50 genomes. • An ortholog was found for gene B in 60 genomes. • Among these, 40 genomes contain orthologs for both A and B. • Can we consider that A and B are found in common in a significant number of genomes ? • Give the name of the distribution which allows to model this type of situation. • Indicate the general formula which would allow to estimate this significance (the P-value). • Explain what each term of this formula represents. • In the formula, replace each variable by the appropriate number(you don’t need to calculate the final value). • What does the P-value represent ? How would you interpret a P-value of 15% in terms of risks ?

Answer to the first question • The nucleotides frequencies were computed for a given genomeF(A) = F(T) = 0.35 F(G) = F(C) = 0.15 • What is the probability to observe the following motif GATWNNHT at a given position of this genome ? • P(W) = P(A) + P(T) = 0.7 • P(N) = 1 • P(H) = 1 - P(G) = 0.85 • P(GATWNNHT) = P(G)P(A)P(T)2P(W)P(N)2P(H) = 0.15*0.35*0.35^2*0.7*0.85 = 0.00383 • Justify the answer in term of combinations of events. • P(W) = P(A) + P(T) because A and T are mutually exclusive at a same position • P(N) = 1 because A, C, G, T are complementary events (there is no other possible event) • P(H) = 1 - P(G) because H is the complementary event of G (not G) • P(GATWNNHT) = P(G)P(A)P(T)2P(W)P(N)2P(H) because at the nucleotides found at successive positions are mutually independent -> the joined probability is the product of probabilities.

We searched the orthologs of two genes (A and B) in 80 bacterial genomes. An ortholog was found for gene A in 50 genomes. An ortholog was found for gene B in 60 genomes. Among these, 40 genomes contain orthologs for both A and B. Can we consider that A and B are found in common in a significant number of genomes ? Give the name of the distribution which allows to model this type of situation. Indicate the general formula which would allow to estimate this significance (the P-value). Explain what each term of this formula represents. In the formula, replace each variable by the appropriate number(you don’t need to calculate the final value). What does the P-value represent ? How would you interpret a P-value of 14% in terms of risks ? The Hypergeometric distribution. The P-value is calculated with the inverse CDF (the right tail of the distribution). See below. Terms n1=60 number of labelled elements (genomes with an ortholog for gene A) n2=20 number of unlabelled elements (genomes without ortholog for gene A) n=50 the size of the selection (genomes with an ortholog for gene B). x=40 the number of labelled elements in the selection (genomes with orthologs for both A and B) See below. Answer to question 2

Statistics for bioinformatics

Statistics for bioinformatics

Presentation Transcript

Tools for BioInformatics

Statistics for bioinformatics

Statistics in Bioinformatics

Bioinformatics approaches for…

Computational Statistics with Application to Bioinformatics

Bioinformatics for Research

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Bioinformatics for Biologists

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Statistics for Molecular Biology and Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Portals for Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Computational Statistics with Application to Bioinformatics

Statistics for Molecular Biology and Bioinformatics