
Motif Finding


lorene

Presentation Transcript


  1. Motif Finding PSSMs Expectation Maximization Gibbs Sampling

  2. Complexity of Transcription

  3. Representing Binding Sites for a TF
     • A single site: AAGTTAATGA
     • A set of sites represented as a consensus: VDRTWRWWSHD (IUPAC degenerate DNA)
     • A matrix describing a set of sites:
         A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
         C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
         G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
         T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
     • Set of binding sites:
         AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA
         CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA
         AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA
         AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA

  4. Nucleic acid codes

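  5. From frequencies to log scores
     • f matrix (counts):
         A   5  0  1  0  0
         C   0  2  2  4  0
         G   0  3  1  0  4
         T   0  0  1  1  1
     • w matrix (log scores):
         A   1.6 -1.7 -0.2 -1.7 -1.7
         C  -1.7  0.5  0.5  1.3 -1.7
         G  -1.7  1.0 -0.2 -1.7  1.3
         T  -1.7 -1.7 -0.2 -0.2 -0.2
     • Each w entry is a Log() of the count plus a pseudocount term, f(b,i) + s(N), taken relative to the background probability p(b).
     • Example: score of TGCTG = 0.9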
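The f and w matrices on slide 5 can be tied together with a short sketch. The slide does not state the pseudocount, so the value s = sqrt(N)/4 per base used here is an assumption: it is chosen because, with a uniform background p(b) = 0.25, it reproduces the w values shown to one decimal place. The variable names are illustrative.

```matlab
% Count matrix f from slide 5 (rows A, C, G, T; columns = positions 1..5).
f = [5 0 1 0 0;
     0 2 2 4 0;
     0 3 1 0 4;
     0 0 1 1 1];
N    = sum(f(:, 1));   % number of sites (5)
p_bg = 0.25;           % ASSUMED uniform background p(b)
s    = sqrt(N) / 4;    % ASSUMED pseudocount per base (not stated on the slide)

% Pseudocount-corrected frequency, then log2 ratio against the background.
p = (f + s) ./ (N + 4 * s);
w = log2(p ./ p_bg);

% Score a candidate site by summing one matrix entry per position.
alphabet = 'ACGT';
site = 'TGCTG';
[~, rows] = ismember(site, alphabet);          % row index of each base
idx   = sub2ind(size(w), rows, 1:numel(site)); % (base, position) pairs
score = sum(w(idx));
fprintf('score(%s) = %.1f\n', site, score);
```

Under these assumptions the summed score for TGCTG rounds to the 0.9 quoted on the slide.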

  6. TFs do not act alone http://www.bioinformatics.ca/

  7. PSSMs for Liver TFs… HNF3 HNF1 HNF4 C/EBP

  8. PSSMs for Helix-Turn-Helix Motif

  9. Promoter…

  10. Promoter Weight Matrices (PWM)

  11. E.Coli PWMs

  12. Motif Logo
      • Motifs can mutate on less important bases.
      • The five motifs below have mutations in positions 3 and 5.
      • Representations called motif logos illustrate the conserved regions of a motif.
          Position: 1234567
                    TGGGGGA
                    TGAGAGA
                    TGGGGGA
                    TGAGAGA
                    TGAGGGA
      http://weblogo.berkeley.edu
      http://fold.stanford.edu/eblocks/acsearch.html

  13. Example: Calmodulin-Binding Motif (calcium-binding proteins)

  14. Sequence Motifs http://webcourse.cs.technion.ac.il/236523/Winter2005-2006/en/ho_Lectures.html

  15. Regulatory Motifs • Transcription Factors bind to regulatory motifs • Motifs are 6 – 20 nucleotides long • Activators and repressors • Usually located near target gene, mostly upstream

  16. Challenges • How to recognize a regulatory motif? • Can we identify new occurrences of known motifs in genome sequences? • Can we discover new motifs within upstream sequences of genes?

  17. Motif Representation
      • Exact motif: CGGATATA
      • Consensus: represent only deterministic nucleotides.
      • Example: HAP1 binding sites in 5 sequences.
          CGGATATACCGG
          CGGTGATAGCGG
          CGGTACTAACGG
          CGGCGGTAACGG
          CGGCCCTAACGG
          ------------
          CGGNNNTANCGG
      • Consensus motif: CGGNNNTANCGG
      • N stands for any nucleotide.
      • Representing only the consensus loses information. How can this be avoided?
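As a minimal sketch of the consensus rule on slide 17, the code below keeps a base only where all five HAP1 sites agree and writes N elsewhere. It assumes the sites are already aligned and of equal length; the variable names are illustrative.

```matlab
% HAP1 binding sites from slide 17, one aligned site per row.
sites = ['CGGATATACCGG';
         'CGGTGATAGCGG';
         'CGGTACTAACGG';
         'CGGCGGTAACGG';
         'CGGCCCTAACGG'];

% A column is "deterministic" when every site matches the first site there.
firstRow  = repmat(sites(1, :), size(sites, 1), 1);
conserved = all(sites == firstRow, 1);

% Keep conserved bases, mask every variable column with N.
consensus = sites(1, :);
consensus(~conserved) = 'N';
disp(consensus)
```

The result hides how often each base occurs in the variable columns, which is the information loss the slide asks about.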

  18. PSPM – Position Specific Probability Matrix • Represents a motif of length k (here k = 5) • Count the number of occurrences of each nucleotide at each position

  19. PSPM – Position Specific Probability Matrix • Defines Pi{A,C,G,T} for i={1,..,k}. • Pi (A) – frequency of nucleotide A in position i.

  20. Identification of Known Motifs within Genomic Sequences • Motivation: • identification of new genes controlled by the same TF. • Infer the function of these genes. • enable better understanding of the regulation mechanism.

  21. PSPM – Position Specific Probability Matrix • Each k-mer is assigned a probability. • Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014

  22. Detecting a Known Motif within a Sequence using PSPM • The PSPM is moved along the query sequence. • At each position the sub-sequence is scored for a match to the PSPM. • Example: sequence = ATGCAAGTCT…

  23. Detecting a Known Motif within a Sequence using PSPM • The PSPM is moved along the query sequence. • At each position the sub-sequence is scored for a match to the PSPM. • Example: • sequence = ATGCAAGTCT… • Position 1: ATGCA 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4

  24. Detecting a Known Motif within a Sequence using PSPM • The PSPM is moved along the query sequence. • At each position the sub-sequence is scored for a match to the PSPM. • Example: • sequence = ATGCAAGTCT… • Position 1: ATGCA 0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4 • Position 2: TGCAA 0.5*0.25*0.8*0.7*0.6 = 0.042
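The sliding-window scan on slides 22 to 24 can be sketched as follows. The transcript does not include the PSPM itself, so the matrix below is hypothetical: it keeps every probability quoted on the slides and fills the remaining entries with assumed values so that each column sums to 1.

```matlab
% Hypothetical PSPM (rows A, C, G, T; columns = positions 1..5).
% Entries quoted on the slides are kept; all others are ASSUMED fill-ins.
P = [0.10 0.25 0.05 0.70 0.60;
     0.30 0.25 0.80 0.10 0.10;
     0.10 0.25 0.10 0.10 0.20;
     0.50 0.25 0.05 0.10 0.10];

seq = 'ATGCAAGTCT';                 % query sequence from the slides
k   = size(P, 2);                   % motif length
[~, rows] = ismember(seq, 'ACGT');  % row index of each base in the query

% Slide the matrix along the query and multiply one entry per position.
nWin = numel(seq) - k + 1;
prob = zeros(1, nWin);
for i = 1:nWin
    idx = sub2ind(size(P), rows(i:i+k-1), 1:k);
    prob(i) = prod(P(idx));
end
disp(prob)
```

Only the first two windows (ATGCA and TGCAA) depend solely on values quoted on the slides; later windows also use the assumed entries.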

  25. Detecting a Known Motif within a Sequence using PSSM • Is it a random match, or is it indeed an occurrence of the motif? • PSPM -> PSSM (Position Specific Scoring Matrix) • Odds score matrix: Oi(n), where n ∈ {A,C,G,T} and i = 1,..,k • Defined as Pi(n)/P(n), where P(n) is the background frequency of nucleotide n. • A larger Oi(n) means higher odds that n at position i is part of a real motif.

  26. PSSM as Odds Score Matrix • Assumption: the background frequency of each nucleotide is 0.25. • Original PSPM (Pi): • Odds Matrix (Oi): • Going to log scale we get an additive score, the Log odds Matrix (log2 Oi):

  27. Calculating using Log Odds Matrix • Log-odds score ≤ 0 implies random match; log-odds score > 0 implies real match (?). • Example: sequence = ATGCAAGTCT… • Position 1: ATGCA: -1.32 + 0 - 1.32 - 1.32 + 1.26 = -2.7; odds = 2^-2.7 = 0.15 • Position 2: TGCAA: 1 + 0 + 1.68 + 1.48 + 1.26 = 5.42; odds = 2^5.42 = 42.8
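The log-odds arithmetic on slide 27 can be checked with a small sketch. It assumes the uniform background of 0.25 from slide 26 and reuses a hypothetical PSPM in which only the entries quoted on the slides are taken from the source; the rest are assumed fill-ins.

```matlab
% Hypothetical PSPM (rows A, C, G, T); unquoted entries are ASSUMED.
P = [0.10 0.25 0.05 0.70 0.60;
     0.30 0.25 0.80 0.10 0.10;
     0.10 0.25 0.10 0.10 0.20;
     0.50 0.25 0.05 0.10 0.10];
logO = log2(P / 0.25);          % log-odds matrix, background 0.25

% Score the first two windows of the slide's query sequence.
windows = {'ATGCA', 'TGCAA'};
score = zeros(1, numel(windows));
for w = 1:numel(windows)
    [~, rows] = ismember(windows{w}, 'ACGT');
    idx = sub2ind(size(logO), rows, 1:numel(rows));
    score(w) = sum(logO(idx));  % log scale makes the score additive
end
odds = 2 .^ score;              % back from bits to an odds ratio
fprintf('%s: score %.2f, odds %.2f\n', windows{1}, score(1), odds(1));
fprintf('%s: score %.2f, odds %.2f\n', windows{2}, score(2), odds(2));
```

Without intermediate rounding the odds for TGCAA come out near 43.0; the slide's 42.8 follows from rounding the summed score to 5.42 before exponentiating.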

  28. Calculating the probability of a match • Sequence: ATGCAAG • Position 1: ATGCA, odds = 0.15 • Position 2: TGCAA, odds = 42.8 • Position 3: GCAAG, odds = 0.18 • P(i) = S(i) / ∑ S, where S(i) is the odds score at position i • P(1) = 0.003, P(2) = 0.993, P(3) = 0.004 • Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003
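The normalization on slide 28 is a single division, sketched here with the odds values from the slide's worked example. The variable names are illustrative.

```matlab
% Odds scores for the three windows of ATGCAAG, as used in the slide's example.
S = [0.15 42.8 0.18];

% Normalize so the per-position values sum to 1.
P = S / sum(S);
fprintf('P(%d) = %.4f\n', [1:numel(P); P]);
```

The values agree with the slide to within rounding in the third decimal place.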

  29. Building a PSSM • Collect all known sequences that bind a certain TF. • Align all sequences (using multiple sequence alignment). • Compute the frequency of each nucleotide in each position (PSPM). • Incorporate background frequency for each nucleotide (PSSM).

  30. Finding new Motifs • We are given a group of genes, which presumably contain a common regulatory motif. • We know nothing of the TF that binds to the putative motif. • The problem: discover the motif.

  31. Example Predicting the cAMP Receptor Protein (CRP) binding site motif

  32. Extract experimentally defined CRP Binding Sites GGATAACAATTTCACA AGTGTGTGAGCGGATAACAA AAGGTGTGAGTTAGCTCACTCCCC TGTGATCTCTGTTACATAG ACGTGCGAGGATGAGAACACA ATGTGTGTGCTCGGTTTAGTTCACC TGTGACACAGTGCAAACGCG CCTGACGGAGTTCACA AATTGTGAGTGTCTATAATCACG ATCGATTTGGAATATCCATCACA TGCAAAGGACGTCACGATTTGGG AGCTGGCGACCTGGGTCATG TGTGATGTGTATCGAACCGTGT ATTTATTTGAACCACATCGCA GGTGAGAGCCATCACAG GAGTGTGTAAGCTGTGCCACG TTTATTCCATGTCACGAGTGT TGTTATACACATCACTAGTG AAACGTGCTCCCACTCGCA TGTGATTCGATTCACA

  33. Create a Multiple Sequence Alignment GGATAACAATTTCACA TGTGAGCGGATAACAA TGTGAGTTAGCTCACT TGTGATCTCTGTTACA CGAGGATGAGAACACA CTCGGTTTAGTTCACC TGTGACACAGTGCAAA CCTGACGGAGTTCACA AGTGTCTATAATCACG TGGAATATCCATCACA TGCAAAGGACGTCACG GGCGACCTGGGTCATG TGTGATGTGTATCGAA TTTGAACCACATCGCA GGTGAGAGCCATCACA TGTAAGCTGTGCCACG TTTATTCCATGTCACG TGTTATACACATCACT CGTGCTCCCACTCGCA TGTGATTCGATTCACA

  34. Generate a PSSM
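A minimal sketch of this step, applied to the 20 aligned CRP sites from slide 33. It follows the recipe from slide 29 and assumes a pseudocount of 1 per base and a uniform background of 0.25, as in the Matlab walkthrough later in the deck; neither value is stated for the CRP example itself.

```matlab
% Aligned CRP binding sites from slide 33 (20 sites, 16 columns).
sites = ['GGATAACAATTTCACA'; 'TGTGAGCGGATAACAA'; 'TGTGAGTTAGCTCACT';
         'TGTGATCTCTGTTACA'; 'CGAGGATGAGAACACA'; 'CTCGGTTTAGTTCACC';
         'TGTGACACAGTGCAAA'; 'CCTGACGGAGTTCACA'; 'AGTGTCTATAATCACG';
         'TGGAATATCCATCACA'; 'TGCAAAGGACGTCACG'; 'GGCGACCTGGGTCATG';
         'TGTGATGTGTATCGAA'; 'TTTGAACCACATCGCA'; 'GGTGAGAGCCATCACA';
         'TGTAAGCTGTGCCACG'; 'TTTATTCCATGTCACG'; 'TGTTATACACATCACT';
         'CGTGCTCCCACTCGCA'; 'TGTGATTCGATTCACA'];

% Count each nucleotide in each column (rows A, C, G, T).
alphabet = 'ACGT';
counts = zeros(4, size(sites, 2));
for b = 1:4
    counts(b, :) = sum(sites == alphabet(b), 1);
end

% ASSUMED: pseudocount of 1 per base and background frequency 0.25.
pseudo = counts + 1;
freq   = pseudo ./ repmat(sum(pseudo, 1), 4, 1);  % PSPM
pssm   = log2(freq / 0.25);                       % log-odds PSSM
disp(counts)
```

Positive entries in pssm mark bases that are enriched relative to the background at that column.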

  35. Shannon Entropy • Expected variation per column can be calculated • Low entropy means higher conservation

  36. Entropy • The entropy (H) for a column is: • a: is a residue, • fa: frequency of residue a in a column, • pa : probability of residue a in that column
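The formula itself did not survive in this transcript. As a hedged illustration, the sketch below uses the standard Shannon entropy, H = -sum over a of p_a * log2(p_a), with p_a estimated by the observed column frequency; whether the slide used exactly this form is an assumption. The HAP1 sites from slide 17 serve as input.

```matlab
% HAP1 sites from slide 17, used here only as example input.
sites = ['CGGATATACCGG';
         'CGGTGATAGCGG';
         'CGGTACTAACGG';
         'CGGCGGTAACGG';
         'CGGCCCTAACGG'];

% Per-column frequency of each nucleotide (rows A, C, G, T).
alphabet = 'ACGT';
p = zeros(4, size(sites, 2));
for b = 1:4
    p(b, :) = sum(sites == alphabet(b), 1) / size(sites, 1);
end

% Standard Shannon entropy per column, in bits.
terms = p .* log2(p);
terms(p == 0) = 0;      % convention: 0 * log2(0) = 0 (avoids NaN)
H = -sum(terms, 1);
disp(H)
```

Fully conserved columns such as the leading CGG give H = 0, while variable columns give larger values, up to 2 bits for four equally likely nucleotides.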

  37. Entropy • entropy measures can determine which evolutionary distance (PAM250, BLOSUM80, etc) should be used • Entropy yields amount of information per column (discussed with sequence logos in a bit)

  38. Log-odds score • Profiles can also indicate log-odds score: • Log2(observed:expected) • Result is a bit score

  39. Matlab • multialign
      1. Enter an array of sequences.
         seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
      2. Promote terminations with gaps in the alignment.
         multialign(seqs,'terminalGapAdjust',true)
         ans =
         --CACGTAACATCTC--
         ACGACGTAACATCTTCT
         -AAACGTAACATCTCGC

  40. Matlab
      3. Compare alignment without termination gap adjustment.
         multialign(seqs)
         ans =
         CA--CGTAACATCT--C
         ACGACGTAACATCTTCT
         AA-ACGTAACATCTCGC

  41. Matlab
      >> a={'ATATAGGAG','AATTATAGA','TTAGAGAAA'}
      a =
          'ATATAGGAG'    'AATTATAGA'    'TTAGAGAAA'

  42. Char function
      >> cseq=char(a)
      cseq =
      ATATAGGAG
      AATTATAGA
      TTAGAGAAA

  43. Double function
      >> intseq=double(cseq)
      intseq =
          65  84  65  84  65  71  71  65  71
          65  65  84  84  65  84  65  71  65
          84  84  65  71  65  71  65  65  65

  44. double
      >> double('A')
      ans = 65
      >> double('C')
      ans = 67
      >> double('G')
      ans = 71
      >> double('T')
      ans = 84

  45. Initiate PSPM matrix
      >> Pspm=zeros(4,length(intseq))
      Pspm =
          0  0  0  0  0  0  0  0  0
          0  0  0  0  0  0  0  0  0
          0  0  0  0  0  0  0  0  0
          0  0  0  0  0  0  0  0  0

  46. Use a for loop to count each nucleotide at each position
      >> for i = 1:length(intseq)
             Pspm(1,i)=length(find(intseq(:,i)==65));
             Pspm(2,i)=length(find(intseq(:,i)==67));
             Pspm(3,i)=length(find(intseq(:,i)==71));
             Pspm(4,i)=length(find(intseq(:,i)==84));
         end
      >> Pspm
      Pspm =
          2  1  2  0  3  0  2  2  2
          0  0  0  0  0  0  0  0  0
          0  0  0  1  0  2  1  1  1
          1  2  1  2  0  1  0  0  0

  47. Add pseudocounts
      >> Pspmp=Pspm+1
      Pspmp =
          3  2  3  1  4  1  3  3  3
          1  1  1  1  1  1  1  1  1
          1  1  1  2  1  3  2  2  2
          2  3  2  3  1  2  1  1  1

  48. Normalize to get frequencies
      >> Pspmnorm=Pspmp./repmat(sum(Pspmp),4,1)
      Pspmnorm =
        Columns 1 through 7
          0.4286  0.2857  0.4286  0.1429  0.5714  0.1429  0.4286
          0.1429  0.1429  0.1429  0.1429  0.1429  0.1429  0.1429
          0.1429  0.1429  0.1429  0.2857  0.1429  0.4286  0.2857
          0.2857  0.4286  0.2857  0.4286  0.1429  0.2857  0.1429
        Columns 8 through 9
          0.4286  0.4286
          0.1429  0.1429
          0.2857  0.2857
          0.1429  0.1429

  49. Calculate odds score
      >> Pswm=Pspmnorm/0.25
      Pswm =
        Columns 1 through 7
          1.7143  1.1429  1.7143  0.5714  2.2857  0.5714  1.7143
          0.5714  0.5714  0.5714  0.5714  0.5714  0.5714  0.5714
          0.5714  0.5714  0.5714  1.1429  0.5714  1.7143  1.1429
          1.1429  1.7143  1.1429  1.7143  0.5714  1.1429  0.5714
        Columns 8 through 9
          1.7143  1.7143
          0.5714  0.5714
          1.1429  1.1429
          0.5714  0.5714

  50. Log odds ratio
      >> logPswm=log2(Pswm)
      logPswm =
        Columns 1 through 7
          0.7776  0.1926  0.7776 -0.8074  1.1926 -0.8074  0.7776
         -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074 -0.8074
         -0.8074 -0.8074 -0.8074  0.1926 -0.8074  0.7776  0.1926
          0.1926  0.7776  0.1926  0.7776 -0.8074  0.1926 -0.8074
        Columns 8 through 9
          0.7776  0.7776
         -0.8074 -0.8074
          0.1926  0.1926
         -0.8074 -0.8074
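The steps on slides 41 to 50 can be condensed into one short script. This is a sketch, not the deck's own code: it uses size(cseq, 2) for the motif length, because length() returns the largest dimension and would give the wrong number of columns if there were more sequences than positions. The final scoring step and the name site are additions for illustration.

```matlab
% Same three example sequences as on slide 41.
a = {'ATATAGGAG', 'AATTATAGA', 'TTAGAGAAA'};
cseq = char(a);                         % 3-by-9 character matrix

% Count each nucleotide per column (rows A, C, G, T).
alphabet = 'ACGT';
counts = zeros(4, size(cseq, 2));       % size(.,2) = number of positions
for b = 1:4
    counts(b, :) = sum(cseq == alphabet(b), 1);
end

% Pseudocount of 1, normalize, divide by background 0.25, take log2.
pseudo  = counts + 1;
freq    = pseudo ./ repmat(sum(pseudo, 1), 4, 1);
logPswm = log2(freq / 0.25);

% Illustrative addition: score one 9-mer against the log-odds matrix.
site = 'ATATAGGAG';
[~, rows] = ismember(site, alphabet);
score = sum(logPswm(sub2ind(size(logPswm), rows, 1:numel(site))));
fprintf('log-odds score of %s: %.4f bits\n', site, score);
```

Comparing the characters directly also avoids the hard-coded ASCII codes 65, 67, 71 and 84.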
