This report analyzes sequences of segmentation classes in genomic data, highlighting overrepresented strings of segments and their local context. It updates the enrichment and depletion assessments for the round 8 segmentations and examines the probabilities associated with string occurrences. Strings with a high observed-to-expected ratio are evaluated against RNAseq contigs from CSHL. Future steps include improving the evaluation of multi-segment strings and investigating whether sequence similarity implies similar function.
Overrepresented Segment Strings (Aug/8/2011) Bob Harris Penn State Center for Comparative Genomics and Bioinformatics rsharris@bx.psu.edu
Overview • Analysis of segmentation sequences, incorporating longer local context • Update of previous enrichment/depletion plots • For the round8 segmentations
Motivation • Quick eyeball test using one-character class encoding: A = class 0, B = class 1, … • Classes 2, 13, 24 are C, N, Y
> segway.k562.coordinated chr10:812820-872329
AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX
CNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWI
CYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXA
GAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVL
QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL
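The one-character encoding above can be sketched in a few lines of Python. This is only an illustration, assuming class labels are integers in the range 0 to 25; the function names `encode_classes` and `decode_classes` are hypothetical and do not come from the slides.

```python
import string

def encode_classes(labels):
    """Map integer class labels to letters: 0 -> 'A', 1 -> 'B', and so on.

    Assumes every label is in 0..25 (hypothetical helper, not from the slides).
    """
    return "".join(string.ascii_uppercase[k] for k in labels)

def decode_classes(text):
    """Inverse mapping: 'A' -> 0, 'B' -> 1, and so on."""
    return [ord(ch) - ord("A") for ch in text]

if __name__ == "__main__":
    print(encode_classes([2, 13, 24]))   # classes 2, 13, 24
    print(decode_classes("CNY"))
```

Encoding each segment as one letter makes repeated groups such as C, N, Y visible by eye, and lets ordinary string tools operate on a segmentation.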
Redundancy Apparent, but… • How surprising are the C,N,Y (2,13,24) groups? • Together these classes have only average probability • But 1st and 2nd order probabilities favor continuing in this group
> segway.k562.coordinated chr10:812820-872329
AOUNDKAGAGXGRXNXNCDUXNYNUNCNCYCYCYNYCYCYCNCYNCYCNX
CNCYCNXNYNYNCNCYCNDCYNDYCYCYCNCICICDNXCICIWTMJMTWI
CYCYNCBDUXRNCURDXNUDUVRGVUAVAGKUVUXGAVARXRDKDVXKXA
GAXDXRAXRVKPBPIQBQBQVBQBQLQHQHLQVKQVQVLTLBVUVQVKVL
QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL
Overrepresented Strings • Consider strings of 2N segments • Estimate each string's expected probability with an Nth-order model • e.g. for N = 2, pr(ABCD) = pr(AB) pr(C|AB) pr(D|BC) • “Evaluate” strings with a high observed:expected ratio by comparison to “features”, in this case RNAseq contigs • Caveat(?): length of segments ignored
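The expected-probability estimate above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a single class sequence (chromosome boundaries are ignored), estimates all probabilities from raw counts with no smoothing, and uses hypothetical names (`ngram_counts`, `obs_exp`, `min_obs`).

```python
from collections import Counter

def ngram_counts(seq, n):
    """Count every window of n consecutive segment classes in seq."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def obs_exp(seq, order=2, min_obs=0):
    """Observed vs expected counts for strings of 2*order segments.

    Expected probability follows the chain rule of an order-N model,
    e.g. for order 2: pr(ABCD) = pr(AB) * pr(C|AB) * pr(D|BC).
    """
    length = 2 * order
    ctx = ngram_counts(seq, order)        # counts of N-grams
    ext = ngram_counts(seq, order + 1)    # counts of (N+1)-grams
    obs = ngram_counts(seq, length)       # counts of 2N-grams
    # How often each N-gram is followed by anything (denominator of pr(x|context)).
    followed = Counter()
    for gram, c in ext.items():
        followed[gram[:order]] += c
    n_ctx = sum(ctx.values())
    n_windows = sum(obs.values())
    results = {}
    for s, o in obs.items():
        if o < min_obs:                   # drop rare observations
            continue
        p = ctx[s[:order]] / n_ctx
        for i in range(order, length):
            p *= ext[s[i - order:i + 1]] / followed[s[i - order:i]]
        e = p * n_windows
        results[s] = (o, e, o / e)
    return results

if __name__ == "__main__":
    table = obs_exp("CYCNCYNCNYNCNCNCNCN", order=2)
    for s, (o, e, r) in sorted(table.items(), key=lambda kv: -kv[1][2])[:5]:
        print("-".join(s), o, round(e, 2), round(r, 3))
```

Sorting by the observed/expected ratio after filtering with `min_obs` mirrors the ranking described on the slide; the threshold for "rare" is not stated in the source.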
Overrepresented Strings, Example • Length-4 strings in segway.k562.coordinated • Highest obs/exp ratio, after eliminating rare observations
string         #obs’d    #exp’d    obs/exp
21-10-0-21       3761    970.80   3.874112
21-0-10-21       3561    966.65   3.683865
13-23-20-13      5227   2386.44   2.190296
13-20-23-13      5177   2371.56   2.182953
13-23-17-13      3205   1530.04   2.094711
13-17-23-13      3156   1535.76   2.055004
16-21-11-16      4833   2466.86   1.959174
14-23-17-14      3263   1711.13   1.906928
16-11-21-16      4629   2443.15   1.894687
10-6-0-10        6980   3686.84   1.893222
14-17-23-14      3180   1686.41   1.885658
10-0-6-10        6846   3632.72   1.884536
23-0-6-23        3265   1748.77   1.867023
23-6-0-23        3254   1749.80   1.859644
23-6-14-23       8780   4821.21   1.821121
23-14-6-23       8933   4927.23   1.812985
24-13-3-24       5419   3007.67   1.801727
23-0-14-23       7142   4023.34   1.775141
24-3-13-24       5270   2987.69   1.763906
23-6-10-3        3045   1734.93   1.755115
24-3-10-3        3192   1832.07   1.742287
3-10-6-23        3046   1751.86   1.738724
23-14-0-23       7000   4028.87   1.737461
3-10-3-24        3126   1809.36   1.727681
…
CSHL RNAseq contigs • ftp://genome.crg.es/pub/Encode/data_analysis/ForDeadZones/Contigs_IDR0.1_CSHL.tar.gz • Differentiated by cell line (14), compartment (6), RNA fraction (4) • and attributed to 11 biotypes (gencode v7 exons) • non coding, protein coding, etc. • and a 12th type: empty, or “no exon” • From Sarah Djebali, Felix Schlesinger, Wei Lin
Measuring Enrichment • V_{f,s} = enrichment of string s for feature f • {s} = set of bases covered by string s (in either direction) • {f} = set of bases covered by feature f • {fs} = intersection of {f} and {s} • {F} = union of {f’} for all features f’ • # = size of set • I plot log2(V_{f,s}): fold enrichment if positive, fold depletion if negative
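The defining formula for V_{f,s} is not present in the text of this slide, so the sketch below rests on an assumption: it takes V_{f,s} = (#{fs} / #{s}) / (#{f} / #{F}), the fraction of string-covered bases that fall in the feature, relative to the feature's share of all feature-covered bases. This may differ from the ratio actually used. Bases are modeled as Python sets of integer positions, and the function name `log2_enrichment` is hypothetical.

```python
import math

def log2_enrichment(f, s, F):
    """log2 of an ASSUMED enrichment ratio, (#fs / #s) / (#f / #F).

    f, s, F are sets of base positions. Returns None when the string and the
    feature share no bases, since log2(0) is undefined (an assumption about
    how empty cells would be handled).
    """
    fs = f & s
    if not fs:
        return None
    v = (len(fs) / len(s)) / (len(f) / len(F))
    return math.log2(v)

if __name__ == "__main__":
    F = set(range(40))                          # union of all features
    f = set(range(10))                          # one feature
    s = set(range(5)) | set(range(100, 105))    # bases covered by a string
    print(log2_enrichment(f, s, F))             # positive: enrichment
```

On the log2 scale a value of 1 is two-fold enrichment and a value of -1 is two-fold depletion, which is why the sign alone tells the two apart.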
Single-segment Enrichment • segway.k562.coordinated vs CSHL RNAseq contigs • white = no occurrences
Length-4 Strings Enrichment • segway.k562.coordinated vs CSHL RNAseq contigs (highest observed/expected strings) • white = no occurrences
Length-4 Strings Enrichment • segway.k562.coordinated vs CSHL RNAseq contigs (highest observed/expected strings)
To Do • Incorporate single-segment enrichment into evaluation of multi-segment strings • Longer strings • Run on all 14 round 8 segmentations • And the bake-off composites
Aligning Class Sequences • Work in progress, with these questions…
• Do longer, highly similar sequences indicate similar function?
segway.k562.coordinated chr10:88422790-88427017 CYCNCYNCNYNCNCNCNCN
segway.k562.coordinated chr13:113696011-113701344 CYCNCYNCNYNCNCNCNCN
• Or do small changes indicate functional differences?
segway.k562.coordinated chr10:133868081-133875219 NCNXnXNXNXNCYNCNCNCNXNCN
segway.k562.coordinated chr13:113638232-113645027- NCNXoXNXNXNCYNCNCNCNXNCN
Aligning Class Sequences • Do longer, highly similar sequences indicate similar function?
Aligning Class Sequences • Or do small changes indicate functional differences?
Alignments • Confounded by presence of 2- and 3-segment cycles • Implement separate search for short repeated cycles • Then align with those masked • Should incorporate segment lengths • May be better to align in peak space
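The separate search for short repeated cycles, followed by masking, could look something like the sketch below. It is a rough illustration under stated assumptions: a cycle is a run in which every character equals the one `p` positions earlier (p = 2 or 3), a run must span at least `min_repeats` full periods to be masked, and segment lengths are ignored. The function name, the `min_repeats` threshold, and the mask character are all hypothetical choices, not taken from the slides.

```python
def mask_short_cycles(text, periods=(2, 3), min_repeats=3, mask_char="."):
    """Replace runs of 2- and 3-segment cycles (e.g. LQLQLQ, CNYCNYCNY) with mask_char."""
    masked = list(text)
    for p in periods:
        i = 0
        while i + p <= len(text):
            # Extend the run while each character repeats the one p positions back.
            j = i + p
            while j < len(text) and text[j] == text[j - p]:
                j += 1
            run = j - i
            # Require enough repeats, and more than one distinct class in the unit.
            if run >= p * min_repeats and len(set(text[i:i + p])) > 1:
                masked[i:j] = mask_char * run
                i = j
            else:
                i += 1
    return "".join(masked)

if __name__ == "__main__":
    row = "QVQBVLVQVOVQLQLQLQLQLQLHLVUVQLVLQLQLQVLQLQHQLVLQVL"
    print(row)
    print(mask_short_cycles(row))
```

Masked positions keep their coordinates, so an aligner run afterwards can skip or down-weight them without re-indexing the sequence.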
Appendix • The following slides show single-segment enrichment heatmaps for all 14 round 8 segmentations
Single-segment Enrichment • segway.gm12878.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.h1hesc.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.helas3.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.hepg2.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.huvec.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.k562.all vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.k562.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • segway.tier1-2.coordinated vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.GM12878_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.H1_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.HELA_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.HEPG2_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.HUVEC_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences
Single-segment Enrichment • chromhmm.K562_concatenate_25 vs CSHL RNAseq contigs • white = no occurrences