
RNA-Seq datasets

RNA-Seq datasets. Dan Lawson. New buzz word (old data). In the beginning there were ESTs... and then there was Roche 454... and then Solexa/Illumina. Why do we generate data sets? Who is producing data sets? Where do we obtain these? What can we use them for? How do we organise these?


Presentation Transcript


  1. RNA-Seq datasets, Dan Lawson

  2. New buzz word (old data)
     • In the beginning there were ESTs...
     • and then there was Roche 454...
     • and then Solexa/Illumina.
     • Why do we generate data sets?
     • Who is producing data sets?
     • Where do we obtain these?
     • What can we use them for?
     • How do we organise these?
     VectorBase 2012

  3. Why do we produce RNA-Seq data sets?
     • Access to the transcriptome of an organism (speed vs. cost)
     • Technical issues with the genome of that species (size, repeat content)
     • Quantification of gene expression levels (absolute and relative)
     • Analysis of these data sets both requires and can deliver improvements to the quality of the predicted gene structures

  4. Who is producing RNA-Seq data sets?
     • Almost all de novo genome sequencing projects, in order to produce a substrate for gene prediction
     • Large studies (such as the Vosshall and Krzywinski DBPs)
     • Small studies (such as the Zwiebel chemosensors)
     • XXXXX[orgn] AND study_type_transcriptome_analysis[prop]
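
The last bullet on slide 4 reads like an NCBI Entrez search term with `XXXXX` standing in for an organism name. A minimal sketch of filling in that template and turning it into an E-utilities `esearch` URL follows. The choice of the `sra` database, the `retmax` value, and the helper names `transcriptome_term` and `esearch_url` are assumptions made for illustration; the slide itself does not say which database the query was run against.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def transcriptome_term(organism: str) -> str:
    """Fill the slide's query template with a concrete organism name."""
    return f"{organism}[orgn] AND study_type_transcriptome_analysis[prop]"


def esearch_url(organism: str, db: str = "sra", retmax: int = 20) -> str:
    """Build (but do not fetch) an esearch URL for the query.

    db="sra" is an assumption: the slide does not name the database.
    """
    params = {"db": db, "term": transcriptome_term(organism), "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(transcriptome_term("Anopheles gambiae"))
    print(esearch_url("Anopheles gambiae"))
```

The URL is only constructed here, not requested, so the sketch runs without network access.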

  5. RNA-Seq data sets in VectorBase
     • We do not want to be the archival database for these data sets (as they are large and will be very common)
     • We do want to identify important sets and present some level of processed/analysed data
     • All sets require some level of QC/filtering
     • All sets require alignment back to a reference genome
       • Default aligner has been bowtie (but we know this is sub-optimal)
       • Other aligners used include inchworm, gsnap, bwa
       • Output is a BAM file
     • Use SAMtools to index the BAM files (so that the Ensembl tools, a viewer and a slicer, can use these sets)
     • To do: move indexed BAM files onto the FTP site
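
The BAM indexing step on slide 5 can be sketched as a pair of SAMtools invocations. The slide only mentions indexing; the coordinate sort is added here because `samtools index` expects coordinate-sorted input. The helper name `bam_postprocess_commands` and the `.sorted.bam` naming are hypothetical, and the commands are only assembled, not executed.

```python
from pathlib import Path


def bam_postprocess_commands(aligned_bam: str) -> list[list[str]]:
    """Return the samtools commands that sort and then index a BAM file.

    The output name "<stem>.sorted.bam" is an illustrative convention,
    not something taken from the slides.
    """
    src = Path(aligned_bam)
    sorted_bam = src.with_name(src.stem + ".sorted.bam")
    return [
        # Coordinate-sort the alignments; indexing requires sorted input.
        ["samtools", "sort", "-o", str(sorted_bam), str(src)],
        # Write the .bai index next to the sorted BAM.
        ["samtools", "index", str(sorted_bam)],
    ]


if __name__ == "__main__":
    for cmd in bam_postprocess_commands("sample1.bam"):
        print(" ".join(cmd))
```

With SAMtools installed, each command list could be passed to `subprocess.run(cmd, check=True)`.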

  6. Using RNA-Seq data: Gene prediction
     • Aligned RNA-Seq data sets provide
       • Coverage plots which can be processed to transfrags
       • Exon-intron junction data
     • Use in automated annotation (MAKER)
       • Requires assembly/clustering for performance reasons
       • Useful for providing training data for ab initio predictors
       • Transfrags should be used with caution in early rounds of MAKER
     • Use in manual annotation (Apollo/Artemis)
       • Identification of novel predictions, exons
       • Confirmation/correction of intron junction data
       • Manual inclusion of untranslated regions (UTRs)
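
One simple way to picture how a coverage plot becomes transfrags is to treat every contiguous run of sufficiently covered bases as one fragment. The sketch below does exactly that on a per-base depth list. It is an illustration only, not the method used by VectorBase or MAKER; the function name and the `min_cov` and `min_len` thresholds are assumptions.

```python
def coverage_to_transfrags(coverage, min_cov=1, min_len=1):
    """Return half-open (start, end) intervals of contiguous coverage.

    coverage: per-base read depth along one sequence (0-based positions).
    min_cov:  minimum depth for a base to count as covered (assumed value).
    min_len:  minimum interval length to report (assumed value).
    """
    transfrags = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov:
            if start is None:
                start = pos  # open a new covered run
        elif start is not None:
            if pos - start >= min_len:
                transfrags.append((start, pos))
            start = None  # close the run
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_len:
        transfrags.append((start, len(coverage)))
    return transfrags


if __name__ == "__main__":
    depth = [0, 0, 3, 4, 5, 0, 0, 2, 2, 0]
    print(coverage_to_transfrags(depth))
```

Raising `min_cov` or `min_len` trades sensitivity for fewer spurious fragments, which matches the slide's advice to treat transfrags with caution.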

  7. Using RNA-Seq data: Gene expression
     • Use the abundance of reads in an RNA-Seq experiment to assay the level of expression for a locus
     • Requires:
       • Aligned RNA-Seq data sets (BAM)
       • Annotation sets (GFF/GTF)
     • Processed to give FPKM/RPKM values for expression levels
     • Storage of these data in BASE2/GDAV (as discussed by Bob yesterday)
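
For reference, RPKM (reads per kilobase of feature per million mapped reads) is commonly defined as 10^9 × C / (N × L), where C is the number of reads mapped to the feature, N the total number of mapped reads in the experiment, and L the feature length in bases. FPKM uses the same formula with fragments in place of reads. A minimal sketch with a hypothetical function name and made-up example numbers:

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads.

    RPKM = 1e9 * C / (N * L)
    C = reads mapped to the feature
    N = total mapped reads in the experiment
    L = feature length in bases
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return 1e9 * read_count / (total_mapped_reads * feature_length_bp)


if __name__ == "__main__":
    # Illustrative numbers only: 500 reads on a 2 kb gene, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

In practice the read counts would come from the aligned BAM file and the feature lengths from the GFF/GTF annotation listed on the slide.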

  8. RNA-Seq visualization of coverage
     • BAM viewer (VectorBase)
       • Good for a single lane (or a small number of lanes)
       • Flexible, user chooses which experiments to visualize
       • Becomes slow and unwieldy with a medium-to-large number of lanes
     • Multiple experiments (FlyBase)
       • Good for multiple experiments
       • Pre-defined set of experiments
       • Fast response time


  10. RNA-Seq questions #1
      • Given limited space/speed
        • What are the key experiments we can support?
        • Criteria for defining these?
        • Pre/post publication data sets?
        • Shelf life for an RNA-Seq experiment?
      • How do we aggregate across different experiments?
        • Coverage/Junctions
        • By species, developmental stage, body part, condition
