
Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome

Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Peng et al. Nature Biotechnology (2012) doi:10.1038/nbt.2122. Presented by: GUAN Peiyong, 23rd Feb 2012. Overview. RNA Editing Concepts. BGI’s Methodology. Definition, Mechanisms, Functions.


Presentation Transcript


  1. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome Peng et al. Nature Biotechnology (2012) doi:10.1038/nbt.2122 Presented by: GUAN Peiyong 23rd Feb 2012

  2. Overview RNA Editing Concepts BGI’s Methodology • Definition • Mechanisms • Functions • Data • Pipeline • Results

  3. Definition • Mechanisms • Functions RNA Editing Concepts

  4. Gott & Emeson. Annu Rev Genet. 2000

  5. RNA Editing | Definition • RNA editing can be broadly defined as any site-specific alteration in an RNA sequence that could have been copied from the template, excluding changes due to processes such as RNA splicing and polyadenylation. • RNA editing is a process that changes the identity of an RNA base after it has been transcribed from a DNA sequence. Gott & Emeson. Annu Rev Genet. 2000. E. C. Hayden, Nature 473, 432 (2011).

  6. RNA Editing | Mechanisms • Insertion / deletion RNA editing • Posttranscriptional Nucleotide Insertion/Deletion • Nucleotide Deletion-Insertion • Nucleotide Insertion During Transcription • Mixed Nucleotide Insertion • Conversion / substitution editing • Adenosine-to-Inosine Editing (A-I or A-G, most prevalent in human) • Enzyme ADAR (adenosine deaminases that act on RNA). • Cytidine-to-Uridine Editing (C-U) • Enzyme APOBECs (apolipoprotein B mRNA editing enzymes, catalytic polypeptide-like). • E.g., CAA → UAA (STOP) • Uridine-to-Cytidine Editing (U-C) Li et al. Science (2011) doi:10.1126/science.1207018. Gott & Emeson. Annu Rev Genet. 2000

  7. RNA Editing | Functions E. C. Hayden, Nature 473, 432 (2011).

  8. Data • Pipeline • Results BGI Methodology

  9. Data | Preparation • 75bp and 100bp • 90bp, strand-specific • Lymphoblastoid Cell Line • Illumina Genome Sequence Analyzer • 767.58 million reads (73.84%) uniquely aligned

  10. Data | Preparation • RNA-Seq of a lymphoblastoid cell line from a male Han Chinese individual (YH) • Genome sequence was reported previously. • Nature 456, 60-65 (2008) • 767 million sequence reads • RNA-Seq • 75bp and 100bp Poly (A)+ • 90bp Poly (A)-, strand-specific sequencing • Small RNA-Seq

  11. Data | Sequencing Coverage

  12. Data | Sequencing Depth

  13. Data | Simulated Data • Paired-end reads with a fixed length of 75 bp were simulated randomly from chromosome 1 of the NCBI human RefSeq, which also served as the reference. • Two sets of simulated data were created: • Set #1: Random SNV by MAQ with default options (5-, 10-, 20-, and 50-fold coverage). • Set #2: A→G substitution at positions referenced in the DARNED database (50-fold coverage).

  14. Pipeline | Overview

  15. Pipeline | Illumina reads alignment (SOAP2) • Because read alignment across splice junctions is potentially uncertain, SOAP2 was used rather than tools that perform gapped alignment across exon boundaries, such as SOAPsplice. • Reference genome (NCBI Build 36.1, hg18). • Paired reads were aligned together, with both in the correct orientation. • Aligning the cDNA reads to the reference genome: • ≤ 3 mismatches for the 75-bp reads. • ≤ 4 mismatches for the 90-bp and 100-bp reads. • Best hits – alignments with the least number of mismatches: • Uniquely placed – 1 best hit (kept). • Repeatedly placed – multiple equal best hits (discarded). • Potential PCR duplicates (discarded). • Retained: reads with unique, ungapped genome alignment.
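The best-hit rule on this slide can be sketched in Python as follows. This is only an illustration of the decision logic, not SOAP2's own code: the function name `classify_read` and the input format (one mismatch count per candidate alignment of a read) are assumptions made for the example.

```python
def classify_read(mismatch_counts, max_mismatches):
    """Classify a read from the mismatch counts of its candidate alignments.

    Returns "unaligned", "unique" (kept) or "repeat" (discarded).
    The input format is hypothetical; a real aligner reports this differently.
    """
    # Keep only alignments within the mismatch tolerance
    # (3 for 75-bp reads, 4 for 90-bp and 100-bp reads on the slide).
    allowed = [m for m in mismatch_counts if m <= max_mismatches]
    if not allowed:
        return "unaligned"
    best = min(allowed)
    # Uniquely placed means exactly one alignment reaches the best score.
    return "unique" if allowed.count(best) == 1 else "repeat"
```

For instance, a 75-bp read with candidate alignments at 1 and 3 mismatches would be kept as uniquely placed, while one with two alignments at 2 mismatches each would be discarded.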

  16. Pipeline | RNA editing sites/RNA-centric SNVs detection • Multiple filters with stringent thresholds to facilitate unbiased detection of bona fide editing or base substitution events in the RNA-Seq reads. • RNA-centric SNVs were first identified from aligned cDNA reads using SOAPsnp, which uses a method based on Bayes’ theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment and recurring experimental errors, with parameters e = 0.0001 and r = 0.00005. • We further lifted a default filter in the basic filter step of the program that was designed to discard sequence reads with more than one variant within a 5-bp span (for clustered AG editing?). • SNVs → 10 filtering steps

  17. Pipeline | 1. Basic filter • Retain SNVs that meet the following criteria: • Quality score of consensus genotype ≥ 20. • Covered depth ≥ 5. • Repeats (estimated copy number of the flanked sequence in genome) ≤ 1.
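A minimal Python sketch of the basic filter, using the three thresholds from the slide. The `SnvCall` record and its field names are hypothetical; they are not SOAPsnp's output format.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    # Hypothetical record; field names are illustrative only.
    consensus_quality: int  # quality score of the consensus genotype
    depth: int              # number of reads covering the site
    copy_number: float      # estimated copy number of the flanking sequence


def passes_basic_filter(call: SnvCall) -> bool:
    """Apply the three basic-filter thresholds listed on the slide."""
    return (
        call.consensus_quality >= 20
        and call.depth >= 5
        and call.copy_number <= 1
    )
```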

  18. Pipeline | 2. Read parameter filter • Optimize parameters using simulated data set: • m, the minimal distance of a SNV site to its supporting reads’ ends • q, minimal sequencing quality score of SNV-corresponding nucleotide • (m, q)=(15,20) • n, minimal number of supporting reads that meet the above two cutoff parameters • n = 2

  19. Pipeline | 2. Read parameter filter • Two sets of data: • Set #1: random substitution • Set #2: A→G substitution in the DARNED database
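The (m, q, n) parameters of the read parameter filter can be sketched as below. The input format (a 0-based offset of the SNV within each supporting read, the read length, and the base quality at the SNV) is an assumption for illustration, as is measuring the distance as the number of bases to the nearer read end.

```python
def count_qualified_reads(supporting_reads, m=15, q=20):
    """Count supporting reads that satisfy the m and q cutoffs.

    supporting_reads: iterable of (offset, read_length, base_quality),
    where offset is the assumed 0-based position of the SNV in the read.
    """
    count = 0
    for offset, read_length, base_quality in supporting_reads:
        # Distance from the SNV to the nearer end of the read.
        distance_to_end = min(offset, read_length - 1 - offset)
        if distance_to_end >= m and base_quality >= q:
            count += 1
    return count


def passes_read_parameter_filter(supporting_reads, m=15, q=20, n=2):
    """Keep the SNV if at least n supporting reads meet both cutoffs."""
    return count_qualified_reads(supporting_reads, m, q) >= n
```

With the slide's values (m, q) = (15, 20) and n = 2, a site supported only by reads in which the variant lies near a read end, or has a low-quality base call, would be dropped.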

  20. Pipeline | 3. RNA-DNA variants filter. • Focus on RNA-DNA variants only: • Sites of which DNA genotypes are the same as RNA genotypes were removed.

  21. Pipeline | 4. YH genome variants filter. • Distinguish RNA editing from allele-specific expression and duplication polymorphisms: • Keep SNVs remaining from step 3 only if their corresponding DNA genotypes are homozygous and diploid in copy number. • Parameters of YH genome sequence reads corresponding to a candidate site: • Depth is ≥5; • Consensus quality is ≥20; • Average quality of the first best allele ≥ 20; • Depth of the second best allele, if present, is <5% of the total number of reads; • The second best allele should not be the variant allele in the RNA data; • And average sequencing quality of the second best allele is <10. • Exclude genomic duplication polymorphisms: • CNVnator tool with bin set to 50, and removed sites that were nondiploid in nature.
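The DNA-side criteria on this slide can be written as one predicate. This is a sketch under stated assumptions: the function name, argument names and the idea of passing per-site summaries directly are hypothetical, and the copy-number check (CNVnator) is not covered here.

```python
def dna_site_is_homozygous(depth, consensus_quality, best_allele_avg_quality,
                           second_allele=None, second_allele_depth=0,
                           second_allele_avg_quality=0.0,
                           rna_variant_allele=None):
    """Check the genome-read criteria for a candidate site (copy number excluded)."""
    # Depth, consensus quality and best-allele quality thresholds.
    if depth < 5 or consensus_quality < 20 or best_allele_avg_quality < 20:
        return False
    # With no second allele observed, the remaining criteria do not apply.
    if second_allele is None:
        return True
    return (
        second_allele_depth < 0.05 * depth           # <5% of the reads
        and second_allele != rna_variant_allele      # must not be the RNA variant
        and second_allele_avg_quality < 10           # low average quality
    )
```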

  22. Pipeline | 5. MES filter. • Remove misaligned reads that arise from mapping error inherent to the mapping algorithm (MES): • Simulate read sequences based on all human genes (hg18 transcriptome) using MAQ without mutation (-r parameter). • Align simulated reads using SOAP2 & call SNVs using SOAPsnp. • Filter the identified SNVs using filters #3 and #4 → MES. • SNVs that matched the MES were removed.

  23. Pipeline | 6. Strand filter. • Remove potential strand-specific errors in sequences generated by the Illumina platform: • Evaluate the counts of the reads mapped to the +/- strands using Fisher’s exact test. • Discard the site if: • Reads exhibited strand bias in distribution (P < 0.01) & • Number of supporting reads mapped to either strand is <2.
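A self-contained sketch of the strand filter using a two-sided Fisher's exact test built from the standard library. The slide does not spell out the 2x2 table, so the layout used here (reference-supporting versus variant-supporting reads on the + and - strands) is an assumption, as are the function names. The discard rule combines the two conditions with "and", as written on the slide.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def discard_for_strand_bias(ref_plus, ref_minus, var_plus, var_minus,
                            alpha=0.01, min_per_strand=2):
    """Discard if strand bias is significant and one strand has <2 variant reads."""
    p = fisher_exact_two_sided(ref_plus, ref_minus, var_plus, var_minus)
    return p < alpha and min(var_plus, var_minus) < min_per_strand
```

In practice a library routine such as SciPy's `scipy.stats.fisher_exact` could replace the hand-written test; the pure-Python version is used here only to keep the example dependency-free.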

  24. Pipeline | 7. BLAT filter. • Address the potential pitfall of paralogous sequences in site calling: • Use BLAT to search for SNVs’ supporting reads in the reference genome. • Same mismatch tolerance used in SOAP2 alignment. • Discard all supporting reads with >1 hit. • Filter SNVs that have <2 qualified supporting reads.
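The counting rule of the BLAT filter can be sketched as follows. The example assumes that BLAT has already been run externally and that the number of genome hits per supporting read is available as a plain list; the function name and input format are hypothetical.

```python
def passes_blat_filter(hit_counts, min_qualified_reads=2):
    """Keep an SNV if enough supporting reads map to exactly one location.

    hit_counts: number of genome hits reported for each supporting read
    (assumed to come from a separate BLAT run).
    """
    # Reads with more than one hit are discarded as possibly paralogous.
    qualified = sum(1 for hits in hit_counts if hits == 1)
    return qualified >= min_qualified_reads
```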

  25. Pipeline | 8. Known SNPs filter. • Eliminate germline variants: • Cross-reference the remaining SNVs against known SNP databases: • 1000 Genomes Project. • Genomes of Yoruba, Watson, Korean. • dbSNP (version 129).

  26. Pipeline | 9. Multiple types of mismatches filter. • Discard SNV candidate sites with >1 nonreference type: • For example: • Reference allele – A • Nonreference alleles – G and T

  27. Pipeline | 10. Editing degree filter. • Exclude polymorphic sites with extreme degree of variation (100%). • Remaining sites → further analysis
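Filters 9 and 10 both depend only on the bases observed at a site, so they can be illustrated together. In this sketch the editing degree is taken to be the fraction of covering reads that carry a non-reference base, which is an assumption; the function names and the string-of-bases input are also illustrative.

```python
def editing_degree(ref_base, read_bases):
    """Fraction of reads at the site that carry a non-reference base."""
    if not read_bases:
        return 0.0
    variant = sum(1 for base in read_bases if base != ref_base)
    return variant / len(read_bases)


def passes_filters_9_and_10(ref_base, read_bases):
    """Require exactly one non-reference type and an editing degree below 100%."""
    nonref_types = {base for base in read_bases if base != ref_base}
    # Filter 9: more than one non-reference type is discarded
    # (a site with no variant reads is not a candidate at all).
    if len(nonref_types) != 1:
        return False
    # Filter 10: sites where every read carries the variant are excluded.
    return editing_degree(ref_base, read_bases) < 1.0
```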

  28. Pipeline | Analysis of the sequence and structural features of RNA editing. • To identify sites in dsRNA structure, or sites in 3′-UTR that are likely microRNA seed matches: • Li, J.B. et al. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324, 1210–1213 (2009).

  29. Pipeline | Analysis of the sequence and structural features of RNA editing. • Editing sites clustering: • Defined as occurrence of ≥3 variants per 100bp. • Conserved region: • Annotated as ‘most conserved’ by the UCSC genome browser. • Coding sequence: • Defined by the RefSeq annotation. • Highly edited genes: • ≥10 variant sites per gene • Gene enrichment: • DAVID pathway-classification tool.
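The clustering definition (≥3 variants per 100bp) can be sketched with a sliding window over sorted site positions on one chromosome. How the 100-bp window is anchored is not stated on the slide; this example assumes that sites belong to the same window when their positions differ by less than 100, and the function name is hypothetical.

```python
def clustered_sites(positions, min_sites=3, window=100):
    """Return the positions that fall in any window holding >= min_sites sites."""
    pos = sorted(set(positions))
    clustered = set()
    start = 0
    for end in range(len(pos)):
        # Advance the left edge until the window spans fewer than `window` bp.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 >= min_sites:
            clustered.update(pos[start:end + 1])
    return sorted(clustered)
```

The same counting idea extends to the "highly edited genes" criterion by grouping sites per gene and keeping genes with at least 10 sites.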

  30. Pipeline | Identification of miRNA and editing (1). • Filtering of small RNA reads: • Filter out low-quality reads; • Trim 3′ adaptor sequence by a dynamic programming algorithm; • Remove adaptor contaminations formed by adaptor ligation; • Retain only short trimmed reads of sizes from 18 to 30 nt.

  31. Pipeline | Identification of miRNA and editing (2). • Annotate and categorize small RNAs: • Filter out small RNA reads possibly from known noncoding RNAs: • rRNA, tRNA, snRNA and snoRNA deposited in the Rfam database and the NCBI Genbank. • Discard small RNA reads assigned to exonic regions. • Subject the remaining small RNA to MIREAP, which identifies miRNA candidates according to the canonical hairpin structure and sequencing data.

  32. Pipeline | Identification of miRNA and editing (3). • Align identified miRNA reads to miRNA reference sequences: • BLAST, ≤1 mismatch. • Reads that were uniquely aligned and overlapped with known miRNAs were used to identify miRNA editing sites. • First, identify reads with mismatch to hg18 genome. • Reads with mismatch within 1 nt at 5′ end or 2 nt at 3′ end were discarded (?). • Then, identify miRNA edits by the following criteria: • Sequencing depth of editing sites ≥ 5; • Frequency of SNV occurrence ≥5% & ≤95%; • Variants that were not found in previous SNP annotations • YH, 1000 genomes project, Yoruba, Watson, Korean and dbSNP version 129.
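The final miRNA editing criteria (depth, frequency range, and absence from SNP annotations) can be sketched as a single check. The function name, the `(chromosome, position)` key and the use of a Python set for the known-SNP lookup are assumptions made for the example.

```python
def is_mirna_edit_candidate(depth, variant_reads, site_key, known_snps):
    """Apply the depth, frequency and known-SNP criteria from the slide.

    site_key: hypothetical identifier such as (chromosome, position).
    known_snps: set of site keys found in the SNP annotations.
    """
    if depth < 5:
        return False
    frequency = variant_reads / depth
    # Frequency of SNV occurrence must lie between 5% and 95% inclusive.
    if not 0.05 <= frequency <= 0.95:
        return False
    return site_key not in known_snps
```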

  33. Results| Editing Events Identified • 22,688 RNA editing sites • Poly (A)+ • To ascertain the editing type for these sites, cross-reference against RefSeq. • ~30% of the identified sites: • Unannotated in the database (5,381). • Corresponded to overlapping transcript units on both strands (57). • 11,467 sites were unambiguously mapped to known gene models. • Poly (A)- • To identify editing sites in the intergenic regions of the transcriptome • 11,221 RNA editing sites identified.

  34. Results| Editing Sites Distribution 50% leads to changes in coded amino acids.

  35. Results | Editing sites characterization (figure panels: Poly (A)+; Poly (A)-; Poly (A)+, CDS)

  36. Results| Novel Editing Sites

  37. Results| Frequency of nucleotides in the flanking sequences • Poly (A)+ • Poly (A)-

  38. Results| % of Edits in Conserved Regions • Poly (A)+ • Poly (A)-

  39. Results| Experimental Validation • Two replicates of PCR amplification and Sanger sequencing of both DNA and RNA from the same batch of cells from the YH cell line.

  40. Results | Comparison with Other Datasets

  41. Results | Genes with multiple editing sites.

  42. Results | RNA editing and miRNA-mediated regulation • 2,474 editing sites in 3′-UTRs • Extract 6 + 1 + 6 bp sequence & search in miRBase.
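Extracting the "6 + 1 + 6" sequence can be illustrated with simple slicing. This sketch reads the expression as six bases on each side of the editing site, which is an interpretation of the slide; the function name and the 0-based index are assumptions, and the miRBase search itself is not shown.

```python
def flank_13mer(sequence, site_index, flank=6):
    """Return the site plus `flank` bases on each side, or None near the ends."""
    start = site_index - flank
    end = site_index + flank + 1
    # Skip sites too close to either end of the sequence.
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]
```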

  43. Summary • Pipeline for identifying RNA editing events by screening RNA-DNA differences in the same individual. • 10 filters to handle various sources of false positives. • Experimentally validated novel RNA editing sites. • Evidence of extensive RNA editing in a human cell line. • Question: since the model parameters were optimized using simulated data based on DARNED, why is there no significant overlap between the DARNED database and the editing sites discovered by BGI?

  44. RNA Editing

  45. Overview • Literature Review • RNA Editing Concepts • Definition • Mechanisms • Functions • RNA Editing Site Prediction • Prediction Methods • Machine Learning Based Methods • Mapping Based Methods • Database

  46. Literature Review • RNA Editing Concepts • RNA Editing Site Prediction

  47. Definition • Mechanisms • Functions RNA Editing Concepts

  48. Gott & Emeson. Annu Rev Genet. 2000
