Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome

Peng et al. Nature Biotechnology (2012) doi:10.1038/nbt.2122 • Presented by: GUAN Peiyong, 23rd Feb 2012

Overview • RNA Editing Concepts: Definition • Mechanisms • Functions • BGI’s Methodology: Data • Pipeline • Results

RNA Editing Concepts | Definition • Mechanisms • Functions

Gott & Emeson. Annu Rev Genet. 2000

RNA Editing | Definition • RNA editing can be broadly defined as any site-specific alteration in an RNA sequence that could have been copied from the template, excluding changes due to processes such as RNA splicing and polyadenylation. • RNA editing is a process that changes the identity of an RNA base after it has been transcribed from a DNA sequence. Gott & Emeson. Annu Rev Genet. 2000. E. C. Hayden, Nature 473, 432 (2011).

RNA Editing | Mechanisms • Insertion / deletion RNA editing • Posttranscriptional nucleotide insertion/deletion • Nucleotide deletion-insertion • Nucleotide insertion during transcription • Mixed nucleotide insertion • Conversion / substitution editing • Adenosine-to-inosine editing (A-I, read as A-G; most prevalent in human) • Enzyme: ADAR (adenosine deaminases that act on RNA). • Cytidine-to-uridine editing (C-U) • Enzyme: APOBECs (apolipoprotein B mRNA editing enzymes, catalytic polypeptide-like). • E.g., CAA → UAA (STOP) • Uridine-to-cytidine editing (U-C) • Li et al. Science doi:10.1126/science.1207018; 2011 • Gott & Emeson. Annu Rev Genet. 2000

RNA Editing | Functions E. C. Hayden, Nature 473, 432 (2011).

BGI Methodology | Data • Pipeline • Results

Data | Preparation (figure) • Lymphoblastoid Cell Line • Illumina Genome Sequence Analyzer • 75bp and 100bp reads • 90bp, strand-specific reads • 767.58 million reads (73.84%) uniquely aligned

Data | Preparation • RNA-Seq of a lymphoblastoid cell line from a male Han Chinese individual (YH) • Genome sequence was reported previously. • Nature 456, 60-65 (2008) • 767 million sequence reads • RNA-Seq • 75bp and 100bp: Poly(A)+ • 90bp: Poly(A)-, strand-specific sequencing • Small RNA-Seq

Data | Sequencing Coverage

Data | Sequencing Depth

Data | Simulated Data • Paired-end reads with fixed length of 75 bp, simulated randomly from chromosome 1 of the human RefSeq. • Use chromosome 1 of the NCBI human RefSeq as a reference: • Two sets of simulated data were created: • Set #1: Random SNV by MAQ with default options (5-, 10-, 20-, and 50-fold coverage). • Set #2: A→G substitution at positions referenced in the DARNED database (50-fold coverage).

Pipeline | Overview

Pipeline | Illumina reads alignment (SOAP2) • Because read alignment across splice junctions is potentially uncertain, SOAP2 was used rather than tools that perform gapped alignment across exon boundaries, such as SOAPsplice. • Reference genome (NCBI Build 36.1, hg18). • Paired reads – the two mates were aligned together, both in the correct orientation. • Aligning the cDNA reads to the reference genome: • ≤ 3 mismatches for the 75-bp reads. • ≤ 4 mismatches for the 90-bp and 100-bp reads. • Best hits – alignments with the fewest mismatches: • Uniquely placed – 1 best hit (kept). • Repeatedly placed – multiple equal best hits (discarded). • Potential PCR duplicates (discarded). • Result: reads with a unique, ungapped genome alignment.
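
The best-hit rule on this slide can be illustrated with a minimal Python sketch. This is not SOAP2 itself: the function name `keep_unique_best_hits` and the `(read_id, locus, mismatches)` tuple layout are hypothetical, chosen only to show how "uniquely placed" reads are kept and "repeatedly placed" reads are discarded.

```python
from collections import defaultdict


def keep_unique_best_hits(alignments, max_mismatches):
    """Keep reads whose best hit (fewest mismatches) is unique.

    alignments: iterable of (read_id, locus, mismatches) tuples (hypothetical layout).
    max_mismatches: 3 for 75-bp reads, 4 for 90-bp and 100-bp reads on the slide.
    Returns {read_id: (locus, mismatches)} for uniquely placed reads only.
    """
    hits_by_read = defaultdict(list)
    for read_id, locus, mismatches in alignments:
        if mismatches <= max_mismatches:
            hits_by_read[read_id].append((mismatches, locus))

    kept = {}
    for read_id, hits in hits_by_read.items():
        best = min(m for m, _ in hits)
        best_loci = [locus for m, locus in hits if m == best]
        if len(best_loci) == 1:  # uniquely placed: exactly one best hit
            kept[read_id] = (best_loci[0], best)
        # repeatedly placed reads (several equal best hits) are dropped
    return kept
```

Here a read with hits at 1 and 3 mismatches is kept at the 1-mismatch locus, while a read with two 2-mismatch hits is discarded. PCR-duplicate removal is not covered by this sketch.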

Pipeline | RNA editing sites/RNA-centric SNVs detection • Multiple filters with stringent thresholds to facilitate unbiased detection of bona fide editing or base substitution events in the RNA-Seq reads. • RNA-centric SNVs were first identified from aligned cDNA reads using SOAPsnp, which uses a method based on Bayes’ theorem (the reverse probability model) to call the consensus genotype by carefully considering the data quality, alignment and recurring experimental errors, with parameters e = 0.0001 and r = 0.00005. • A default filter in the basic filter step of the program, designed to discard sequence reads with more than one variant within a 5-bp span, was lifted (to allow clustered A→G editing?). • SNVs → 10 filtering steps

Pipeline | 1. Basic filter • Retain SNVs that meet the following criteria: • Quality score of consensus genotype ≥ 20. • Covered depth ≥ 5. • Repeats (estimated copy number of the flanked sequence in genome) ≤ 1.
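
The three thresholds of the basic filter can be written as a single predicate. The sketch below is illustrative only: the `SnvCall` record and its field names are assumptions, not SOAPsnp's actual output format.

```python
from dataclasses import dataclass


@dataclass
class SnvCall:
    """Hypothetical record for one candidate SNV (field names are illustrative)."""
    chrom: str
    pos: int
    consensus_quality: int  # quality score of the consensus genotype
    depth: int              # number of reads covering the site
    copy_number: float      # estimated copy number of the flanking sequence


def passes_basic_filter(call, min_quality=20, min_depth=5, max_copy_number=1.0):
    """Apply the slide's criteria: quality >= 20, depth >= 5, repeats <= 1."""
    return (
        call.consensus_quality >= min_quality
        and call.depth >= min_depth
        and call.copy_number <= max_copy_number
    )
```

A call with quality 25, depth 8 and copy number 1.0 passes; lowering the quality to 19 or the depth to 4 makes it fail.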

Pipeline | 2. Read parameter filter • Optimize parameters using simulated data set: • m, the minimal distance of a SNV site to its supporting reads’ ends • q, minimal sequencing quality score of SNV-corresponding nucleotide • (m, q)=(15,20) • n, minimal number of supporting reads that meet the above two cutoff parameters • n = 2
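
A minimal sketch of how the (m, q, n) cutoffs might be applied to the reads supporting one SNV. The tuple layout `(offset, read_length, base_quality)` is hypothetical, and measuring the distance to the nearer read end from a 0-based offset is an assumption about how m is counted.

```python
def count_qualified_reads(supporting_reads, m=15, q=20):
    """Count supporting reads that satisfy both the m and q cutoffs.

    supporting_reads: iterable of (offset, read_length, base_quality), where
    offset is the 0-based position of the SNV within the read (assumed layout).
    """
    count = 0
    for offset, read_length, base_quality in supporting_reads:
        # Distance from the SNV to the nearer end of the read (assumption).
        distance_to_end = min(offset, read_length - 1 - offset)
        if distance_to_end >= m and base_quality >= q:
            count += 1
    return count


def passes_read_parameter_filter(supporting_reads, m=15, q=20, n=2):
    """Keep the SNV if at least n supporting reads meet the m and q cutoffs."""
    return count_qualified_reads(supporting_reads, m, q) >= n
```

With four 75-bp supporting reads of which one has the SNV 5 bases from the read start and one has base quality 10, two reads qualify and the site is kept at n = 2.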

Pipeline | 2. Read parameter filter • Two sets of data: • Set #1: random substitution • Set #2: A→G substitution at sites in the DARNED database

Pipeline | 3. RNA-DNA variants filter. • Focus on RNA-DNA variants only: • Sites of which DNA genotypes are the same as RNA genotypes were removed.

Pipeline | 4. YH genome variants filter. • Distinguish RNA editing from allele-specific expression and duplication polymorphisms: • Keep SNVs remaining from step 3 only if their corresponding DNA genotypes are homozygous and diploid in copy number. • Parameters of YH genome sequence reads corresponding to a candidate site: • Depth is ≥5; • Consensus quality is ≥20; • Average quality of the first best allele ≥ 20; • Depth of the second best allele, if present, is <5% of the total number of reads; • The second best allele should not be the variant allele in the RNA data; • And average sequencing quality of the second best allele is <10. • Exclude genomic duplication polymorphisms: • CNVnator tool with bin set to 50, and removed sites that were nondiploid in nature.
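
The DNA-side criteria listed on this slide can be combined into one check per candidate site. The sketch below assumes a hypothetical `DnaPileup` summary of the YH genome reads at the site and a precomputed diploid flag (for example, derived from CNVnator output); none of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DnaPileup:
    """Hypothetical summary of YH genome reads at one site."""
    depth: int
    consensus_quality: int
    best_allele: str
    best_allele_mean_quality: float
    second_allele: Optional[str] = None
    second_allele_depth: int = 0
    second_allele_mean_quality: float = 0.0
    is_diploid: bool = True  # assumed to come from a copy-number call


def is_confident_homozygous(dna, rna_variant_allele):
    """Return True if the DNA evidence meets all criteria on the slide."""
    if not dna.is_diploid:
        return False
    if dna.depth < 5 or dna.consensus_quality < 20:
        return False
    if dna.best_allele_mean_quality < 20:
        return False
    if dna.second_allele is not None:
        if dna.second_allele_depth >= 0.05 * dna.depth:
            return False  # second allele must be below 5% of reads
        if dna.second_allele == rna_variant_allele:
            return False  # DNA shows a trace of the RNA variant
        if dna.second_allele_mean_quality >= 10:
            return False  # second allele must be low quality
    return True
```

A site with 40 DNA reads, all supporting A, passes for an RNA variant G; the same site with a single low-quality G read in the DNA fails, because the second-best DNA allele matches the RNA variant.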

Pipeline | 5. MES filter. • Remove misaligned reads that arise from mapping error inherent to the mapping algorithm (MES): • Simulate read sequences based on all human genes (hg18 transcriptome) using MAQ without mutation (-r parameter). • Align simulated reads using SOAP2 & call SNVs using SOAPsnp. • Filter the identified SNVs using filters #3 and #4 → MES. • SNVs that matched the MES were removed.

Pipeline | 6. Strand filter. • Remove potential strand-specific errors in sequences generated by the Illumina platform: • Evaluate the counts of the reads mapped to the +/- strands using Fisher’s exact test. • Discard the site if: • Reads exhibited strand bias in distribution (P < 0.01) & • Number of supporting reads mapped to either strand is <2.
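
A self-contained sketch of the strand filter, using a two-sided Fisher's exact test implemented with the standard library. The layout of the 2×2 table (variant-supporting reads versus other reads, by strand) is an assumption, since the slide does not spell it out, and the two discard conditions are combined with a logical AND, following the "&" on the slide.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are as likely as, or less likely than, the observed one.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def fails_strand_filter(var_plus, var_minus, other_plus, other_minus,
                        alpha=0.01, min_per_strand=2):
    """True if the site shows strand bias (P < alpha) and has fewer than
    min_per_strand supporting reads on either strand."""
    p_value = fisher_exact_two_sided(var_plus, var_minus, other_plus, other_minus)
    thin_strand = min(var_plus, var_minus) < min_per_strand
    return p_value < alpha and thin_strand
```

For example, a site with 10 variant reads on the + strand only and 10 other reads on the - strand only is discarded, while a site with 5 reads in every cell is kept.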

Pipeline | 7. BLAT filter. • Address the potential pitfall of paralogous sequences in site calling: • Use BLAT to search for SNVs’ supporting reads in the reference genome. • Same mismatch tolerance used in SOAP2 alignment. • Discard all supporting reads with >1 hit. • Filter SNVs that have <2 qualified supporting reads.
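
Once each supporting read has been searched with BLAT, the decision rule reduces to counting reads with a single hit. The sketch below does not run BLAT; it assumes a hypothetical mapping from read ID to the number of BLAT hits within the mismatch tolerance, and treats only reads with exactly one hit as qualified (an assumption for reads with no hit at all).

```python
def passes_blat_filter(blat_hits_per_read, min_qualified_reads=2):
    """Keep the SNV if enough supporting reads have a single BLAT hit.

    blat_hits_per_read: dict mapping read_id -> number of BLAT hits
    (hypothetical input; the BLAT search itself is not performed here).
    """
    qualified = [
        read_id for read_id, hits in blat_hits_per_read.items()
        if hits == 1  # reads with more than one hit are discarded
    ]
    return len(qualified) >= min_qualified_reads
```

A site supported by three reads, one of which hits two genomic locations, still passes with two qualified reads; if two of the three reads are multi-hit, the site is filtered out.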

Pipeline | 8. Known SNPs filter. • Eliminate germline variants: • Cross-reference the remaining SNVs against known SNP databases: • 1000 Genomes Project. • Genomes of Yoruba, Watson, Korean. • dbSNP (version 129).

Pipeline | 9. Multiple type of mismatches filter. • Discard SNV candidate sites with >1 nonreference type: • For example: • Reference allele – A • Nonreference alleles – G and T

Pipeline | 10. Editing degree filter. • Exclude polymorphic sites with an extreme degree of variation (100%). • Remaining sites → further analysis
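
Steps 9 and 10 both operate on the allele counts at a site, so they can be sketched together. The function names and the `allele_counts` dictionary are hypothetical, and the editing degree is assumed here to be the fraction of reads carrying the variant allele.

```python
def editing_degree(ref_allele, allele_counts):
    """Fraction of reads carrying a nonreference allele (assumed definition)."""
    total = sum(allele_counts.values())
    nonref = sum(c for a, c in allele_counts.items() if a != ref_allele)
    return nonref / total if total else 0.0


def passes_steps_9_and_10(ref_allele, allele_counts):
    """Step 9: exactly one nonreference allele type. Step 10: degree below 100%."""
    nonref_types = [
        a for a, c in allele_counts.items() if a != ref_allele and c > 0
    ]
    if len(nonref_types) != 1:
        return False  # no variant, or more than one nonreference type
    return editing_degree(ref_allele, allele_counts) < 1.0
```

A site with reference A and reads {A: 10, G: 5} passes; {A: 10, G: 5, T: 2} fails step 9, and {A: 0, G: 12} fails step 10.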

Pipeline | Analysis of the sequence and structural features of RNA editing. • To identify sites in dsRNA structure, or sites in 3′-UTR that are likely microRNA seed matches: • Li, J.B. et al. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324, 1210–1213 (2009).

Pipeline | Analysis of the sequence and structural features of RNA editing. • Editing sites clustering: • Defined as occurrence of ≥3 variants per 100bp. • Conserved region: • Annotated as ‘most conserved’ by the UCSC genome browser. • Coding sequence: • Defined by the RefSeq annotation. • Highly edited genes: • ≥10 variant sites per gene • Gene enrichment: • DAVID pathway-classification tool.
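
The clustering definition (≥3 variants per 100bp) can be checked with a sliding window over sorted positions. This is a sketch for a single chromosome; the function name is hypothetical, and treating "per 100bp" as a span of fewer than 100 bases between the first and last site of a window is an assumption.

```python
def clustered_sites(positions, min_sites=3, window=100):
    """Return the sorted positions that belong to at least one cluster.

    positions: editing-site coordinates on one chromosome.
    A cluster is min_sites or more sites whose first and last positions
    are less than `window` bases apart (assumed reading of the slide).
    """
    pos = sorted(set(positions))
    clustered = set()
    left = 0
    for right in range(len(pos)):
        # Shrink the window until it spans fewer than `window` bases.
        while pos[right] - pos[left] >= window:
            left += 1
        if right - left + 1 >= min_sites:
            clustered.update(pos[left:right + 1])
    return sorted(clustered)
```

For positions 100, 150, 190, 500, 1000 and 1040, only the first three fall within one 100-bp window and are reported as clustered.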

Pipeline | Identification of miRNA and editing (1). • Filtering of small RNA reads: • Filter out low-quality reads; • Trim 3′ adaptor sequence by a dynamic programming algorithm; • Remove adaptor contaminations formed by adaptor ligation; • Retain only short trimmed reads of sizes from 18 to 30 nt.

Pipeline | Identification of miRNA and editing (2). • Annotate and categorize small RNAs: • Filter out small RNA reads possibly from known noncoding RNAs: • rRNA, tRNA, snRNA and snoRNA deposited in the Rfam database and the NCBI Genbank. • Discard small RNA reads assigned to exonic regions. • Subject the remaining small RNA to MIREAP, which identifies miRNA candidates according to the canonical hairpin structure and sequencing data.

Pipeline | Identification of miRNA and editing (3). • Align identified miRNA reads to miRNA reference sequences: • BLAST, ≤1 mismatch. • Reads that were uniquely aligned and overlapped with known miRNAs were used to identify miRNA editing sites. • First, identify reads with mismatch to hg18 genome. • Reads with mismatch within 1 nt at 5′ end or 2 nt at 3′ end were discarded (?). • Then, identify miRNA edits by the following criteria: • Sequencing depth of editing sites ≥ 5; • Frequency of SNV occurrence ≥5% & ≤95%; • Variants that were not found in previous SNP annotations • YH, 1000 genomes project, Yoruba, Watson, Korean and dbSNP version 129.
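
The final miRNA edit criteria (depth, frequency range, absence from SNP annotations) can be expressed as one predicate. In this sketch the `known_snps` set stands in for the combined YH, 1000 Genomes, Yoruba, Watson, Korean and dbSNP annotations; the function name and argument layout are hypothetical.

```python
def is_mirna_edit(site, depth, variant_reads, known_snps,
                  min_depth=5, min_freq=0.05, max_freq=0.95):
    """Apply the slide's criteria to one candidate miRNA site.

    site: any hashable identifier, e.g. (chrom, pos) (hypothetical layout).
    known_snps: set of site identifiers found in previous SNP annotations.
    """
    if depth < min_depth:
        return False
    frequency = variant_reads / depth
    if not (min_freq <= frequency <= max_freq):
        return False
    return site not in known_snps
```

A site with depth 20 and 4 variant reads (20%) is called an edit unless it appears in `known_snps`; a site where all 20 reads carry the variant is rejected by the 95% upper bound.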

Results| Editing Events Identified • 22,688 RNA editing sites • Poly (A)+ • To ascertain the editing type for these sites, cross-reference against RefSeq. • ~30% of the identified sites: • Unannotated in the database (5,381). • Corresponded to overlapping transcript units on both strands (57). • 11,467 sites were unambiguously mapped to known gene models. • Poly (A)- • To identify editing sites in the intergenic regions of the transcriptome • 11,221 RNA editing sites identified.

Results| Editing Sites Distribution 50% leads to changes in coded amino acids.

Results | Editing Sites Characterization (figure panels: Poly(A)+ • Poly(A)- • Poly(A)+, CDS)

Results| Novel Editing Sites

Results| Frequency of nucleotides in the flanking sequences • Poly (A)+ • Poly (A)-

Results| % of Edits in Conserved Regions • Poly (A)+ • Poly (A)-

Results| Experimental Validation • Two replicates of PCR amplification and Sanger sequencing of both DNA and RNA from the same batch of cells from the YH cell line.

Results | Comparison with Other Datasets

Results | Genes with multiple editing sites.

Results | RNA editing and miRNA-mediated regulation • 2,474 editing sites in 3′-UTRs • Extract 6 + 1 + 6 bp sequence & search in miRBase.
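
Extracting the 6 + 1 + 6 bp window around an editing site is a simple slicing operation. The sketch below only builds the 13-nt reference and edited windows; the miRBase search is not shown, and the function names and the default edited base G (for an A→G site) are assumptions.

```python
def site_window(sequence, index, flank=6):
    """Return the flank + 1 + flank substring centred on a 0-based index,
    or None if the site is too close to either end of the sequence."""
    start, end = index - flank, index + flank + 1
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def edited_window(sequence, index, edited_base="G", flank=6):
    """Same window with the central base replaced by the edited base."""
    window = site_window(sequence, index, flank)
    if window is None:
        return None
    return window[:flank] + edited_base + window[flank + 1:]
```

For the sequence `ACGTACGTACGTACGTACGT` and index 8 (an A), `site_window` gives `GTACGTACGTACG` and `edited_window` gives `GTACGTGCGTACG`; both versions could then be compared against miRNA seed sequences.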

Summary • Pipeline for identifying RNA editing events by screening RNA-DNA differences in the same individual. • 10 filters to handle various sources of false positives. • Experimentally validated novel RNA editing sites. • Evidence of extensive RNA editing in a human cell line. • Question: since the model parameters were optimized using simulated data based on DARNED sites, why is there no significant overlap between the DARNED database and the editing sites BGI discovered?

RNA Editing

Overview • Literature Review • RNA Editing Concepts • Definition • Mechanisms • Functions • RNA Editing Site Prediction • Prediction Methods • Machine Learning Based Methods • Mapping Based Methods • Database

Literature Review • RNA Editing Concepts • RNA Editing Site Prediction

