Spliced Transcripts Alignment & Reconstruction

STAR: Spliced Transcripts Alignment & Reconstruction
Alexander Dobin, Philippe Batut, Sudipto Chakrabortty, Carrie Davis, Delphine Fagegaltier, Sonali Jha, Wei Lin, Felix Schlesinger, Chenghai Xue, Christopher Zaleski, Thomas Gingeras
CSHL

STAR: spliced transcript alignment and reconstruction
• 'Ab initio' detection of splice junctions
  • un-annotated, non-canonical, distal exons, chimeric ...
• Any read length, any number of SJs per read
• Any (reasonable) number of mismatches and indels
• Unique and all multiple mappers
• Alignment scoring utilizing read quality scores
• "Auto" trimming of poor quality ends
• Detection of non-templated poly-A tails
• Very fast: human 75-mer reads: 60 million reads per hour
• Memory: RAM ~9*(genome length) bytes: 25 GB for human

II. Algorithm

Maximum mappable length
• Typical short read aligner: does the read map entirely, i.e. at full length?
• What is the maximum mappable length?
  • can detect many mismatches
  • can precisely "trim" poor quality tails
  • can detect splice junctions
• With suffix arrays we find maximum mappable length in no extra time
[Figure labels: Map, Extend, Map, Map, Map again]

Scoring with quality scores
• Similar to local alignment scoring, but penalties have probabilistic meaning
• Illumina quality score: [formula not preserved]
  • +QS for matches; -QS for mismatches
  • Penalty for gap opening: [formula not preserved]
  • Total score: [formula not preserved]
• A more elaborate iterative penalty system is being developed
  • gap penalty is calculated from mapped gap length distribution
  • mismatch penalties vs QS scores are re-calibrated after mapping
• Choose the alignment(s) with highest score
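
To make the scoring rule concrete, here is a minimal sketch of "+QS for matches; -QS for mismatches" with a gap-opening term. The slide's actual gap formula did not survive, so `gap_open_penalty` is a placeholder constant chosen for illustration, and the function names are hypothetical, not STAR's.

```python
def alignment_score(read, ref, quals, n_gap_opens=0, gap_open_penalty=20):
    """Sum +QS for matching bases and -QS for mismatches, minus gap penalties.

    gap_open_penalty is a placeholder constant (assumption), not the
    formula from the presentation.
    """
    score = 0
    for read_base, ref_base, qs in zip(read, ref, quals):
        score += qs if read_base == ref_base else -qs
    return score - n_gap_opens * gap_open_penalty


def best_alignments(candidates):
    """Keep every candidate (score, name) that shares the highest score."""
    top = max(score for score, _ in candidates)
    return [name for score, name in candidates if score == top]


quals = [30, 30, 10, 30]
# One low-quality mismatch: 30 + 30 - 10 + 30
ungapped = alignment_score("ACGT", "ACCT", quals)
# Perfect match but one gap opened: 100 - 20
gapped = alignment_score("ACGT", "ACGT", quals, n_gap_opens=1)
winners = best_alignments([(ungapped, "ungapped"), (gapped, "gapped")])
```

A mismatch at a low-quality base costs little, so in this toy case both candidates tie and both are kept, which matches the "alignment(s)" wording on the slide.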

STAR alignment algorithm
• Split each read into "good" pieces by quality scores
• Map good pieces using suffix arrays
• Stitch and extend mapped pieces
• Score and select the best alignment

Splitting the reads
• Split the read at poor quality bases (QS<15), 'N'
• Map each good piece separately
• Recover mismatches caused by poor SNR
• Avoid erroneous mapping caused by sequencing errors:
  • just 1 SNP can cause mis-mapping from paralog to paralog
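
The splitting step can be sketched as follows. This is an illustrative reading of the slide, assuming Phred-style integer quality scores and the QS<15 threshold it quotes; `split_read` is a hypothetical helper name.

```python
def split_read(seq, quals, min_qs=15):
    """Return (start, piece) tuples for maximal runs of good bases.

    A base is "good" if its quality is at least min_qs and it is not 'N'.
    """
    pieces = []
    start = None  # start index of the good run currently being extended
    for i, (base, qs) in enumerate(zip(seq, quals)):
        good = qs >= min_qs and base != "N"
        if good and start is None:
            start = i
        elif not good and start is not None:
            pieces.append((start, seq[start:i]))
            start = None
    if start is not None:
        pieces.append((start, seq[start:]))
    return pieces


read = "ACGTNACGTAC"
quals = [30, 30, 30, 30, 2, 30, 30, 10, 30, 30, 30]
pieces = split_read(read, quals)
# pieces == [(0, "ACGT"), (5, "AC"), (8, "TAC")]
```

Keeping the start offset of each piece lets later steps place the mapped pieces back in read coordinates.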

Suffix array based search
• For each good piece
  • find maximum exactly mappable length (could be a multiple mapper)
  • if a long portion of the good piece is still unmapped - repeat
  • repeat this procedure backwards (from 3' to 5' of a good piece)
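
A toy sketch of the forward search described above. It builds a naive suffix array (fine for a short example string, not for a real genome) and narrows the matching interval one base at a time, so the maximum mappable length falls out of the same search. All function names are hypothetical; the backward pass is not shown and could be obtained by running the same routine on the reversed piece and reversed genome.

```python
def build_suffix_array(genome):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def max_mappable_prefix(genome, sa, query):
    """Return (length, positions) of the longest query prefix found in genome."""
    lo, hi = 0, len(sa)  # suffixes in sa[lo:hi] all start with query[:depth]
    depth = 0

    def char_at(k, offset):
        pos = sa[k] + offset
        return genome[pos] if pos < len(genome) else ""  # "" sorts first

    while depth < len(query):
        c = query[depth]
        a, b = lo, hi  # lower bound: first suffix whose next char is >= c
        while a < b:
            m = (a + b) // 2
            if char_at(m, depth) < c:
                a = m + 1
            else:
                b = m
        new_lo, b = a, hi  # upper bound: first suffix whose next char is > c
        while a < b:
            m = (a + b) // 2
            if char_at(m, depth) <= c:
                a = m + 1
            else:
                b = m
        if new_lo == a:  # no suffix extends the match by this base
            break
        lo, hi = new_lo, a
        depth += 1
    return depth, (sorted(sa[lo:hi]) if depth else [])


def map_piece(genome, sa, piece):
    """Repeat the search on the unmapped remainder until the piece is consumed."""
    seeds, offset = [], 0
    while offset < len(piece):
        length, positions = max_mappable_prefix(genome, sa, piece[offset:])
        if length == 0:
            offset += 1  # base occurs nowhere in the genome; skip it
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Two "exons" separated by an 8-base "intron".
genome = "TT" + "GATTACA" + "TTTTTTTT" + "CCGGAG" + "TT"
sa = build_suffix_array(genome)
seeds = map_piece(genome, sa, "GATTACA" + "CCGGAG")
# seeds == [(0, 7, [2]), (7, 6, [17])]
```

The first seed stops exactly where the read leaves the first exon, which is how a maximum-mappable-length search can expose a splice junction without prior annotation.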

Stitch and extend mapped pieces
• Each uniquely mapped piece originates an alignment window (cluster)
• Collect all mapped pieces within an alignment window (e.g. 200kb)
• Consider all collinear combinations of mapped pieces
• Choose the combination with the highest score for each cluster
• Choose the alignment cluster with the highest score
[Figure labels: Stitch, Extend, Extend]
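
One way to picture the stitching step is as a chaining problem. The sketch below is an assumption-laden simplification: pieces are `(read_start, genome_start, length)` tuples, the score of a combination is simply the total mapped length (standing in for the quality-based score), and dynamic programming is used instead of literally enumerating every collinear combination. Function names are hypothetical.

```python
def pieces_in_window(pieces, anchor_genome_start, window=200_000):
    """Collect pieces whose genome start lies within the anchor's window."""
    return [p for p in pieces if abs(p[1] - anchor_genome_start) <= window]


def best_collinear_chain(pieces):
    """Return (score, chain) for the highest-scoring collinear combination.

    Two pieces are collinear if the later one starts at or after the end
    of the earlier one in both read and genome coordinates.
    """
    pieces = sorted(pieces)
    best = []  # best[i] = (score, chain) of the best chain ending at pieces[i]
    for i, (rs, gs, length) in enumerate(pieces):
        top = (length, [pieces[i]])
        for j in range(i):
            prs, pgs, plen = pieces[j]
            if prs + plen <= rs and pgs + plen <= gs:
                candidate = (best[j][0] + length, best[j][1] + [pieces[i]])
                if candidate[0] > top[0]:
                    top = candidate
        best.append(top)
    return max(best, key=lambda t: t[0]) if best else (0, [])


mapped = [
    (0, 1000, 30),     # anchor piece
    (30, 5000, 25),    # downstream exon
    (55, 5100, 20),    # further downstream
    (30, 900, 25),     # upstream on the genome: not collinear with the anchor
    (0, 900_000, 30),  # outside the 200 kb window
]
window_pieces = pieces_in_window(mapped, anchor_genome_start=1000)
score, chain = best_collinear_chain(window_pieces)
# score == 75; chain == [(0, 1000, 30), (30, 5000, 25), (55, 5100, 20)]
```

Running this per window and then taking the highest-scoring window corresponds to the last two bullets of the slide.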

III. Application

Comparison with exhaustive search
Fly embryo 76mer RNA seq
1 Illumina lane: 8,930,945 total reads, good quality
Multiple mappers by exhaustive search, <0.002% of all reads
STAR maps 99.8% of all exhaustively mapped reads
poor quality reads which did not have a single unique "anchor"

Reads mapped by STAR
[Chart labels, in extracted order]
1.5% multi-mappers
8.5% STAR splice junctions
1.8% not mapped by STAR
0.2% STAR InDels gap < 20b
11% STAR >2MM or shorter length
77% STAR overlap with exhaustive search

STAR alignments
~1,000,000 alignments found by STAR and not by exhaustive search
[Figure: Distribution of mapped lengths; mean length = 72]
[Figure: Distribution of mismatches; labels: spliced portions, poor quality tails]

Benchmarks
Single-thread benchmarks, 75-mer reads
Bowtie (-v2 -k1) only reports non-spliced alignments with 0-2 MM, 1 or 2 alignments per read
BLAT and STAR report >2MM and spliced alignments, and all the multiple alignments
[Figure: millions of reads aligned per hour]

Human K562/GM: 2x75

Splice junctions
Total # of Gencode junctions: 284k
[Figure: number of junctions vs. minimum number of reads per junction; series: Canonical Annotated, Canonical Un-Annotated, Non-Canonical Un-Annotated]

Transcript assembly algorithm
• Use contigs and splice junctions only
• Find all possible collinear maximally extended transcripts
  • by following all possible paths
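
"Following all possible paths" can be illustrated with a small graph walk. This sketch assumes contigs are nodes and each splice junction is a directed edge from an upstream contig to a downstream one (so the graph is acyclic); a transcript is "maximally extended" when it starts at a contig with no incoming junction and ends at one with no outgoing junction. The representation and the name `assemble_transcripts` are hypothetical.

```python
def assemble_transcripts(contigs, junctions):
    """Enumerate every maximal path through contigs linked by junctions.

    contigs: contig names in genomic order.
    junctions: (upstream_contig, downstream_contig) pairs; assumed acyclic.
    """
    successors = {c: [] for c in contigs}
    has_incoming = set()
    for upstream, downstream in junctions:
        successors[upstream].append(downstream)
        has_incoming.add(downstream)

    transcripts = []

    def extend(path):
        following = successors[path[-1]]
        if not following:  # nothing downstream: the path is maximal
            transcripts.append(path)
            return
        for contig in following:
            extend(path + [contig])

    for contig in contigs:
        if contig not in has_incoming:  # only start from leftmost contigs
            extend([contig])
    return transcripts


contigs = ["E1", "E2", "E3", "E4"]
junctions = [("E1", "E2"), ("E1", "E3"), ("E2", "E4"), ("E3", "E4")]
transcripts = assemble_transcripts(contigs, junctions)
# transcripts == [["E1", "E2", "E4"], ["E1", "E3", "E4"]]
```

The number of paths can grow quickly with alternative junctions, which is the cost of enumerating every possibility rather than selecting a subset.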

Examples of transcripts
[Figure: STAR transcripts]

Summary
• STAR: ab initio splice junction detection
• Maximum mappable length search with suffix arrays
• Alignment scoring uses quality scores of the reads
• Very fast: 60M/hour for 75-mer reads in human, requires large amount of RAM (~25GB for human)
• The code will be beta-released in November '09
dobin@cshl.edu


Chimeric stitching
[Figure labels: READ, Best Mapped Cluster, Another Mapped Cluster, chr1, chr2]
If the Best Mapped Cluster leaves enough un-mapped read space, try to stitch other clusters that cover the unmapped space
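
The chimeric rule above can be sketched as an interval check in read coordinates. The cluster representation (dicts with `chrom`, `read_start`, `read_end`, end exclusive) and the `min_unmapped` threshold of 20 bases are assumptions made for illustration; the slide does not say how much unmapped space counts as "enough".

```python
def chimeric_partners(read_len, best, others, min_unmapped=20):
    """Return clusters that cover read space the best cluster left unmapped."""
    # Unmapped stretches before and after the best cluster, if long enough.
    gaps = [(0, best["read_start"]), (best["read_end"], read_len)]
    gaps = [(a, b) for a, b in gaps if b - a >= min_unmapped]

    partners = []
    for cluster in others:
        for a, b in gaps:
            overlap = min(b, cluster["read_end"]) - max(a, cluster["read_start"])
            if overlap >= min_unmapped:
                partners.append(cluster)
                break
    return partners


best = {"chrom": "chr1", "read_start": 0, "read_end": 40}
others = [
    {"chrom": "chr2", "read_start": 40, "read_end": 75},  # fills the 3' gap
    {"chrom": "chr3", "read_start": 5, "read_end": 35},   # overlaps the best cluster
]
partners = chimeric_partners(75, best, others)
# partners == [{"chrom": "chr2", "read_start": 40, "read_end": 75}]
```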
