"Reference-Based Compression: Maximizing Efficiency in NGS Data Storage"

CRAM: reference-based compression format developed by Vadim Zalunin

Data horror. EMBL-EBI: 10 petabytes. SRA: ~1 petabyte. Over 2 million DVDs, or 2.5 km. Complete Genomics: 0.5 TB for a single file.

The need for compression: red alert.

Compression, what is it? BMP, 190 kb. PNG, 100 kb (lossless). JPG, 21 kb and JPG, 4 kb (lossy).

Compression, when we know what to expect. BMP, 145 kb. PNG, 2 kb (lossless). JPG, 6 kb and JPG, 3 kb (lossy). But the actual message is only 40 characters (bytes) long!

Compression at its best. "Five little ducks went swimming one day". IMAGE, 145 kb → compress → TEXT, 40 b → uncompress → IMAGE, 145 kb. ~3500 times more efficient.
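The ratio quoted on this slide can be rechecked from the two sizes it gives. This is a small sketch that assumes "kb" means 1000 bytes and takes the message length as the 40 bytes stated on the slide; the variable names are illustrative only.

```python
# Sizes as quoted on the slide ("kb" assumed to mean 1000 bytes).
image_bytes = 145 * 1000
text_bytes = 40

# How many times smaller the text is than the image.
ratio = image_bytes / text_bytes
print(f"{ratio:.0f}x smaller")
```

The result is a few thousand, the same order of magnitude as the rounded figure on the slide.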

What are we talking about? Bug → sample (the bug’s DNA is hidden somewhere) → sequencing machines → bunch of huge files.

Looking closer at the data. The bunch of huge files boils down to a long list of reads: read 1, read 2, read 3, …, read bazillion. Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

What is a Read? An excerpt from a FASTQ file:

@SRR081241.20758946
CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…
+
IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…

The first line is the read name, the second line holds the read bases (ACGTN), and the fourth line holds the read quality scores, from ‘!’ (ASCII 33) to ‘~’ (ASCII 126).
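The four-line layout of a FASTQ record can be sketched in code. The following is a minimal illustration, not a full FASTQ parser: the names `FastqRecord` and `parse_fastq_record` are hypothetical, and the example uses only the visible (truncated) part of the read shown on the slide.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    name: str       # read name, without the leading '@'
    bases: str      # read bases (A, C, G, T, N)
    qualities: str  # one ASCII quality character per base


def parse_fastq_record(lines):
    """Parse exactly four lines (name, bases, '+', qualities) into a record."""
    header, bases, separator, qualities = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("FASTQ separator line must start with '+'")
    if len(bases) != len(qualities):
        raise ValueError("bases and quality scores must have the same length")
    return FastqRecord(header[1:], bases, qualities)


# Only the visible prefix of the slide's read; the rest is elided on the slide.
record = parse_fastq_record([
    "@SRR081241.20758946\n",
    "CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG\n",
    "+\n",
    "IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI\n",
])
print(record.name, len(record.bases))
```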

What is a quality score? The quality score is a Phred quality score encoded as ASCII symbols 33-126. Basically: higher scores are better, so ‘!’ is bad, ‘I’ is good.
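To make the encoding concrete, the sketch below decodes quality characters into numeric Phred scores and error probabilities. It assumes the offset of 33 implied by the ASCII range above and the standard Phred definition Q = -10·log10(P); the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Convert ASCII quality characters to numeric Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


for ch in "!I":
    q = phred_scores(ch)[0]
    print(ch, q, error_probability(q))
```

With this offset, ‘!’ decodes to a score of 0 (the call is as likely wrong as not informative) and ‘I’ decodes to 40, an estimated error rate of 1 in 10,000.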

Reference-based encoding. Each read is aligned to the reference and described by its read start position and read end position, plus any mismatching bases.
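The idea can be sketched in a few lines. This is a toy illustration under strong assumptions, not the CRAM on-disk layout: it handles only ungapped alignments (no insertions, deletions or clipping), uses 0-based positions, stores a length in place of the end position, and all names and sequences are made up.

```python
def encode_read(reference, read, start):
    """Describe a read as (start, length, mismatches) relative to the reference.

    mismatches is a list of (offset_in_read, read_base) pairs.
    """
    mismatches = [
        (offset, base)
        for offset, base in enumerate(read)
        if reference[start + offset] != base
    ]
    return start, len(read), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read from the reference plus the stored mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"  # hypothetical reference sequence
read = "GTAAGT"             # hypothetical read with one mismatch
encoded = encode_read(reference, read, start=2)
print(encoded)
print(decode_read(reference, *encoded) == read)
```

A read that matches the reference exactly needs only its position and length, which is where the savings come from.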

Lossy quality scores. Approach 1 (horizontal): quality scores are usually values from 0 to 39. Let’s shrink them, so that they are from 0 to 7 now. Approach 2 (vertical): let’s treat quality scores using alignment information. For example: preserve only quality scores for mismatching bases.
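Both approaches can be sketched briefly. The equal-width binning used for Approach 1 is an assumption (the slide does not say how the 0-39 range is mapped to 0-7), and the function names and sample values are hypothetical.

```python
def bin_quality(q, max_q=39, bins=8):
    """Approach 1: map a score in 0..max_q onto 0..bins-1 with equal-width bins."""
    q = max(0, min(q, max_q))  # clamp scores outside the expected range
    return q * bins // (max_q + 1)


def keep_mismatch_qualities(qualities, mismatch_offsets):
    """Approach 2: keep scores only at positions where the read mismatches."""
    return {offset: qualities[offset] for offset in sorted(mismatch_offsets)}


qualities = [40, 35, 36, 12, 38, 39]  # made-up scores for a six-base read
print([bin_quality(q) for q in qualities])
print(keep_mismatch_qualities(qualities, mismatch_offsets=[3]))
```

Eight bins fit in 3 bits per score, and Approach 2 stores nothing at all for bases that agree with the reference.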

Comparison study: 1K Genomes exomes. A BAM file is compressed to CRAM and uncompressed back to BAM. The original BAM and the restored BAM each go through some analysis pipeline, and the original SNPs are compared with the restored SNPs.
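One simple way to compare the two sets of SNP calls is sketched below. The representation of a call as a (chromosome, position, reference allele, alternate allele) tuple, the concordance measure (shared calls divided by all distinct calls), and the sample data are all assumptions made for illustration; the slides do not say how the study scored the comparison.

```python
def compare_snp_calls(original_snps, restored_snps):
    """Summarise agreement between two collections of SNP calls."""
    original = set(original_snps)
    restored = set(restored_snps)
    shared = original & restored
    union = original | restored
    return {
        "shared": len(shared),                   # called in both
        "lost": len(original - restored),        # only in the original
        "gained": len(restored - original),      # only after the round trip
        "concordance": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical calls, made up for illustration only.
original = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "A")]
restored = [("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A"), ("chr2", 900, "T", "C")]
summary = compare_snp_calls(original, restored)
print(summary)
```

With lossless compression the two sets should be identical; the comparison becomes informative once quality scores are treated lossily.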

CRAM NGS data compression. [Chart: bits/base, from bad to good, for untreated data (do nothing), CRAM lossless (lossless), and CRAM lossy and CRAM very lossy (lossy).]

Progressive application of compression. [Chart: sample accessibility (hard to easy) against sample value (low to high), with regions labelled lossless, 2-fold, 20-fold and 200-fold.]

References
More information:
• http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list:
• http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
• Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40.
• Cochrane, G., Cook, C.E. and Birney, E. (2012) The future of DNA sequence archiving. Gigascience 1.

"Reference-Based Compression: Maximizing Efficiency in NGS Data Storage"