Comprehensive Quality Control for Illumina Sequencing Data

Quality Control of Illumina Data Mick Watson Director of ARK-Genomics The Roslin Institute

Quality scores

Quality scores • The sequencer outputs base calls at each position of a read • It also outputs a quality value at each position • This relates to the probability that that base call is incorrect • The most common Quality value is the Sanger Q score, or Phred score • Qsanger -10 * log10(p) • Where p is the probability that the call is incorrect • If p = 0.05, there is a 5% chance, or 1 in 20 chance, it is incorrect • If p = 0.01, there is a 1% chance, or 1 in 100 chance, it is incorrect • If p = 0.001, there is a 0.1% chance, or 1 in 1000 chance, it is incorrect • Using the equation: • p=0.05, Qsanger= 13 • p=0.01, Qsanger= 20 • p=0.001, Qsanger= 30

For the geeks…. • In R, you can investigate this: sangerq<- function(x) {return(-10 * log10(x))} sangerq(0.05) sangerq(0.01) sangerq(0.001) plot(seq(0,1,by=0.00001),sangerq(seq(0,1,by=0.00001)), type="l")

The plot

For the geeks…. • And the other way round…. qtop<- function(x) {return(10^(x/-10))} qtop(30) qtop(20) qtop(13) plot(seq(40,1,by=-1), qtop(seq(40,1,by=-1)), type="l")

The important stuff • Q30 – 1 in 1000 chance base is incorrect • Q20 – 1 in 100 chance base is incorrect

Quality encoding

Quality Encoding • Bioinformaticians do not like to make your life easy! • Q scores of 20, 30 etc take two digits • Bioinformaticians would prefer they only took 1 • In computers, letters have a corresponding ASCII code: • Therefore, to save space, we convert the Q score (two digits) to a single letter using this scheme

The process in full • p(probability base is wrong) : 0.01 • Q (-10 * log10(p)) : 30 • Add 33 : 63 • Encode as character : ?

For the geeks…. code2Q <- function(x) { return(utf8ToInt(x)-33) } code2Q(".") code2Q("5") code2Q("?") code2P <- function(x) { return(10^((utf8ToInt(x)-33)/-10)) } code2P(".") code2P("5") code2P("?")

QC of Illumina data

FastQC • FastQC is a free piece of software • Written by Babraham Bioinformatics group • http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • Available on Linux, Windows etc • Command-line or GUI

Read the documentation Follow the course notes

Per sequence quality • One of the most important plots from FastQC • Plots a box at each position • The box shows the distribution of quality values at that position across all reads

Obvious problems

Less obvious problems

Really bad problems

Other useful plots • Per sequence N content • May identify cycles that are unreliable • Over-represented sequences • May identify Illumina adapters and primers

Comprehensive Quality Control for Illumina Sequencing Data