Next Gen Sequencing Data
• The FASTQ file format is the current standard for next generation sequencer output.
• This is the format for the Illumina Genome Analyzer.

FASTQ Format
• ASCII text
• No standard file extension, but .fq, .fastq, and .txt are commonly used
• 4 lines per sequence
• Line 1 begins with the @ character, followed by a sequence ID and an optional description
  • Similar to the > line in a FASTA file
• Line 2 is the sequence letters
• Line 3 begins with the + character, optionally followed by the same sequence ID and description
• Line 4 encodes quality values for the sequence letters in line 2
  • Must contain the same number of characters as the sequence in line 2

FASTQ example
    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

FASTQ format
• Line 2 (sequence characters) and Line 4 (quality value characters) may be wrapped (split over multiple lines).
• Wrapping is discouraged because it makes parsing more complicated
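
The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes unwrapped records (exactly four lines each), and the names FastqRecord and read_fastq, as well as the demo read, are made up for illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str        # text after '@' up to the first space
    description: str   # optional text after the ID (may be empty)
    sequence: str      # line 2
    quality: str       # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped (4 lines per record) FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' line, got {plus!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"{header}: sequence and quality lengths differ")
        seq_id, _, description = header[1:].partition(" ")
        yield FastqRecord(seq_id, description, sequence, quality)


if __name__ == "__main__":
    # Hypothetical two-record input, used only to exercise the reader.
    demo = io.StringIO("@read1 demo\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\nIIB#\n")
    for record in read_fastq(demo):
        print(record.seq_id, record.sequence, record.quality)
```

Because wrapped files break the "four lines per record" assumption, a reader like this would reject or misread them, which is one reason wrapping is discouraged.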

Illumina Sequence ID format
• Illumina uses a special format for its sequence IDs: @HWUSI-EAS100R:6:73:941:1973#0/1
  • HWUSI-EAS100R: the unique instrument name
  • 6: the flowcell lane
  • 73: tile number within the flowcell lane
  • 941: 'x'-coordinate of the cluster within the tile
  • 1973: 'y'-coordinate of the cluster within the tile
  • #0: index number for a multiplexed sample (0 for no indexing)
  • /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
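
The ID fields listed above can be split apart with a regular expression. The sketch below is illustrative only: the names IlluminaId and parse_illumina_id are hypothetical, and the index field is kept as text on the assumption that it may not always be a plain number.

```python
import re
from typing import NamedTuple, Optional


class IlluminaId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str                  # kept as text (assumption: may be non-numeric)
    pair_member: Optional[int]  # 1 or 2, or None when there is no /1 or /2 suffix


# instrument:lane:tile:x:y#index[/pair], with an optional leading '@'
_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_id(header: str) -> IlluminaId:
    """Split an Illumina-style sequence ID into its named fields."""
    match = _ID_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"not an Illumina-style sequence ID: {header!r}")
    pair = match.group("pair")
    return IlluminaId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair is not None else None,
    )


if __name__ == "__main__":
    print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```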

Quality Values
• Quality values are encoded differently on different next gen sequencers.
• Illumina uses pipeline software originally developed by Solexa to determine quality values for each base in a sequence.
• Older versions of the Illumina pipeline software used different equations to calculate quality.
• Current software (v. 1.5) uses the Phred (i.e. Sanger) scoring scheme
• The Phred score of a base is Qphred = -10 * log10(e)
  • e is the estimated probability of the base being wrong
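
To make the Phred formula concrete: an error probability of 1 in 10 gives Q = 10, 1 in 100 gives Q = 20, and 1 in 1000 gives Q = 30. The short sketch below converts in both directions; the function names are chosen here for illustration.

```python
import math


def error_prob_to_phred(e: float) -> float:
    """Qphred = -10 * log10(e), where e is the estimated error probability."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in the range (0, 1]")
    return -10 * math.log10(e)


def phred_to_error_prob(q: float) -> float:
    """Inverse of the formula above: e = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for e in (0.1, 0.01, 0.001):
        print(e, round(error_prob_to_phred(e), 3))
```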

Encoding Phred Quality Scores
• Phred scores are presented on Line 4 of a FASTQ file, one ASCII character per base: !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
  • This example string is Sanger-encoded (Phred score plus 33), which is why it contains characters below ASCII 64, such as !
• In Illumina 1.3+ files, each character is the ASCII character whose code is the Phred quality score plus 64
• In version 1.3 of the Illumina software, the Phred values 0-62 can be encoded as ASCII 64-126
  • Values greater than 40 are not expected in raw read data
• In the newest version of the Illumina pipeline software, Phred scores 0 and 1 are no longer used. A Phred score of 2 (ASCII 66, 'B') is now only used at the end of a read.
  • Phred 2 is now a read segment quality control indicator
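
Decoding a quality line is a matter of subtracting the right offset from each character code: 64 for Illumina 1.3+ files, 33 for Sanger files. The sketch below shows this, plus one possible way of dropping a trailing run of 'B' characters. The function names are hypothetical, and trimming the 'B' tail is an assumption about how one might choose to handle the indicator, not something the pipeline requires.

```python
from typing import List, Tuple

SANGER_OFFSET = 33      # Sanger FASTQ: character code = Phred + 33
ILLUMINA13_OFFSET = 64  # Illumina 1.3+: character code = Phred + 64


def decode_quality(quality: str, offset: int) -> List[int]:
    """Turn a quality string into a list of Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was assumed.
        raise ValueError(f"character below ASCII {offset} in quality string")
    return scores


def trim_b_tail(sequence: str, quality: str) -> Tuple[str, str]:
    """Drop the trailing run of 'B' (Phred 2) from an Illumina 1.5+ read.

    Assumption: the quality string uses offset 64, so 'B' means Phred 2.
    """
    keep = len(quality.rstrip("B"))
    return sequence[:keep], quality[:keep]


if __name__ == "__main__":
    print(decode_quality("hhhB", ILLUMINA13_OFFSET))
    print(trim_b_tail("ACGTAC", "hhhhBB"))
```

Decoding a Sanger-encoded string with offset 64 raises an error here as soon as a character below ASCII 64 appears, which is a cheap check against mixing up the two encodings.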

Format Conversion
• Because different versions of the Illumina pipeline software use different quality value schemes, file conversion software is sometimes necessary.
• There are many conversion options. BioPerl v. 1.6.1+ can convert Sanger (Phred), Solexa (Illumina 1.0), and Illumina 1.3+ files.
• Other options are Biopython, EMBOSS, BioRuby, and MAQ
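
For a sense of what these converters do internally, here is a hedged pure-Python sketch. It is not how any of the tools above are implemented. It re-encodes an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33), and maps an older Solexa score to a Phred score using the relation Qphred = 10 * log10(10^(Qsolexa/10) + 1) from the NAR paper listed in the references. The function names are made up for illustration.

```python
import math


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string (offset 64) as Sanger (offset 33)."""
    converted = []
    for ch in quality:
        phred = ord(ch) - 64
        if phred < 0:
            raise ValueError(f"{ch!r} is not valid in an Illumina 1.3+ quality string")
        converted.append(chr(phred + 33))
    return "".join(converted)


def solexa_to_phred(q_solexa: int) -> int:
    """Map a Solexa score to the nearest Phred score.

    Solexa scores use a different equation than Phred scores, so shifting
    the offset alone is not enough for Solexa (Illumina 1.0) files.
    """
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


if __name__ == "__main__":
    print(illumina13_to_sanger("hhhB"))
    print([solexa_to_phred(q) for q in (-5, 0, 10, 40)])
```

In practice an existing library is the safer choice. Biopython, for example, provides Bio.SeqIO.convert, which accepts the format names "fastq-sanger", "fastq-solexa", and "fastq-illumina".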

Storage Requirements
• Large sequencing centers have terabytes and sometimes petabytes of sequence data that must be analyzed.
• This data is in ASCII or other plain text formats
• A new encoding method called G-SQZ (Genomic SQueeZ) has recently been invented.
• This method can reduce the size of sequence data by as much as 80%
• G-SQZ can encode ACGT frequencies, annotation information, data quality, and erroneous entries (unidentified bases)
• G-SQZ allows data access at regular intervals, such as every millionth base
  • The data does not all have to be decoded from the start
  • Multiple computer processes could decode and process different chunks of the data simultaneously
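
The idea of access at regular intervals can be illustrated without G-SQZ itself. The sketch below is not the G-SQZ algorithm: it uses Python's standard zlib module as a stand-in compressor and simply compresses fixed-size chunks independently, so that any chunk can be decoded without touching the ones before it. The function names and the chunk size are arbitrary choices for the example.

```python
import zlib
from typing import List


def compress_in_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Compress each fixed-size chunk on its own so it can be decoded alone."""
    return [
        zlib.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def read_range(chunks: List[bytes], chunk_size: int, start: int, length: int) -> bytes:
    """Return data[start:start + length], decoding only the chunks that overlap it."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    decoded = b"".join(zlib.decompress(chunk) for chunk in chunks[first:last + 1])
    local_start = start - first * chunk_size
    return decoded[local_start:local_start + length]


if __name__ == "__main__":
    bases = b"ACGT" * 1000
    chunks = compress_in_chunks(bases, chunk_size=256)
    # Only the chunk(s) covering positions 1000-1009 are decompressed.
    print(read_range(chunks, 256, start=1000, length=10))
```

Because every chunk is self-contained, separate processes could each take a slice of the chunk list and decode it in parallel, which is the property the last two bullets describe.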

References
• New Technology Reduces Storage Needs and Costs for Genomic Data. http://www.sciencedaily.com/releases/2010/07/100706150614.htm
• The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. http://nar.oxfordjournals.org/content/38/6/1767