Formats and standards for sequencing data Matúš Kalaš INF389, CBU, BCCS/UiB, Bergen Nov 12, 2010
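Overview:
• Sequencing platforms: 454, Solexa/Illumina, SOLiD, …
• Read aligners: SHRiMP, Maq, BWA, Bowtie, RMAP, Eland, SOAP, SOAP2, MOSAIK, SOCS, PatMaN, ZOOM, PASS, PerM, RazerS, segemehl, MPSCAN, BFAST, Lastz, BLAT
• Assemblers: Celera, Newbler, Velvet, Euler, SOAPdenovo, …
• Assembled sequence: Genome, Metagenome
• Results: Gene annotation, Gene expression, Binding sites, Variation, …
• Public databases: GenBank, EMBL, DDBJ, Genome Catalogue, SNPdb, …
• Read archives: NCBI SRA, EMBL-EBI ENA
• Your databases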
454 output formats • .sff • .fna • .qual
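The .fna and .qual files pair up read by read: .fna holds the bases in FASTA style, and .qual holds one integer quality score per base under the same read name. A minimal sketch of merging the two into Sanger FASTQ follows, assuming the scores are Phred-scaled integers. The function names and the example read are hypothetical and not part of any 454 tool.

```python
def parse_fasta_like(text):
    """Yield (read_id, body_lines) for each '>' record in FASTA-style text."""
    read_id, body = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, body
            # Keep only the first token of the header as the read name.
            read_id, body = line[1:].split()[0], []
        else:
            body.append(line)
    if read_id is not None:
        yield read_id, body


def fna_qual_to_fastq(fna_text, qual_text):
    """Combine .fna bases and .qual scores into Sanger FASTQ text."""
    quals = {
        rid: [int(q) for q in " ".join(body).split()]
        for rid, body in parse_fasta_like(qual_text)
    }
    records = []
    for rid, body in parse_fasta_like(fna_text):
        seq = "".join(body)
        scores = quals[rid]
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for read {rid}")
        # Sanger FASTQ stores each Phred score as the character chr(score + 33).
        qual_str = "".join(chr(q + 33) for q in scores)
        records.append(f"@{rid}\n{seq}\n+\n{qual_str}\n")
    return "".join(records)


# Hypothetical example data (not real 454 output).
example_fna = ">read1 length=4\nACGT\n"
example_qual = ">read1 length=4\n40 30 20 10\n"
fastq = fna_qual_to_fastq(example_fna, example_qual)
```

The .sff file is binary and is not handled by this sketch.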
Illumina output formats • .seq.txt • .prb.txt • Illumina FASTQ (ASCII – 64 is Illumina score) • Qseq (ASCII – 64 is Phred score) • Illumina single line format • SCARF
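The "ASCII – 64" rule means that each quality character encodes a score equal to its ASCII code minus 64. The sketch below decodes such a string and re-encodes Phred+64 qualities as Sanger (Phred+33) characters. It assumes that "Illumina score" refers to the older Solexa-scaled score, Q = -10 log10(p / (1 - p)), which converts to Phred as 10 log10(10^(Q/10) + 1). The function names are hypothetical.

```python
import math


def decode_offset64(quality_string):
    """Return the integer scores of an offset-64 quality string."""
    return [ord(c) - 64 for c in quality_string]


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to a Phred score (assumed definition above)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred64_to_sanger(quality_string):
    """Re-encode Phred+64 characters as Sanger Phred+33 characters."""
    return "".join(chr(ord(c) - 64 + 33) for c in quality_string)


# Hypothetical quality string: 'h' = 104, 'Z' = 90, 'B' = 66.
scores = decode_offset64("hhZB")
sanger = phred64_to_sanger("hhZB")
```

The two scales nearly coincide for high-quality bases and differ mostly at the low end, so the conversion only matters much for poor base calls.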
SOLiD output format(s) • CSFASTA
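A CSFASTA read is written in colour space: a leading primer base followed by digits 0 to 3, where each digit encodes the transition between two adjacent bases (0 = same base, 1 = A↔C or G↔T, 2 = A↔G or C↔T, 3 = A↔T or C↔G). A minimal sketch of translating such a read to bases follows. The function name and example reads are hypothetical, and missing colour calls (written as ".") are rejected rather than handled.

```python
BASES = "ACGT"  # With A=0, C=1, G=2, T=3, a colour is the XOR of two base codes.


def decode_colorspace(cs_read):
    """Translate a colour-space read (primer base + digits) into bases."""
    primer, colors = cs_read[0], cs_read[1:]
    prev = BASES.index(primer)
    decoded = []
    for c in colors:
        if c not in "0123":
            raise ValueError(f"unsupported colour call: {c!r}")
        prev ^= int(c)  # apply the transition to get the next base code
        decoded.append(BASES[prev])
    # The primer base belongs to the adapter, so it is not part of the read.
    return "".join(decoded)


decoded_read = decode_colorspace("T0120")
```

Because every base depends on all preceding colours, one wrong colour call corrupts the rest of a naively decoded read, which is why colour-space reads are usually aligned in colour space rather than translated first.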
Real (“standard”) FASTQ = Sanger FASTQ (ASCII – 33 is Phred score)
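In Sanger FASTQ, each quality character maps to a Phred score Q = ASCII code - 33, and Q relates to the base-call error probability p by Q = -10 log10(p). A minimal reader sketch follows, assuming plain four-line records (no wrapped sequence or quality lines). The function names and the example read are hypothetical.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        qual = next(it).rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def error_probability(phred):
    """Error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Hypothetical example record.
example = ["@read1\n", "ACGT\n", "+\n", "I?5+\n"]
records = list(read_fastq(example))
```

The same generator works on an open file handle, since iterating over a text file yields its lines.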
Example of dealing with diverse read formats: • in Galaxy (http://usegalaxy.org)
If reads should be deposited in a public repository: SRA (Short Read Archive) at NCBI, ENA at EMBL-EBI • SRA format (XML) • SRF format • Or should they be deleted?
Common (“standard”) format for read alignments: • SAM • BAM (= binary SAM)
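A SAM alignment line is tab-separated text with 11 mandatory fields followed by optional TAG:TYPE:VALUE fields, and the FLAG field is a bit mask. A minimal parsing sketch follows, using the field names of the current SAM specification (older versions call RNEXT, PNEXT and TLEN by the names MRNM, MPOS and ISIZE). The record type, helper names and example line are hypothetical. Header lines starting with "@" are not handled.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory and optional fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),  # 1-based leftmost mapping position
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],  # Phred+33, as in Sanger FASTQ
        tags=fields[11:],
    )


def is_unmapped(record):
    """FLAG bit 0x4: the read is unmapped."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    """FLAG bit 0x10: the read maps to the reverse strand."""
    return bool(record.flag & 0x10)


# Hypothetical alignment line.
example = "read1\t16\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tI?5+\tNM:i:0"
rec = parse_sam_line(example)
```

BAM stores the same records in compressed binary form, so it needs a dedicated library rather than line splitting.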
Some common formats for results (Genome/Gene annotation): • BED format (genome-browser tracks) • GFF format (gene/genome features) • BioXSD (XML) (any annotation; under development)
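A practical difference between the two tab-separated formats is the coordinate system: BED uses 0-based, half-open intervals, while GFF uses 1-based, inclusive ones. A minimal sketch of converting one BED line to a GFF line follows. The function name, the default `source` and `feature_type` values, and the example feature are hypothetical, and GFF3-style `Name=` attributes are assumed.

```python
def bed_to_gff(bed_line, source="example", feature_type="region"):
    """Convert one BED line (3 to 6 columns) into a nine-column GFF line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "."
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 else "."
    attributes = f"Name={name}" if name != "." else "."
    gff_fields = [
        chrom,
        source,
        feature_type,
        str(start + 1),  # 0-based start becomes 1-based
        str(end),        # half-open end equals the inclusive 1-based end
        score,
        strand,
        ".",             # phase, only meaningful for coding features
        attributes,
    ]
    return "\t".join(gff_fields)


# Hypothetical feature: bases 100 to 200 (1-based) on chr1.
gff_line = bed_to_gff("chr1\t99\t200\tgeneA\t0\t+")
```

Forgetting the start shift is a common source of off-by-one errors when annotations move between the two formats.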
Deposit genome/metagenome in a public repository: INSDC databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata: • MIGS/MIMS standard by GSC • GCDML format (XML) (under development), following the MIGS/MIMS standard
MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample
Sequencing experiment metadata: • MINSEQE standard by FGED: Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (under development)
Take-home messages: • Use raw sequencing data when possible • For base-call data, use “standard” FASTQ (Sanger, Phred) • For read alignments, use SAM/BAM format • Use common formats for your results (e.g. GFF or BED format) • Hope for new, generic, extensible standard format(s) • Submit MIGS/MIMS-compliant metadata of genome sequences • Keep an eye on the MINSEQE standard, and store your sequencing metadata