
Genome Sequencing: Technology and Strategies

Genome Sequencing: Technology and Strategies. Chuong Huynh NIH/NLM/NCBI huynh@ncbi.nlm.nih.gov. Acknowledgement: Daniel Lawson (Sanger Institute) and Jane Carlton (TIGR). Bioinformatics Flow Chart. 1a. Sequencing. 1b. Analysis of nucleic acid seq. 6. Gene & Protein expression data.


Presentation Transcript


  1. Genome Sequencing: Technology and Strategies Chuong Huynh NIH/NLM/NCBI huynh@ncbi.nlm.nih.gov Acknowledgement: Daniel Lawson (Sanger Institute) and Jane Carlton (TIGR)

  2. Bioinformatics Flow Chart: 1a. Sequencing • 1b. Analysis of nucleic acid seq. • 2. Analysis of protein seq. • 3. Molecular structure prediction • 4. Molecular interaction • 5. Metabolic and regulatory networks • 6. Gene & protein expression data • 7. Drug screening • 8. Genetic variability • Also shown: ab initio drug design OR drug compound screening in database of molecules

  3. How to sequence a genome • development of sequencing strategy and source of funding • procurement of DNA and initial library construction • test sequencing • large-scale random sequencing of small (2-3 kb), medium (10 kb) and large (>50 kb) insert libraries • analysis of raw sequence data with tools such as BLAST and RepeatFinder • release of genome data onto sequencing center website • random sequencing stops at 8-10X coverage • closure of sequence gaps and physical gaps • comparison to physical map • gene model prediction • final gene model annotation • release of data to GenBank and publication

  4. Full shotgun sequencing [Diagram] Genomic DNA (with mapped markers Marker1, Marker2) → large insert library (20-500 kb) → minimal tiling path → shotgun libraries: small (2-3 kb) and medium (10 kb) → sequencing (8-10X) → assembly (contigs, scaffolds) → gap closure → gene prediction, annotation and analysis

  5. Partial shotgun sequencing [Diagram] Genomic DNA → shotgun libraries: small (2-3 kb) and medium (10 kb) → sequencing (5X) → assembly (contigs, scaffolds) → analysis

  6. Genome sequencing terms
     • Raw sequence: unassembled sequence reads produced from sequencing of inserts from individual recombinant clones of a genomic DNA library.
     • Finished sequence: complete sequence of a genome with no gaps and an accuracy of > 99.9%.
     • Genome coverage: average number of times a nucleotide is represented by a high-quality base in random raw sequence.
     • Full shotgun coverage: genome coverage in random raw sequence required to produce finished sequence, usually 8-10 fold (‘8-10X’).
     • Partial shotgun coverage: typically 3-6X random coverage of a genome, which produces sequence data of sufficient quality to enable gene identification but which is not sufficient to produce a finished genome sequence.
     • Paired reads: sequence reads determined from both ends of a cloned insert in a recombinant clone.
     • Contig: contiguous DNA sequence produced from joining overlapping raw sequence reads.
     • Singleton: single sequence read that cannot be joined (‘assembled’) into a contig.
     • Scaffold: a group of ordered and orientated contigs known to be physically linked to each other by paired read information.
     • EST: expressed sequence tag generated by sequencing one end of a recombinant clone from a cDNA library. ESTs are single-pass reads and therefore prone to contain sequence errors.
     • GSS: genome survey sequence generated by sequencing one end of a recombinant clone from a genomic DNA library. The genomic DNA library can in some instances be enriched for the presence of coding regions, for example through use of mung bean nuclease digestion of genomic DNA prior to cloning.
     • SNP: single nucleotide polymorphism.
     • ORF: open reading frame, stretches of codons in the same reading frame uninterrupted by STOP codons and calculated from a six-frame translation of DNA sequence.
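
The coverage terms above reduce to simple arithmetic, sketched below. This is an illustration only and not part of the original slides: the function names, the 500-base usable read length and the Poisson (Lander-Waterman style) estimate of uncovered sequence are assumptions chosen for the example.

```python
import math


def coverage(num_reads: int, mean_hq_read_length: float, genome_size: int) -> float:
    """Average fold coverage: total high-quality bases divided by genome size."""
    return num_reads * mean_hq_read_length / genome_size


def reads_needed(target_coverage: float, mean_hq_read_length: float, genome_size: int) -> int:
    """Number of random reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / mean_hq_read_length)


def expected_uncovered_fraction(fold_coverage: float) -> float:
    """Idealised Poisson estimate of the fraction of bases hit by no read.

    Assumes reads land uniformly at random, which real libraries only approximate.
    """
    return math.exp(-fold_coverage)


if __name__ == "__main__":
    genome = 30_000_000   # a 30 Mb genome
    read_len = 500        # assumed usable bases per read (hypothetical value)
    for fold in (5, 8, 10):
        n = reads_needed(fold, read_len, genome)
        gap = expected_uncovered_fraction(fold)
        print(f"{fold}X: {n} reads, about {gap:.5f} of bases expected uncovered")
```

Under this idealised model the uncovered fraction falls off exponentially with coverage, which is one way to see why partial (3-6X) coverage leaves many gaps while 8-10X leaves few.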

  7. Jan 2003

  8. NCBI Trace Archive Sep 23, 2003

  9. Stages of a genome project: Strategy → Libraries → Sequencing → Assembly → Closure → Annotation → Release. Large-scale genome projects • Sequencing DNA molecules in the Mb size range • All strategies employ the same underlying principle: random shotgun sequencing

  10. [Diagram] Genomic DNA → shearing/sonication → subclone and sequence → shotgun reads → assembly → contigs → finishing reads → finishing → complete sequence

  11. Strategies for sequencing • How big can you go? • Large-insert clones • cosmids 30-40 kb • BACs/PACs 50-100 kb • Whole chromosomes • Whole genomes

  12. Genome size and sequencing strategies [Figure: genomes placed on a log10 genome-size axis (0 to 4, in Mb): E. coli (4 Mb), S. cerevisiae (14 Mb), P. falciparum (30 Mb), C. elegans (100 Mb), D. melanogaster (170 Mb), H. sapiens (3000 Mb). Strategies shown: whole genome shotgun (WGS), whole chromosome shotgun (WCS), clone-by-clone, and WGS with clone ‘skims’]

  13. [Diagram] Genomic DNA → shearing/sonication → subclone and sequence → shotgun reads → assembly → contigs → finishing reads → finishing → complete sequence

  14. Strategies for sequencing • Size and GC composition of genome • Volume of data • Ease of cloning • Ease of sequencing • Genome complexity • dispersed repetitive sequence • telomeres & centromeres • Politics/Funding

  15. Strategies: Clone by Clone • Simple (0.5-2 K reads) • Few problems with repeats • Relatively simple informatics • Scalability • Quality of physical map • Fingerprint / STS maps • End sequencing

  16. Strategies: Whole Chromosome shotgun (WCS) • Requires chromosome isolation • Moderate complexity (tens of thousands of reads) • Problems with repeats • Complex informatics • Inefficient in isolation • Quality of physical map (a good physical map is needed) • Skims of mapped clones

  17. Strategies: Whole Genome shotgun (WGS) • Moderate to high complexity (tens to hundreds of thousands of reads) • Massive problems with repeats • Complex informatics • Quality of physical map • Fingerprint map • STS markers • End-sequences • Skims of mapped clones

  18. Sequencing my genome [Diagram: Politics → Production → Finishing → Annotation, plotted against TIME and MONEY]

  19. What do you get? DATA, DATA, and more DATA! • Sequence (incomplete → complete) • First-pass annotation • Gene discovery • Full annotation • A starting point for research

  20. Genome annotation is central to functional genomics • ORFeome-based functional genomics • RNAi phenotypes • Gene knockout • Expression microarray

  21. Where is the problem? • Most genomes can be sequenced and will be sequenced; few problems are unsolvable. • The problem lies in understanding what you have: • gene prediction • annotation

  22. Sequencing • Library construction • Colony picking (random) • DNA preparation (isolate DNA) • Sequencing reactions • Electrophoresis • Tracking/Base calling

  23. Libraries • Essentially sub-cloning • Generation of small insert libraries in a well characterised vector • Ease of propagation • Ease of DNA purification • e.g. pUC18, M13

  24. Libraries - testing • Simple concepts • Insert/Vector ratio (Blue/White ratio) • Real data • Insert size • Sequence …. • Simple analysis

  25. Sequence generation • Pick colonies into growth medium • Template preparation (DNA isolation) • Sequence reactions • Standard terminator chemistry • pUC libraries sequenced with forward and reverse primers • Tracking and noise

  26. Sequence generation • Electrophoresis of products • Old style - slab gels, 32 → 64 → 96 lanes • New style - capillary gels, 96 lanes • Transfer of gel image to UNIX • Sequencing machines use a slave Mac/PC • Move data to centralised storage area for processing

  27. Gel image processing • Light-to-Dye estimation • Lane tracking • Lane editing • Trace extraction • Trace standardisation • Mobility correction • Background subtraction

  28. Pre-processing • Base calling using Phred • modifies SCF file format • Quality clipping from Phred • Vector clipping • Sequencing vector • Cloning vector • Screen for contaminants • Feature mark up (repeats/transposons)
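
As a rough illustration of quality clipping, the sketch below converts Phred-style quality values to error probabilities (Q = -10 log10 P) and keeps the stretch of a read that maximises the sum of (cutoff - error probability). This is a maximum-scoring-segment trim written for this example; the function names and the 0.05 cutoff are assumptions, and it should not be read as Phred's own implementation.

```python
def error_probability(q: float) -> float:
    """Convert a Phred-style quality value to an estimated base-call error probability."""
    return 10 ** (-q / 10)


def quality_clip(quals, cutoff=0.05):
    """Return (start, end), a half-open interval of the read to keep.

    Each base scores cutoff - P(error); the interval with the highest
    total score is kept. Returns (0, 0) if no base beats the cutoff.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - error_probability(q)
        if run_sum <= 0:
            # The running segment is no longer worth extending; restart after this base.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


if __name__ == "__main__":
    quals = [5, 8, 30, 35, 40, 40, 32, 9, 6]   # made-up quality values
    start, end = quality_clip(quals)
    print(f"keep bases {start}..{end - 1}")
```

Low-quality bases at either end of the read are discarded, while a single weak base inside a good stretch would be tolerated as long as the running score stays positive.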

  29. Finishing • Assembly: process of assembling raw single-pass reads into contiguous consensus sequence (Phred/Phrap). • Closure: process of ordering and merging consensus sequences into a single contiguous sequence. • Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb.

  30. Genome Assembly • Pre-assembly (assembly algorithm) • Assembly • Automated appraisal • Manual review

  31. Pre-Assembly • Convert to CAF format • flatfile text format • choice of assembler • choice of post-assembly modules • choice of assembly editor • See www.sanger.ac.uk/Software/CAF

  32. Assembly • Assemble using Phrap • Read fasta & quality scores from CAF file • Merge existing Phrap .ace file (previous assembly) as necessary • Adjust clipping (where vector, quality start)
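
To make the idea of joining overlapping reads into contigs concrete, here is a toy greedy overlap assembler. It is deliberately simplistic and is not how Phrap works: it assumes error-free reads, exact suffix/prefix overlaps, a single strand and no quality scores, and all names in it are invented for this example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads wholly contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; unmerged reads are singletons
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "GGGGGGGG"]
    print(greedy_assemble(reads))
```

The first three reads collapse into one contig and the last stays a singleton. Real assemblers also have to cope with sequencing errors, both strands and repeats, which is where the "problems with repeats" noted for shotgun strategies come from.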

  33. Assembly appraisal • Auto-edit • removes 70% of read discrepancies in the sequence assembly and highlights misassemblies for manual review • Remove cloning vector • Mark up sequence features (for finisher) • “Finish” program (or program “AutoFinish”) • Identify low-quality regions • Cover using ‘re-runs’ and ‘long-runs’ • Compare with current databases • plate contamination

  34. Manual Assembly appraisal • Use a sequence editor (GAP/consed) • Tools to identify internal joins • Tools to identify and import data from overlapping projects • Tools to check failed or mis-assembled reads for inclusion in project

  35. Manual editing • Sanger uses 100% edit strategy • Where additional data is required: • Check clipping • Additional sequencing • Template / Primer / Chemistry • Assemble new data into project • GAP4 Auto-assemble • Repeat whole process

  36. Manual Quality Checks • Force annotation tag consistency • All unedited data is re-assembled using Phrap • All high-quality discrepancies are reviewed • Confirm restriction digest (clones) • Check for inverted repeats • Manually check: • Areas of high-density edits • Areas with no supporting unedited data • Areas of low read coverage (need to confirm)

  37. Gap closure • Read pairs • PCR reactions (long-range / combinatorial) • Small-insert libraries • Transposon-insertion libraries

  38. Gap closure - contig ordering • Read pair consistency • STS mapping • Physical mapping • Genetic mapping • Optical mapping • Large-insert clone • skims • end-sequencing
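
Read-pair information can be used to decide which contigs belong together. The sketch below only groups contigs that are linked by enough read pairs, using a small union-find. It does not order or orient them, which a real scaffolder must also do. The read names, the `min_links` threshold and the data layout are hypothetical.

```python
from collections import Counter


def scaffold_groups(read_to_contig, read_pairs, min_links=2):
    """Group contigs connected by at least min_links read pairs.

    read_to_contig: dict mapping read name -> contig name
    read_pairs: iterable of (forward_read, reverse_read) from the same clone
    """
    # Count read pairs whose two ends fall in different contigs.
    links = Counter()
    for fwd, rev in read_pairs:
        a, b = read_to_contig.get(fwd), read_to_contig.get(rev)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1

    parent = {c: c for c in set(read_to_contig.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for (a, b), n in links.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    placement = {"r1.f": "c1", "r1.r": "c2", "r2.f": "c1", "r2.r": "c2",
                 "r3.f": "c2", "r3.r": "c3", "r4.f": "c3", "r4.r": "c3"}
    pairs = [("r1.f", "r1.r"), ("r2.f", "r2.r"), ("r3.f", "r3.r"), ("r4.f", "r4.r")]
    print(scaffold_groups(placement, pairs))
```

Requiring more than one supporting pair is a common-sense guard against chimeric clones or misplaced reads joining unrelated contigs; the right threshold depends on the project.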

  39. Annotation • DNA features (repeats/similarities) • Gene finding • Peptide features • Initial role assignment • Others - regulatory regions

  40. Annotation of eukaryotic genomes [Diagram] Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA (5' cap, poly-A tail) → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B). Annotation approaches labelled on the diagram: ab initio gene prediction, comparative gene prediction, functional identification

  41. Genome analysis overview: C.elegans

  42. DNA features • Similarity features • mapping repeats • simple tandem and inverted • repeat families • mapping DNA similarities • EST/mRNAs in eukaryotes • Duplications • RNAs • mapping peptide similarities • protein similarities
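
As a small illustration of mapping simple tandem repeats, the sketch below uses a regular expression with a backreference to find short units repeated head to tail. The unit sizes, the minimum copy number and the function name are arbitrary choices for this example; inverted repeats and repeat families need different methods and are not covered.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, end, unit, copies) for exact tandem repeats in seq.

    Coordinates are 0-based, end-exclusive. The unit is matched lazily,
    so the shortest repeating unit at each position is reported.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), m.end(), unit, (m.end() - m.start()) // len(unit)))
    return hits


if __name__ == "__main__":
    print(find_tandem_repeats("GGC" + "AT" * 5 + "GGC"))
```

Only perfect copies are detected here; dedicated repeat finders also tolerate mismatches and indels between copies.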

  43. Gene finding • ORF finding (simple but messy) • ab initio prediction • Measures of codon bias • Simple statistical frequencies • Comparative prediction • Using similarity data • Using cross-species similarities
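
ORF finding in its simplest form can be sketched as a scan of all six reading frames for runs of codons not interrupted by a stop codon. The code below is a minimal illustration under stated assumptions: it uses the standard stop codons (TAA, TAG, TGA), does not require a start codon, and reports minus-strand coordinates on the reverse-complemented sequence. All names are invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}                 # standard genetic code stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int):
    """Yield (start, end) of stop-free codon runs in one frame (end-exclusive)."""
    start = frame
    last = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if i > start:
                yield (start, i)
            start = i + 3
        last = i + 3
    if last > start:
        yield (start, last)  # run that reaches the end of the sequence


def six_frame_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end, orf_sequence) for every ORF of at least min_codons."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if (end - start) // 3 >= min_codons:
                    results.append((strand, frame, start, end, s[start:end]))
    return results


if __name__ == "__main__":
    for orf in six_frame_orfs("ATGAAATTTGGGTAACCC"):
        print(orf)
```

Even this short sequence yields ORFs in several frames, which illustrates why the slide calls ORF finding "simple but messy": most stop-free stretches are not genes, so length cutoffs, codon bias and similarity evidence are needed to filter them.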

  44. Peptide features • low-complexity regions • trans-membrane regions • structural information (coiled-coil) • Similarities and alignments • Protein families (InterPro/COGs)

  45. Initial role assignment • Simple attempt to describe the functional identity of a peptide • Uses data from: • peptide similarities • protein families • Vital for data mining • Large number of predicted genes remain hypothetical or unknown

  46. Other regulatory features • Ribosomal binding sites • Promoter regions

  47. Data Release • DNA release • Unfinished • Finished • Nucleotide databases • GENBANK/EMBL/DDBJ • Peptide databases • SWISSPROT/TREMBL/GENPEPT • Others

  48. Real World Example: Malaria Genome Project (if time permits)

  49. Sequencing the Plasmodium genomes • Four species of malaria infect man: Plasmodium falciparum, P. vivax, P. malariae, P. ovale • Four species of malaria infect rodents: P. yoelii, P. berghei, P. chabaudi, P. vinckei

  50. Plasmodium falciparum • ~30 million base pairs (Mb) • 80% (A+T) • 14 chromosomes • DNA “unstable” in E. coli • No large insert DNA clones suitable for sequencing • Too large for whole genome shotgun (‘96) • Whole chromosome shotgun strategy was selected
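
Base composition such as the 80% (A+T) figure above is straightforward to measure on any sequence. The sketch below computes overall and windowed A+T fractions; the function names, window sizes and the decision to ignore ambiguous bases (such as N) are assumptions made for this example.

```python
def at_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are A or T."""
    seq = seq.upper()
    counted = sum(seq.count(b) for b in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("A") + seq.count("T")) / counted


def windowed_at(seq: str, window: int = 1000, step: int = 500):
    """Return (window_start, A+T fraction) pairs along the sequence.

    Windows with no unambiguous bases are skipped.
    """
    out = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if any(b in "ACGTacgt" for b in chunk):
            out.append((start, at_fraction(chunk)))
    return out


if __name__ == "__main__":
    print(at_fraction("ATATATATGC"))
    print(windowed_at("AAAACCCC", window=4, step=4))
```

A windowed profile is useful because composition is rarely uniform along a chromosome, and extreme windows tend to be the ones that clone and sequence poorly.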
