Genome Sequencing and Annotation (Part 1)

Objective of most genome projects
• Sequence DNA and mRNA
• Identify genes
• Characterize gene features

This chapter
• How blocks of DNA sequence are obtained
• How these blocks are assembled into contigs and then genomes
• Bioinformatics: how to align sequences such as cDNAs/ESTs and genomic sequences
• Annotation of ORFs
• Other genome features: repetitive elements, variable distribution of GC content, evolutionarily conserved elements
• Gene annotation by cross-species comparison

2.1 (Part 2) The principle of dideoxy (Sanger) sequencing
Automated DNA sequencing
• In 1977, F. Sanger and colleagues published the chain-termination method (Sanger sequencing)
• Sanger shared the 1980 Nobel Prize in Chemistry, his second Nobel Prize, for this work

Automated DNA sequencing • Most current sequencing projects use the chain termination method • Also known as Sanger sequencing, after its inventor • Based on action of DNA polymerase • Adds nucleotides to complementary strand • Requires template DNA and primer

Chain-termination sequencing
• Dideoxynucleotides (ddA, ddT, ddC or ddG) stop synthesis
• They are chain terminators: because they lack a 3'-OH group, DNA polymerase cannot add another nucleotide after them
• Each is included at a low ratio to the normal dNTP, so that across the population of growing chains termination occurs at every position where that base is incorporated
• Use four reactions, one for each base: A, C, G and T

Sequence reaction products (ddA reaction; * marks the terminating ddA):
Template 3' ATCGGTGCATAGCTTGT 5'
         5' TAGCCACGTATCGAACA* 3'
         5' TAGCCACGTATCGAA* 3'
         5' TAGCCACGTATCGA* 3'
         5' TAGCCACGTA* 3'
         5' TAGCCA* 3'
         5' TA* 3'
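
The ddA reaction on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and it assumes an idealized reaction in which termination happens at every A of the new strand.

```python
# Toy model of one chain-termination reaction (function names are illustrative).
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def new_strand(template_3to5: str) -> str:
    """Strand synthesized 5'->3' along a template written 3'->5'."""
    return "".join(PAIRS[base] for base in template_3to5)


def termination_products(template_3to5: str, dd_base: str) -> list[str]:
    """All chains that end where the given dideoxy base was incorporated."""
    strand = new_strand(template_3to5)
    return [strand[: i + 1] for i, base in enumerate(strand) if base == dd_base]


template = "ATCGGTGCATAGCTTGT"  # template from the slide, written 3'->5'
products = termination_products(template, "A")
for chain in products:
    print(f"5' {chain}* 3'")
```

Running the same function with "C", "G" and "T" gives the products of the other three reactions.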

Sequence detection
• To detect the products of the sequencing reaction, include labeled nucleotides
• Formerly, radioactive labels (33P or 35S) were used
• Now fluorescent labels are used
• With a different fluorescent tag for each nucleotide, all four reactions can be run in a single gel lane or capillary tube

Example products, each ending in a differently labeled terminator:
TAGCCACGTATCGAA*
TAGCCACGTATC*
TAGCCACG*
TAGCCACGT*

Sequence separation
• Terminated chains need to be separated
• Requires one-base resolution: chains of X and X+1 bases must be distinguishable
• Gel electrophoresis
• Very thin gel
• High voltage applied
• Works with radioactive or fluorescent labels
• Negative pole (–) at the top, positive pole (+) at the bottom
[Figure: gel lanes labeled C, A, G, T]

Sequence reading of radioactively labeled reactions
• The final step of sequencing is to read the sequence
• For radioactively labeled reactions, the gel is dried and placed on X-ray film; when the film is developed, the position of each band becomes visible
• The sequence is read from the bottom (the positive pole) up
• Each of the four lanes gives the positions of a different base: A, T, C or G

Sequence reading of fluorescently labeled reactions
• Fluorescently labeled products are scanned by a laser as they pass a fixed point
• The color is picked up by a detector
• The output is sent directly to a computer
• The readout gives both the bases and the intensity of each color, so that ambiguous readings are easily identified

Summary of chain termination sequencing
A primer is extended by DNA polymerase according to the sequence of the template strand. Each chain is terminated when a ddNTP complementary to the template base is incorporated. The products of the four reactions are separated on a gel that can resolve one-base differences. The sequence is then read from the bottom of the gel to the top.
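
Reading a four-lane gel from the bottom up amounts to sorting the bands by fragment length and noting which lane each band is in. The sketch below illustrates this, assuming idealized bands; the lane contents are derived from the example strand TAGCCACGTATCGAACA used earlier and are not measured data.

```python
# Toy gel reader: each lane lists the lengths of the fragments it contains.
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read bases from the shortest fragment (bottom of gel) to the longest (top)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Fragment lengths per lane for the example strand (assumed, idealized).
lanes = {
    "A": [2, 6, 10, 14, 15, 17],
    "C": [4, 5, 7, 12, 16],
    "G": [3, 8, 13],
    "T": [1, 9, 11],
}
sequence = read_gel(lanes)
print(sequence)
```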

High-Throughput Sequencing
The new techniques and equipment include:
(1) Four-color fluorescent dyes have replaced the radioactive label
(2) Rather than stopping the electrophoresis at a particular time, the products are scanned for laser-induced fluorescence just before they run off the end of the electrophoresis medium
(3) Improvements in the chemistry of template purification and the sequencing reaction
(4) Slab gel electrophoresis gave way to capillary electrophoresis with the introduction in 1999 of Applied Biosystems' ABI Prism 3700 automated sequencers, which in turn were superseded in 2003 by ABI Prism 3730 DNA analyzers (these deliver extremely high-quality, long reads and save time and money)
[Figure: ABI Prism 3730 DNA analyzers]

Reading sequence traces
• Base-calling: the reading of raw sequence traces
• Now routinely performed by automated software that calls bases, aligns similar sequences and supports editing
• Program: phred (http://www.phrap.org)
• The program assigns a probability score to the accuracy of each base call as the trace is read
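
The slide does not give the formula, but phred-style quality scores are conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion (function names are illustrative):

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score Q for an estimated base-call error probability P."""
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse conversion: error probability implied by a quality score."""
    return 10.0 ** (-quality / 10.0)


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error expected in {round(1 / error_probability(q))} calls")
```

On this scale, every 10 points corresponds to a ten-fold drop in the expected error rate.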

2.3 Automated sequence chromatograms
• This sequence shows the noisiness of the first 30 bp of a run.
• The middle two rows show a segment of two sequences that are polymorphic for both SNPs and an indel.
• A decline in sequence quality typically occurs after about 800 bp.

Ex. 2.1 Reading a sequence trace
• A base is labeled N where the sequence quality is poor.
• Two peaks of the same height at the same location indicate a heterozygous site, here a C/T SNP.

Figure 2.5 An aligned-reads window in consed
Contig Assembly

Assembling DNA seq. fragments
• NCBI dbEST database: http://www.ncbi.nlm.nih.gov/Database/
• View the EST statistics
• FTP EST files

Assembling DNA seq. fragments
• IFOM assembler
• http://bio.ifom-firc.it/ASSEMBLY/assemble.html
• Multiple EST sequences → contig
• The maximum number of sequences you can enter is 10,000
• Example: use GI numbers 15744427, 19124086, 8147732, 8147734, 20393914 and 13728017
• Their lengths are 850, 1062, 634, 596, 869 and 768 bp
• The result is a single contig consensus sequence, which can be used for a similarity search against a database

Assembling DNA seq. fragments – 6 GI fragments
>gi|15744427|gb|BI752849.1|BI752849 603022060F1 NIH_MGC_114 Homo sapiens cDNA clone IMAGE:5192510 5', mRNA sequence
CGGGGTGCTGCGAGCGCGGGGCCAGACCAAGGCGGGCCCGGAGCGGAACTTCGGTCCCAGCTCGGTCCCCGGCTCAGTCCCGACGTGGAACTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATATGCGTTGACTGTGCCATGGGAGAGTAGCAGAAAACAGCAGCATGCTCAACCTCCCACTTTCTCTATTGGATGTGGACTCAAAGCCCT
>gi|19124086|gb|BM807263.1|BM807263 AGENCOURT_6574903 NIH_MGC_124 Homo sapiens cDNA clone IMAGE:5732238 5', mRNA sequence
GTCCGGAATTCCCGGGATCTCAGCAGCGGAGGCTGGACGCTTGCATGGCGCTTGAGAGATTCCATCGTGCCTGGCTCACATAAGCGCTTCCTGGAAGTGAAGTCGTGCTGTCCTGAACGCGGGCCAGGCAGCTGCGGCCTGGGGGTTTTGGAGTGATCACGAATGAGCAAGGCGTTTGGGCTCCTGAGGCAAATCTGTCAGTCCATCCTGGCTGAGTCCTCGCAGTCCCCGGCAGATCTTGAAGAAAAGAAGGAAGAAGACAGCAACATGAAGAGAGAGCAGCCCAGAGAGCGTCCCAGGGCCTGGGACTACCCTCATGGCCTGGTTGGTTTACACAACATTGGACAGACCTGCTGCCTTAACTCCTTGATTCAGGTGTTCGTAATGAATGTGGACTTCACCAGGATATTGAAGAGGATCACGGTGCCCAGGGGAGCTGACGAGCAGAGGAGAAGCGTCCCTTTCCAGATGCTTCTGCTGCTGGAGAAGATGCAGGACAGCCGGCAGAAAGCAGTGCGGCCCCTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATACGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGCGGGGAACAGGGTCCTGAAACCTGACCATTTTGCCCCAGACCTTGACCAATCCACCTCATGGCGATTCTCCCTCCAGGAATTCCCCGACCGAGAAAAAATTGGCCACTTCCCCGGAATTTCCCCCCAAAAACTTGGAATTTCACCCAAAACCTTTCCCATGTAAACCCGGAAACCCTGGGGAAGGCT
>gi|8147732|gb|AW958049.1|AW958049 EST370119 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GAACTAGTGGATCCCCCGGGCTGCAGGAATTCGGCACGAGTGGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTCATGCGATTCTTCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGAACCTTCCAATGAAGCGAGAATCTTGTGAAGCTGAAGAACAGTCTGGAAGGCAAGATGAGCTTTTTGCTGGGAATGCGCACGTGGAAAGGCAGAATTCGGTCATAA
>gi|8147734|gb|AW958051.1|AW958051 EST370121 MAGE resequences, MAGE Homo sapiens cDNA, mRNA sequence
GGAGCTGGCCTACTGCCTGCAGAAGTGCAACGTGCCCTTGTTTGTCCAACATGATGCTGCCCAACTGTACCTCAAACTCTGGAACCTGATTAAGGACCAGATCACTGATGTGCACTTGGTGGAGAGACTGCAGGCCCTGTATATGATCCGGGTGAAGGACTCCTTGATTTGCGTTGACTGTGCCATGGAGAGTAGCAGAAACAGCAGCATGCTCACCCTCCCACTTTCTCTTTTTGATGTGGACTCAAAGCCCCTGAAGACACTGGAGGACGCCCTGCACTGCTTCTTCCAGCCCAGGGAGTTATCAAGCAAAAGCAAGTGCTTCTGTGAGAACTGTGGGAAGAAGACCCGTGGGAAACAGGTCTTGAAGCTGACCCATTTGCCCCAGACCCTGACAATCCACCTTATGCGATTCTCCATCAGGAATTCACAGACGAGAAAGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTTGAGGAGCAATCTGGAGGGCATATGAGCTTTTTGCTGTGATTGCGCACCTGGGAATGCAAAACTCCGTCATTACTG
>gi|20393914|gb|BQ213074.1|BQ213074 AGENCOURT_7559959 NIH_MGC_72 Homo sapiens cDNA clone IMAGE:6055692 5', mRNA sequence
AGATCTGCCACTCCCTGTACTTCCCCCAGAGCTTGGATTTCAGCCAGATCCTTCCAATGAAGCGAGAGTCTTGTGATGCTGAGGAGCAGTCTGGAGGGCAGTATGAGCTTTTTGCTGTGATTGCGCACGTGGGAATGGCAGACTCCGGTCATTACTGTGTCTACATCCGGAATGCTGTGGATGGAAAATGGTTCTGCTTCAATGACTCCAATATTTGCTTGGTGTCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGGCCAAGGGCAGTGGCAGGGGGTATTTCAGTATTATACCACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGGGCTGGTTGTAATTTTTTCCCTTTGAAGAAACACCATTAATTTCCTAATGAATCCAAGTGGTTTGTAACTTGCCTATTCCTTTTATTCCAGCAAAAAATTAATTGATCATCCCCTCCCCCAAAAAATAGGGG
>gi|13728017|gb|BG206330.1|BG206330 RST25778 Athersys RAGE Library Homo sapiens cDNA, mRNA sequence
TCCTGGGAAGACATCCAGTGTACCTACGGAAATCCTAACTACCACTGGCAGGAAACTGCATATCTTCTGGTTTACATGAAGATGGAGTGCTAATGGAAATGCCCAAAACCTTCAGAGATTGACACGCTGTCATTTTCCATTTCCGTTCCTGGATCTACGGAGTCTTCTAAGAGATTTTGCAATGAGGAGAAGCATTGTTTTCAAACTATATAACTGAGCCTTATTTATAATTAGGGATATTATCAAAATATGTAACCATGAGGCCCCTCAGGTCCTGATCAGTCAGAATGGATGCTTTCACCAGCAGACCCGGCCATGTGGCTGCTCGGTCCTGGGTGCTCGCTGCTGTGCAAGACATTAGCCCTTTAGTTATGAGCCTGTGGGAACTTCAGGGGTTCCCAGTGGGGAGAGCAGTGGCAGTGGGAGGCATCTGGGGGCCAAAGGTCAGTGGCAGGGGGTATTTCAGTATTATACAACTGCTGTGACCAGACTTGTATACTGGCTGAATATCAGTGCTGTTTGTAATTTTTCACTTTGAGAACCAACATTAATTCCATATGAATCAAGTGTTTTGGAACTGCTATTCATTTATTCAGCAAATATTTATTGGTCATCTTTTCTCCATAAGATAGTGTGATAAACACAGCATGAATAAAGGTATTTTCCACACAGACAAGTGTTTTTTCACAAAATTATTNATTTTGNTGGGGCTGTGGCGGCCGCTTCCTTTATGGGGGGGAATTTAGAACCCGTTCCTGACGCGGGGGN

Assembling DNA seq. fragments List of assembled fragments

Assembling DNA seq. fragments Overlap details

Assembling DNA seq. fragments End of overlap details Assembled mRNA sequence

Box 2.1 Pairwise Sequence Alignment
• The most important class of bioinformatics tools: pairwise alignment of DNA and protein sequences

          alignment 1    alignment 2
Seq. 1    ACGCTGA        ACGCTGA
Seq. 2    A--CTGT        ACTGT--

• Seek alignments with high sequence identity and few mismatches and gaps
• Assumption: the observed identity between the aligned sequences results either from chance or from a shared evolutionary origin
• Identity ≠ similarity
• Treating sequence identity as homology is a risky assumption: sequence identity ≠ homology

Box 2.1 Pairwise Sequence Alignment
• The same true alignment can arise through different evolutionary events
• Scoring scheme: substitution → -1, indel → -5, match → 3
Figure A Common evolutionary events and their effects on alignment (scores shown in the figure: 9, 5, 4, 4)

Box 2.1 Pairwise Sequence Alignment
• Find the optimal score → the best guess for the true alignment
• To find the optimal pairwise alignment of two sequences, insert gaps into one or both of them so as to maximize the total alignment score
• Dynamic programming (DP): Needleman and Wunsch (1970), Smith and Waterman (1981). The algorithm is guaranteed to find all optimal alignments of two sequences of lengths m and n
• BLAST is a faster heuristic: it extends short word matches rather than filling the full DP matrix, so it is not guaranteed to find the optimal alignment
• Prof. Waterman http://www.usc.edu/dept/LAS/biosci/faculty/waterman.html

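Box 2.1 Pairwise Sequence Alignment
The score for the alignment of the first i residues of sequence 1 against the first j residues of sequence 2 is given by

S(i,j) = max { S(i-1,j-1) + c(i,j), S(i-1,j) + c(i,-), S(i,j-1) + c(-,j) }

where c(i,j) = the score for aligning residue i with residue j, which takes the value 3 for a match or -1 for a mismatch, and c(i,-) = c(-,j) = the penalty for aligning a residue with a gap, which takes the value -5. The border cells are S(0,0) = 0, S(i,0) = -5i and S(0,j) = -5j.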

Box 2.1 Pairwise Sequence Alignment
• The entry for S(1,1) is the maximum of the following three values:
• S(0,0) + c(A,A) = 0 + 3 = 3 [c(A,A) = c(1,1)]
• S(0,1) + c(A,-) = -5 + (-5) = -10 [c(A,-) = c(1,-)]
• S(1,0) + c(-,A) = -5 + (-5) = -10 [c(-,A) = c(-,1)]
• Similarly, S(2,1) is the maximum of three values: (-5) - 1 = -6; 3 - 5 = -2; and (-10) - 5 = -15 → the best entry is the addition of the C indel to the A-A match, for a score of -2 (see next page).

Box 2.1 Pairwise Sequence Alignment
The alignment matrix of sequences 1 and 2

S(2,1) = max { S(1,0) + c(2,1), S(1,1) + c(2,-), S(2,0) + c(-,1) }
       = max { S(1,0) + c(C,A), S(1,1) + c(C,-), S(2,0) + c(-,A) }
       = max { -5 - 1, 3 - 5, -10 - 5 }
       = -2

Box 2.1 Pairwise Sequence Alignment
• Traceback → determine the actual alignment
• Start from the top right-hand corner of the matrix → the (7,5) cell
• For example, the 1 in the (7,5) cell could only be reached by the addition of the mismatch A-T

ACGCTGA        ACGCTGA
A--CTGT   or   AC--TGT

• Both alignments have 4 matches, 1 mismatch and 2 indels
• The ambiguity has to do with which C in seq. 1 aligns with the C in seq. 2
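
The matrix fill and traceback described in Box 2.1 can be sketched in Python as follows. This is a minimal illustration using the scoring scheme from the slides (match 3, mismatch -1, indel -5); the function name is made up here, and the traceback returns only one optimal alignment, whereas the slides point out that two exist for this example.

```python
def global_align(seq1: str, seq2: str, match: int = 3, mismatch: int = -1, gap: int = -5):
    """Fill S(i,j) by dynamic programming, then trace back one optimal alignment."""
    m, n = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap  # prefix of seq1 aligned entirely against gaps
    for j in range(1, n + 1):
        S[0][j] = j * gap  # prefix of seq2 aligned entirely against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + c, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Traceback from the (m, n) cell to (0, 0).
    row1, row2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            c = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + c:
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return S, "".join(reversed(row1)), "".join(reversed(row2))


S, aligned1, aligned2 = global_align("ACGCTGA", "ACTGT")
print(S[7][5])
print(aligned1)
print(aligned2)
```

Which of the two optimal alignments is returned depends only on the order in which the three moves are tested during traceback.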

Box 2.1 Pairwise Sequence Alignment
Parameter settings: gap penalties
• Default settings are the easiest to use, but they do not necessarily yield the correct alignment
• Constant penalty: independent of the length of the gap, A
• Proportional penalty: proportional to the length L of the gap, BL (this is what we used in this lecture)
• Affine gap penalty: gap-opening penalty + gap-extension penalty = A + BL
• There is no rule for predicting the penalty that best suits an alignment
• Optimal penalties vary from sequence to sequence → finding them is a matter of trial and error
• Usually A > B, because opening a gap is penalized more heavily than extending it (usually A/B ~ 10)
• Hint: (1) when comparing distantly related sequences, a high A and a very low B often give the best results → gaps are penalized more for their existence than for their length; (2) when comparing closely related sequences, penalize both opening and extension
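
The three penalty models can be compared directly in code. The sketch below treats penalties as positive costs and uses A = 10 and B = 1 purely as example values consistent with the A/B ~ 10 rule of thumb; the function names are illustrative.

```python
def constant_penalty(length: int, A: float) -> float:
    """Same cost A for any gap, regardless of its length."""
    return A if length > 0 else 0.0


def proportional_penalty(length: int, B: float) -> float:
    """Cost grows linearly with gap length L: B * L."""
    return B * length


def affine_penalty(length: int, A: float, B: float) -> float:
    """Gap-opening cost A plus extension cost B * L."""
    return A + B * length if length > 0 else 0.0


A, B = 10.0, 1.0  # example values only
one_long_gap = affine_penalty(4, A, B)
four_short_gaps = 4 * affine_penalty(1, A, B)
print(one_long_gap, four_short_gaps)
```

Under the affine model one gap of length 4 costs far less than four separate gaps of length 1, which is the sense in which gaps are penalized more for their existence than for their length.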

Exercise 2.2 Computing an optimal sequence alignment
• Two scoring schemes:
• (1) Gap penalty = -5, mismatch = -1, match = 3
• (2) Gap penalty = -1, mismatch = -1, match = 3
• Under (1): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-5) = 8
• Under (2): first alignment score = 5*3 + 2*(-1) = 13; second/third alignment score = 6*3 + 2*(-1) = 16
• A more serious problem: the choice of gap penalty changes which alignment scores best, so a poor choice identifies the wrong alignment
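
The arithmetic of the exercise can be checked with a small helper that scores an alignment from its counts of matches, mismatches and gap positions. The counts are those given on the slide (5 matches and 2 mismatches for the first alignment; 6 matches and 2 gaps for the second and third); the function name is illustrative.

```python
def alignment_score(matches: int, mismatches: int, gaps: int,
                    match: int = 3, mismatch: int = -1, gap: int = -5) -> int:
    """Total score of an alignment summarized by its column counts."""
    return matches * match + mismatches * mismatch + gaps * gap


for gap in (-5, -1):
    first = alignment_score(5, 2, 0, gap=gap)
    second = alignment_score(6, 0, 2, gap=gap)
    best = "first" if first > second else "second/third"
    print(f"gap = {gap}: first = {first}, second/third = {second}, best = {best}")
```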

Exercise 2.2 Computing an optimal sequence alignment Gap penalty = -5 Gap penalty = -1

Emerging Sequencing Methods
Costs of genome sequencing
• Mid-2000s: $30-50 million to sequence a mammalian genome
• Target: $1000 per human genome by the year 2010
• J. Craig Venter Science Foundation: $500,000 award for the first to achieve this goal
New technologies
• Sequencing by hybridization (SBH): detects whether or not an exact match is present in a sample of DNA
• Mass spectrometric techniques: ionized fragments, time of flight
• Nanopore sequencing strategies: ultrafast and relatively inexpensive sequencing of long DNA fragments
• Single-molecule approaches: Solexa, Visigen and Genovoxx
• Single-molecule polony sequencing

Figure 2.6 Single-molecule polony sequencing
Emerging Sequencing Methods
• A dilute solution of DNA is plated onto a glass microscope slide.
• In situ PCR produces thousands of tiny colonies of DNA, which then incorporate single dye-labeled dNTPs.
• Polony: PCR colony (a cluster of amplified DNA)
• The slide is read after each cycle of incorporation of a new base, allowing short sequences to be determined.
• Each numbered polony produces a short 20-25 nucleotide sequence, as shown.
• These can then be assembled computationally into a contiguous sequence.

Figure 2.7 (Part 1) Hierarchical versus shotgun sequencing
Genome Sequencing
• Whole-genome sequences are assembled from ~10^5 fragments, each typically between 500 and 1000 bp in length.
• Two general approaches to fragmentation and assembly: (1) hierarchical sequencing, (2) shotgun sequencing.
• For a historical overview, see http://www.sciencemag.org/feature/plus/sfg/human/timeline1.shtml
• Hierarchical sequencing: first develop a low-resolution physical map, so that the sequence is obtained in large ordered pieces.
• Shotgun sequencing: break the genome into small fragments and use computer algorithms to assemble them (see Figure 2.7).
• Most new genome projects adopt the shotgun approach.

Genome Sequencing – hierarchical sequencing
• Top-down, map-based or clone-by-clone strategy, dating from the late 1980s
• Genome → broken into small fragments
• The relative locations of the fragments are known BEFORE sequencing
Advantages
• It fostered the assembly of high-resolution physical and genetic maps
• It allowed groups around the globe to work on different parts of a genome
• Technology for cloning large genomic fragments progressed rapidly throughout the 1990s, for genomes such as E. coli, S. cerevisiae, C. elegans and A. thaliana
• Top-down sequencing → clone the sequence as manageable fragments (50-200 kb in length)
• Cloning vectors: BAC (~300 kb), PAC (~100 kb), phage-derived cosmids

Figure 2.7 (Part 2) Shotgun sequencing
Genome Sequencing – Shotgun sequencing
In the shotgun approach, no attempt is made to order the clones in advance. Instead, the whole genome is assembled using computer algorithms that order contigs based on their overlapping sequences.

Figure 2.8 Cloning vectors used in genome sequencing

Genome Sequencing – hierarchical sequencing
DNA libraries
• DNA is fragmented by restriction enzyme (RE) digestion or sonication (ultrasound treatment)
• Fragments are ligated into a multiple cloning site (MCS) in the vector
• Aim for 5- to 10-fold redundancy: the library should contain 5 to 10 times as much DNA as the genome
• Each clone will have different ends → it is possible to select a scaffold of clones that forms a contiguous sequence coverage, a tiling path, by aligning the regions of overlap (Fig. 2.9)
• The tiling path can be assembled using a combination of 3 methods: (1) hybridization, (2) fingerprinting and (3) end-sequencing

Figure 2.9 Hierarchical assembly of a sequence-contig scaffold (supercontig)
Genome Sequencing – hierarchical sequencing
• A minimal tiling path through a library of aligned BAC clones is chosen that ensures complete coverage of the chromosome.
• After independent shotgun libraries have been sequenced for each BAC, small gaps in the sequenced clone contigs remain.
• These are closed as far as possible by merging the two BAC sequences, as well as by adding mate-pair information (yellow) and cDNA structural information (red), which establish the orientation of and distance between cloned segments.

Genome Sequencing – hierarchical sequencing
• Hybridization
• All of the clones in a library that carry a particular sequence can be identified rapidly by hybridizing a small radioactively or chemically labeled probe containing the sequence to a filter on which an array of ~10,000 clones is printed (Fig. 2.10A)
• Fingerprinting
• Study the restriction enzyme (RE) digestion patterns
• Contigs of large-insert clones are assembled by comparing and aligning the clones according to their RE patterns
• An RE with a 6-bp recognition site cuts on average once every 4^6 = 2^12 = 4096 ≈ 4000 bp
• For a 100 kb BAC: 100 kb / 4 kb ~ 20-30 fragments
• These fragments can be separated by electrophoresis → fingerprint profile → BAC alignment by gel → software alignment → overlaps → contigs → assembly of contigs ~Mb in length
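
The fingerprinting arithmetic above can be written out as a short calculation. It assumes, as the slide implicitly does, a random sequence with equal frequencies of the four bases; the function names are illustrative.

```python
def expected_site_spacing(site_length: int) -> int:
    """Average distance in bp between occurrences of a recognition site of given length."""
    return 4 ** site_length


def expected_fragments(insert_bp: int, site_length: int) -> float:
    """Approximate number of fragments produced by digesting an insert of insert_bp."""
    return insert_bp / expected_site_spacing(site_length)


spacing = expected_site_spacing(6)            # 6-bp cutter
fragments = expected_fragments(100_000, 6)    # 100 kb BAC insert
print(spacing, round(fragments, 1))
```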

Figure 2.10 Aligning BAC clones by hybridization and fingerprinting
Genome Sequencing – hierarchical sequencing
• (A) A macroarray of BAC clones is probed with a short, radioactive fragment to identify all BACs that carry a specific fragment.
• These clones are digested with an RE, end-labeled and separated by gel electrophoresis.
• Software converts the bands to a virtual profile, shown hypothetically for a small portion of four bands (highlighted box in part B). Shared bands (red or blue) imply that two clones share the same sequence. Green indicates the vector band common to all clones.
• The fingerprint profile is then converted into a BAC alignment. In this example, clone 2 does not share any bands with the others and so is placed into a separate BAC contig, while the other three clones form a tiling path.

Genome Sequencing – hierarchical sequencing
• End-sequencing
• Fills in the gaps left after fingerprinting. How?
• Sequence both ends of the collection of BAC clones
• Once a critical threshold of sequences has been reached → overlaps can be detected
• For example, along a 10 Mbp genome, the end sequences of 10,000 BAC clones provide a sequence tag every 5 kb (for 5-fold coverage)
• Along a 10 Mbp genome: 10 Mbp / 10,000 BACs → 1 kbp per BAC
• With five-fold coverage: 10 Mb / 2000 BACs ~ 5 kb (the distance between sequence tags)
• Given this tag density, it is possible to close gaps < 50 kb
• Once the tiling path is chosen → shotgun the BAC clones into small fragments
• Subcloning: use an M13 phagemid (~1 kb; exists as dsDNA and ssDNA) or clone 2-3 kb fragments into a plasmid vector

Genome Sequencing – Shotgun sequencing
• Use computer algorithms to assemble the sequences (~100,000)
• About 5- to 10-fold redundancy for each fragment
• Library: made from a single whole genome
• After MSA → screen out repetitive sequences, overlap reads of the same sequence → generate unitigs and scaffolds → >90% of the sequences are assembled
• Finishing phase: closing gaps and cleaning up ambiguities → takes as much time as the shotgun phase
• Users are asked to trust the assemblies
• Celera Genomics used the following software to assemble the sequences:
• Screener: masks (does not remove) sequences that contain repetitive DNA (such as microsatellites, LINEs, Alu repeats, retrotransposons and ribosomal DNA)
• Overlapper: compares every unscreened read against every other unscreened read, searching for overlaps of a predetermined length and identity. Parallel processing on 40 supercomputers, each with 4 GB RAM, allowed the 27 M screened human sequence reads to be overlapped in < 5 days
• Unitigger: resolves repeat-induced overlaps of a sequence (see Figure 2.11)
• Scaffolder: uses mate-pair information to link U-unitigs into scaffold contigs
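
The all-against-all comparison performed by an overlapper can be illustrated with a toy version. This is not Celera's Overlapper: it is a simplified sketch that requires exact suffix-prefix matches of a minimum length, whereas the real program tolerates mismatches up to a set identity threshold. The reads are made-up pieces of the 17-base example strand used earlier, and the function names are illustrative.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads: list[str], min_len: int) -> dict[tuple[int, int], int]:
    """Compare every read against every other read and keep the qualifying overlaps."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k > 0:
                    overlaps[(i, j)] = k
    return overlaps


reads = ["TAGCCACGTA", "CACGTATCGA", "TATCGAACA"]  # toy reads, not real data
overlaps = all_overlaps(reads, min_len=4)
# Merge the chain read 0 -> read 1 -> read 2 using the overlap lengths found.
contig = reads[0] + reads[1][overlaps[(0, 1)]:] + reads[2][overlaps[(1, 2)]:]
print(overlaps)
print(contig)
```

The comparison is quadratic in the number of reads, which is why the slide stresses the computing power needed for 27 million reads.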

Genome Sequencing – Shotgun sequencing
Figure 2.11 U-unitigs and repeat resolution
• Sequence alignment between two or more shotgun clones can arise between unique sequences (left) or repetitive sequences (right).
• (B) The Overlapper aligns unitigs, which are identified as unique sequence alignments (U-unitigs) or overcollapsed repeats (blue).
• Two contigs can be aligned and oriented by using mate-pair sequence information from the ends of longer (10- or 50-kb) clones, as shown at the bottom, while mate-pairs from 2-kb fragments allow assembly of scaffolds despite the presence of simple repeats such as microsatellites (blue) that are masked before performing alignments.

Genome Sequencing – Shotgun sequencing
Figure 2.12 Proportion of fly and human genomes in large scaffolds
• Figure 2.12 shows the estimated coverage of the fly and human whole genomes after initial assembly: in both cases, 84% or more of the genome was covered by scaffolds at least 100 kb in length, while most scaffolds were in the Mb range.
• Increasing sequence coverage from 5x to 10x → a 10% increase in the proportion of scaffolds of lengths up to 1 Mb.
• The plot shows the percentage of scaffolds that have a length greater than that indicated for the fly 10x, human 8x (CSA) and human 5x (whole-genome assembly, WGA) sequences generated by Celera.
• The fly and CSA assemblies include shredded sequences generated from BAC clones by public genome sequencing efforts.

NCTS http://math.cts.nthu.edu.tw/Mathematics/conference-PT2005.html UCSD http://research.calit2.net/recomb-workshop05/
