Lecture 08 DNA Sequence Analysis and Fragment Assembly System (FAS)
Lecture 07 DNA Sequence Analysis
The Organization of a Eukaryotic Gene
[Figure: three-tier diagram of a eukaryotic gene.
GENE: Enhancer, Promoter, Exon 1, Intron, Exon 2, Intron, Exon 3, Intron, Exon 4, Poly(A) signal.
Transcription gives the mRNA transcript (5' to 3'): 5'-untranslated region, Exon 1, Intron, Exon 2, Intron, Exon 3, Intron, Exon 4, 3'-untranslated region.
Processing gives the mature mRNA (5' to 3'): 7-mG cap, start, Exon 1, Exon 2, Exon 3, Exon 4, stop, (AAAAAA)n.]
Gene identification involves 4 main stages
(1) Find the putative coding region(s) in the sequence
    • Open reading frame
(2) Find non-coding features of interest in the sequence
    • CpG islands
    • Tandem and dispersed repeats
    • Promoter regions (TATA box, cap signal, CCAAT-box)
    • Transcription factors, Poly-A sites
(3) Determine the exon-intron organization
    • Branch point signal: CT(G,A)A(C,T)
    • 5' and 3' splice sites: AG/GUAAGU--------------PyPyPyPyPyPyPyPy-CAG/G
(4) Identify the gene
    • Motif, signal and pattern
    • Blast, FASTA
    • Functional studies
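The first stage (open reading frame search) can be sketched in a few lines. This is only an illustration, not how the GCG programs work: the names find_orfs and reverse_complement are hypothetical, an ORF is assumed to run from ATG to the first in-frame stop codon, and all six frames are scanned.

```python
# Minimal six-frame ORF scan (illustrative sketch, not the GCG implementation).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based on the scanned strand, end exclusive.
    min_codons counts the stop codon as well.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

A real gene finder adds much more (codon usage, splice signals, homology), which is why the predictions still have to be checked.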
GENE FINDERS
Banbury Cross   http://igs-server.cnrs-mrs.fr/igs/banbury
FGENEH          http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID          http://www1.imim.es/geneid.html
GeneMachine     http://genome.nhgri.nih.gov/genemachine
GeneParser      http://beagle.colorado.edu/_eesnyder/GeneParser.htl
GENSCAN         http://genes.mit.edu/GENSCAN.html
Genotator       http://www.fruitfly.org/_nomi/genotator/
GRAIL           http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP       http://compbio.ornl.gov/grailexp/
HMMgene         http://www.cbs.dtu.dk/services/HMMgene/
MZEF            http://www.cshl.org/genefinder
PROCRUSTES      http://www-hto.usc.edu/software/procrustes
RepeatMasker    http://ftp.genome.washington.edu/RM/RepeatMasker.html
Sputnik         http://rast.abajian.com/sputnik/
Function                          Command
Sequence manipulation             Reverse
ORF Searching                     Frames, Map, Translate
Mapping (restriction sites)       Map (-minc) (-maxc), Mapsort (-exclude) (-digest), Mapplot
Mapping (transcription factors)   Map tfsites
Availability marks from the GCG and SeqWEB columns of the slide, in the order extracted: + + + + + + + + + + + + + + + + + + + + - - + -
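The two simplest functions in the table, reversing a sequence and mapping restriction sites, can be sketched as follows. This is an illustration under assumptions, not the GCG Map program: the function names are hypothetical, only EcoRI (G^AATTC) and BamHI (G^GATCC) are included, and the min_cuts/max_cuts arguments are a guess at the kind of filtering that -minc and -maxc perform.

```python
# Illustrative restriction mapping sketch (not the GCG Map implementation).
# Each entry: recognition site and the cut offset inside the site.
ENZYMES = {"EcoRI": ("GAATTC", 1), "BamHI": ("GGATCC", 1)}


def find_sites(seq, site):
    """0-based start positions of every occurrence of site in seq."""
    seq = seq.upper()
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)   # step by one so overlapping hits count
    return positions


def restriction_map(seq, enzymes=ENZYMES, min_cuts=1, max_cuts=None):
    """Cut positions per enzyme, keeping enzymes within the cut-count range."""
    result = {}
    for name, (site, offset) in enzymes.items():
        cuts = [p + offset for p in find_sites(seq, site)]
        if len(cuts) >= min_cuts and (max_cuts is None or len(cuts) <= max_cuts):
            result[name] = cuts
    return result


def digest(seq, cuts):
    """Fragment lengths of a linear molecule cut at the given positions."""
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    demo = "AAGAATTCTTGGATCCAAGAATTCAA"
    print(restriction_map(demo))
    print(digest(demo, restriction_map(demo)["EcoRI"]))
```

Both sites used here are palindromic, so scanning one strand is enough; a non-palindromic site would also need a search on the reverse complement.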
What to do next? A prediction made by these programs is just that: a prediction. NEVER TRUST A COMPUTER!
Exercise 91-07
Programs used in this exercise:
(1) Sequence manipulation: reverse
(2) ORF searching: frames, map, translate
(3) Mapping (restriction sites): map (-minc, -maxc), mapsort (-exclude, -digest), mapplot, plasmidmap
(4) Mapping (transcription factors): map (tfsites)
Sequences used in this exercise:
• gb:z18853 (C. elegans mRNA for capping protein alpha subunit), cds:10-858
• gb:x03795 (Human mRNA for platelet-derived growth factor A-chain, PDGF-A), cds:388-1020
Lecture 08 Fragment Assembly System (FAS)
Fragment Assembly System (FAS)
Please download Bioinfo91-08.exe. After decompression it contains the following files:
• Bioinfo91-08.htm (lecture slides)
• Exercise91-08.doc (class exercise)
• Gelassemble commands.doc & SeqED commands.doc (command reference)
• Seq01.txt - seq10.txt (sequences for the exercise)
Fragment Assembly System (FAS)
Assemble overlapping fragment sequences from a sequencing project.
(1) Store fragment sequences;
(2) Recognize overlapping sequences and create aligned assemblies, called contigs;
(3) Display, edit and output the contigs for further analysis.
[Figure: numbered fragments (5, 3, 1, 4, 2) aligned into Contig 1 and Contig 2, with a consensus.]
Limits: a contig may not contain more than 1,650 fragments and may not be longer than 200,000 bases. No single fragment may be longer than 2,500 bases.
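The size limits quoted above are easy to check before loading data into a project. A minimal sketch, assuming you already know the fragment lengths and the expected contig length; the function name check_fas_limits is hypothetical and is not part of GCG.

```python
# Pre-flight check against the FAS limits quoted on the slide (sketch only).
MAX_FRAGMENTS_PER_CONTIG = 1650
MAX_CONTIG_BASES = 200_000
MAX_FRAGMENT_BASES = 2_500


def check_fas_limits(fragment_lengths, contig_length):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(fragment_lengths) > MAX_FRAGMENTS_PER_CONTIG:
        problems.append(
            f"{len(fragment_lengths)} fragments exceeds {MAX_FRAGMENTS_PER_CONTIG}")
    if contig_length > MAX_CONTIG_BASES:
        problems.append(
            f"contig of {contig_length} bases exceeds {MAX_CONTIG_BASES}")
    for index, length in enumerate(fragment_lengths, start=1):
        if length > MAX_FRAGMENT_BASES:
            problems.append(
                f"fragment {index} has {length} bases, limit is {MAX_FRAGMENT_BASES}")
    return problems


if __name__ == "__main__":
    print(check_fas_limits([593, 2600], 3000))
```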
GelStart: Begins a fragment assembly session by creating a new fragment assembly project or by identifying an existing project.
GelEnter: Enters fragment sequences into a fragment assembly project from your terminal keyboard, a digitizer, or existing sequence files.
GelMerge: Aligns the sequences in a fragment assembly project into assemblies called contigs.
GelAssemble: A multiple sequence editor for viewing and editing contigs assembled by GelMerge.
GelView: Displays the structure of the contigs in a fragment assembly project.
GelDisassemble: Breaks up the contigs in a fragment assembly project into single fragments.

Example GelView display:
Contig: mu26b
8 mu18b +--------------------->
7 mu9 <---+
6 mu32 +--->
5 mu26 <----+
4 mu18 +---->
3 mu27 <--------------------------------+
2 mu26b <------------------+
C CONSENSUS <-----------------------------------------------+
|----------|----------|----------|---------|---------|
0 100 200 300 400
GelStart
Use GelStart to create a new project database for each sequencing project. For each new project, GelStart creates a new directory, named after the project, as a subdirectory of your current working directory.

gcg% gelstart -check
Minimal Syntax: % gelstart [-NAME=]MyProject -Default
Prompted Parameters:
  -NEWproject                        begins a new sequencing project
  -VECtors=GB:M13mp18,GB:SynpBR322   highlights specified sequences in GELENTER
  -SITes=GAATTC,GGATCC               highlights specified patterns in GELENTER
Local Data Files: None
Optional Parameters:
  -DELete                            deletes a whole project!
  -NOMONitor                         suppresses the screen monitor
SeqEd
• SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.
Mode switching: <Ctrl>D goes from screen mode to command mode; <Return> goes from command mode back to screen mode.
Example screen:
AGTCTTAGTCGATCGTAcTGCATRCGA
....|:.......:|.........i.......:.|.........|.........|.........|.........|..
0 10 20 30 40 50 60 70
"sample.seq" 27 nucleotides
Screen Mode
G, A, T, . . .      - insert a sequence character
<Delete>            - delete a sequence character
<Ctrl>H             - delete a sequence character
/TAACG<Return>      - find the next occurrence of TAACG (last pattern entered is the default)
1<Return>           - move to start of the sequence
<Ctrl>E             - move to end of the sequence
[n]<Right-arrow>    - go ahead n characters
[n]<Left-arrow>     - go back n characters
<Up-arrow>          - go up to check sequence
<Down-arrow>        - go down to original sequence
'markcharacter      - go to marked position
37<Return>          - go to position 37 (any positive integer)
<                   - go back 50 characters
>                   - go ahead 50 characters
<Ctrl>R             - redraw the screen
<Ctrl>D             - enter command mode
[n] is an optional numeric parameter.
Command Mode
• EDit seqname - get a new sequence file to edit
• [n] Include [seqname] - insert another sequence [at position n] (SeqEd prompts for range and strand)
• s,f Delete - delete a range of bases
• [s] Check [/Blind] - check a range of bases [beginning at s]
• 37 - go to base 37
• REDraw - redraw the screen
• [n] COmment comment - insert a comment [at position n]
• [n] COmment - enter comment editing mode [at position n]
• [n] HEAding - edit documentary heading [at line n]
• change - enter screen mode (<Return> is sufficient)
• screen - enter screen mode (<Return> is sufficient)
• OVERstrike - enter overstrike mode
• INSert - enter insert mode
• [n] Mark markcharacter - mark the sequence [at position n]
• PERFect - require finds to be perfect matches
• PROtein - set sequence type to PROTEIN
• NUCleotide - set sequence type to NUCLEOTIDE
• [s,f] Write [seqname] - write [a part of] the sequence to a file
• DIGitizer - enter digitizer mode
• RELoad - enter reload mode
• ACCept - terminate reload mode
• Help - show commands in screen and command modes
• [s,f] EXit [seqname] - write [a part of] the sequence and quit
• Quit - quit the editor without writing the sequence
[n] indicates an optional parameter.
s and f are numbers for start and finish of a range of interest.
GelEnter
GelEnter is a sequence editor that accepts sequence data.

gcg% gelenter -check
Minimal Syntax: % gelenter [-INfile1=]mu*.seq
Prompted Parameters: None
Local Data Files: set.keys (must be in your current working directory to be used)
Optional Parameters:
  -ENTER=mu*.seq           enters existing files into the database
  -STAden                  enters existing Staden format files into the database
  -FASTA                   enters existing FASTA format files into the database
  -SINGlecommand           automatically returns to screen mode after each command
  -PERFect                 sets find to search for perfect symbol matches
  -VECtors=gb:synpbr322    highlights sequences from pBR322
  -SITes=gaattc            highlights GAATTC patterns
  -LANes=g,A,T,C           sets lane order for digitizer
  -MINOverlap=10           sets minimum overlap length for Reload command
  -PCTOverlap=95           sets stringency for the Reload command
  -TOLerance=0.4           sets tolerance for digitizing ambiguity (0 to 1), with 1 being the most tolerant
GelEnter
GelEnter adds fragment sequences to a fragment assembly project. It accepts sequence data from your terminal keyboard, a digitizer, or existing sequence files. GelEnter accepts any valid GCG sequence character. Once you enter sequences into a project database, you can no longer edit them with GelEnter.

gcg2 21% gelenter seq02.dat
"seq02" 593 nucleotides

IUB/GCG   Meaning
A         A
C         C
G         G
T/U       T
M         A or C
R         A or G
W         A or T
S         C or G
Y         C or T
K         G or T
V         A or C or G
H         A or C or T
D         A or G or T
B         C or G or T
X/N       G or A or T or C
./~       gap character
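The IUB/GCG code table maps directly onto a lookup table. The sketch below shows one way an ambiguity-aware pattern search could be written; the names IUB, compatible and find_pattern are hypothetical and this is not how GCG implements its own find command. Gap characters (. and ~) are deliberately left out and would raise a KeyError.

```python
# IUB/GCG nucleotide codes as sets of possible bases (sketch only).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def compatible(a, b):
    """True if the two symbols share at least one possible base."""
    return bool(set(IUB[a.upper()]) & set(IUB[b.upper()]))


def find_pattern(pattern, seq):
    """0-based positions where every pattern symbol is compatible with seq."""
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        if all(compatible(p, s) for p, s in zip(pattern, window)):
            hits.append(i)
    return hits


if __name__ == "__main__":
    # Y (C or T) in the pattern matches the C of GAATTC.
    print(find_pattern("GAATTY", "AAGAATTCAA"))
```

This is the tolerant kind of matching; a perfect-match search, as selected by -PERFect, would compare the symbols literally instead.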
GelMerge
GelMerge automatically recognizes overlaps among all of the sequences in a project database and creates aligned assemblies, called contigs, from the overlapping sequences. These contigs are stored in the project database. As you add new sequences that connect separate contigs to the project database, GelMerge aligns the contigs into larger assemblies.

% GelMerge
What word size (* 7 *) ?
What fraction of the words in an overlap must match (* 0.80 *) ?
What is the minimum overlap length (* 14 *) ?
Reading ............
Comparing ............
Aligning .........
Writing ...
Input Contigs: 12
Output Contigs: 3
CPU time: 02.29 (seconds)
Minimal Syntax: % gelmerge -Default
Prompted Parameters:
  -WORdsize=7                     sets word size for overlap determination
  -STRIngency=0.8                 sets minimum fraction of matching words in overlap
  -MINOverlap=14                  sets minimum length of overlap
Local Data Files:
  -MATRix1=gelmergedna.cmp        assigns the scoring matrix for contig assembly
  -MATRix2=gelmergelocaldna.cmp   assigns the scoring matrix for vector recognition
Optional Parameters:
  -MINIdentity=14                 sets minimum run of identical bases found at least once in an overlap between two contigs
  -MAXGap=10                      sets maximum gap size for overlap determination
  -GAPweight=8                    sets gap creation penalty in contig assembly
  -LENgthweight=2                 sets gap extension penalty in contig assembly
  -ARChive                        creates contigs from the original gel readings
  -WORKing                        creates contigs from individual working fragments (with gaps removed)
  -REPortfile[=Filename]          writes report of recognized vector sequences
  -EXCise                         removes vector sequences from single-fragment contigs
  -VECTORSTRingency=0.8           sets minimum fraction of matches in vector recognition
  -VECTORMINIdentity=12           sets minimum run of identical bases found at least once in a match between vector and fragment
  -VECTORMAXGap=5                 sets maximum gap size in first step of vector recognition
  -VECTORGAPweight=30             sets gap creation penalty in vector recognition
  -VECTORLENgthweight=3           sets gap extension penalty in vector recognition
  -NOMERge                        suppresses contig assembly
  -NOMONitor                      suppresses screen trace of program progress
  -NOSUMmary                      suppresses screen summary at the end of the program
  -BATch                          submits program to the batch queue
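To get a feel for what the three prompted parameters control, here is a deliberately simplified overlap test. It is an assumption-laden sketch and not GelMerge's algorithm: it only looks at ungapped overlaps between the end of one fragment and the start of another, ignores the reverse strand, and the names words, overlap_score and best_overlap are hypothetical. The defaults mirror the prompted values (word size 7, stringency 0.8, minimum overlap 14).

```python
# Simplified, ungapped suffix/prefix overlap detection (not GelMerge itself).

def words(seq, size):
    """All overlapping words (k-mers) of the given size, in order."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def overlap_score(a, b, length, word_size):
    """Fraction of words that agree when the last `length` bases of a
    are laid over the first `length` bases of b."""
    tail = words(a[-length:], word_size)
    head = words(b[:length], word_size)
    same = sum(1 for x, y in zip(tail, head) if x == y)
    return same / len(tail)


def best_overlap(a, b, word_size=7, stringency=0.8, min_overlap=14):
    """Longest overlap length that passes both thresholds, or None."""
    a, b = a.upper(), b.upper()
    shortest = max(min_overlap, word_size)   # need at least one whole word
    for length in range(min(len(a), len(b)), shortest - 1, -1):
        if overlap_score(a, b, length, word_size) >= stringency:
            return length
    return None


if __name__ == "__main__":
    left = "TTTTTTTTTT" + "ACGTACGGATCCAGTCA"
    right = "ACGTACGGATCCAGTCA" + "GGGGGGGGGG"
    print(best_overlap(left, right))
```

Raising the word size or the stringency makes the test stricter (fewer false joins, more missed overlaps); lowering the minimum overlap does the opposite.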
GelAssemble
Mode switching: <Ctrl>D goes from screen mode to command mode; <Return> goes from command mode back to screen mode.

After assembling contigs with GelMerge, use the contig editor, GelAssemble, to review and modify the alignments. After choosing a contig for review, GelAssemble lets you edit the individual sequences in that contig to resolve inconsistencies. GelAssemble creates a consensus sequence that uses the IUB nucleotide ambiguity codes. You can modify a sequence and change the alignment in the same way you edit text with a text editor.

Although GelMerge assembles and aligns contigs automatically, you can assemble contigs manually using GelAssemble. For example, you could manually assemble separate contigs that do not share sufficient overlap for GelMerge to assemble automatically. You can also separate fragments from a contig if you believe they should not be included. Once you are satisfied with a contig, you can store it in the sequencing project database.

seq03     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 220
seq01     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
CONSENSUS > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
            .........+.........+.........+.........+.........+.........+
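How a consensus with IUB ambiguity codes can arise from aligned fragments is sketched below. The rule used here is an assumption chosen for simplicity: every base seen in a column contributes, and any disagreement yields the matching ambiguity code. GelAssemble's actual consensus rule is not described in these slides, and the function names are hypothetical.

```python
# Column-by-column consensus with IUB ambiguity codes (illustrative rule only).
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
# Reverse lookup: set of observed bases -> single code letter.
CODE_FOR = {frozenset(bases): code for code, bases in CODES.items()}


def consensus_column(column):
    """IUB code for the bases in one alignment column ('.' if only gaps)."""
    observed = frozenset(b.upper() for b in column if b.upper() in "ACGT")
    return CODE_FOR.get(observed, ".")


def consensus(rows):
    """Consensus of aligned rows; '.' marks a gap or an uncovered position."""
    width = max(len(row) for row in rows)
    padded = [row.ljust(width, ".") for row in rows]
    return "".join(consensus_column(col) for col in zip(*padded))


if __name__ == "__main__":
    print(consensus(["GATC", "GGTC"]))   # A/G disagreement in column 2
```

In the alignment shown above, seq03 and seq01 agree at every position, so the consensus contains no ambiguity codes.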
Gelassemble Screen Mode
Keys Pressed          Action
[n]<Right-arrow>      move ahead [n bases]
[n]<Left-arrow>       move back [n bases]
[n]<Up-arrow>         move up [to row n]
[n]<Down-arrow>       move down [to row n]
>                     scroll one screen to the right
<                     scroll one screen to the left
1<Return>             move to start of the sequence
<Ctrl>E               move to end of the sequence
165<Return>           move to base 165 in sequence
/GATTC<Return>        find next occurrence of GATTC
<Ctrl>A               move to next ambiguity in alignment
<Ctrl>R               move to next ambiguity in sequence
<Ctrl>V               move to next gap in consensus
<Ctrl>D               enter Command Mode
<Ctrl>L               toggle alignment display enlargement
<Ctrl>W               redraw the screen
<Ctrl>O               toggle INSERT/OVERSTRIKE mode
!                     summary of current sequence
?                     display these help screens
<Ctrl>G               recalculate the consensus
G A T C ....          add base at the cursor
<Delete>              delete a base, or move sequence left
<Ctrl>H               delete a base, or move sequence left
<Space bar>           move the sequence to the right
<Ctrl>X               delete alignment column
<Ctrl>I               restore alignment column
<Ctrl>B               begin selecting a range for removal
<Ctrl>N               remove the selected range
<Ctrl>P               insert the removed range
-                     reject current fragment
Gelassemble Command Mode
[a,b] specifies a range of fragments. [x,y] specifies a range of bases. [n] is an optional numeric parameter.
EDit [ContigName]            replace current contig with a new contig
CONTIGs                      select another contig for editing
WRite                        write a contig to the database
EXit                         write the contig and quit
QUIT                         quit without writing
ERASE                        delete current contig from the database
238                          move to position 238 in the current fragment
[x,y] PRETTYout [FileName]   write the sequence alignment [position x - y]
[a,b] SEQOUT                 write fragments [a - b] to sequence files
BIGPICture [FileName]        write bar schematic to an output file
OVERstrike                   select OVERSTRIKE sequence edit mode
NOOVERstrike                 select INSERT sequence edit mode
[x,y] CONSensus              recalculate the consensus sequence
[a,b] LOCk                   lock strands [a through b]
[a,b] Unlock                 unlock strands [a through b]
[x,y] SELect                 select bases [x through y]
REMove                       remove the selected bases
[n] INSert                   insert the removed bases [at position n]
CAncel                       cancel the selection
[x,y] DElete                 delete bases [x through y]
GOTo [FragmentName]          move to strand by name
FInd GAATC                   find the next occurrence of GAATC
DIfferences                  show differences from the consensus
MAtches                      show matches with the consensus
Neither                      show neither matches nor differences
REDraw                       redraw the screen
Help                         display these help screens
SORt [DEScending]            sorts strands by their offsets in alignment
[a,b] MOve                   moves a strand [from line a to line b]
OPen                         opens a blank line at the cursor position
[a,b] ANChor                 anchors strands [a through b]
[a,b] NOANchor               unanchors strands [a through b]
LOad [ContigName]            loads another contig into the Edit Screen
REVerse                      reverse-complement the (anchored) strand(s)
[n] Offset                   shifts the current fragment [to begin at n]
REJect                       removes the current fragment from the screen
NODUPlicate                  removes a duplicated fragment from the screen
SPAWN                        renames a duplicated fragment
SEParate                     makes two contigs from anchored and unanchored strands
GelView
GelView displays bar diagrams that show the overlaps among the fragments in each contig, providing a schematic view of the whole sequencing project.
Gelview filename.view, then cat/more filename.view

GELVIEW Fragment Assembly contig display of Project: bio
May 4, 2000 17:42

Contig: seq01
3 seq03 +------------------->
2 seq01 +----------------------------->
C CONSENSUS +------------------------------------>
|----------|----------|----------|---------|---------|
0 200 400 600 800

Contig: seq04
3 seq02 <---------------+
2 seq04 +------------>
C CONSENSUS +--------------------------->
|----------|----------|----------|---------|---------|
0 400 800 1200 1600

Contig: seq05
2 seq05 +---------------------------->
C CONSENSUS +---------------------------->
|----------|----------|----------|---------|---------|
0 200 400 600 800

5 Fragments in 3 Contigs
GelDisassemble
GelDisassemble breaks up the contigs in a sequencing project, thus recreating the database as a collection of single fragments.

% geldisassemble
Are you sure you want to disassemble your project (* No *) ? Yes
1) Emptying "relation" directory....
2) Emptying "consensus" directory....
3) Copying "working" to "consensus"....
4) Creating "relation"....
Gel Project Disassembled
Exercise 91-08
• Download Bioinfo91-08.exe
• Decompress the file
• Start CuteFTP and transfer the files seq01.txt-seq10.txt to GCG
• Start GCG FAS
• Questions:
  (1) What is the correct order of the assembled sequence?
  (2) Which putative protein does this sequence encode?
  (3) Are there any potential regulatory elements upstream of the gene?
  (4) What is the identity with the human protein?
Protein-mRNA-Gene
mRNA-Protein-Gene
Gene-mRNA-Protein
Download Bioinfo91-08.exe and decompress the file. You will find the following files in FASTA format:
• Gene.txt
• Protein.txt
• RNA.txt
Are there any standard procedures?
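FASTA-format files such as Gene.txt, Protein.txt and RNA.txt are simple enough to read without special software. A minimal sketch, assuming the usual layout of a ">" header line followed by one or more sequence lines; the function name parse_fasta and the demo records are hypothetical.

```python
# Minimal FASTA reader working on text already loaded into a string (sketch).

def parse_fasta(text):
    """Return {record name: sequence}; the name is the first word after '>'."""
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)           # sequence may span several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records


if __name__ == "__main__":
    demo = ">seq1 demo record\nACGT\nAC\n>seq2\nGG\n"
    print(parse_fasta(demo))
```

To read one of the exercise files, pass the file contents, for example parse_fasta(open("Gene.txt").read()).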
Gene-mRNA-Protein
• FILE PROCESSING (Trace File Viewer & Format Converter)
• OPEN READING FRAME: DNA, RNA; Reverse or Directional
• SECONDARY STRUCTURE
• RESTRICTION MAPPING
• HOMOLOGY SEARCH: FASTA, BLASTn, BLASTx
• ALIGNMENT: Bestfit, gap, pileup
• MOTIF SEARCH
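The Gene-mRNA-Protein direction can be walked through in code: join the exons, transcribe, then translate. This is a sketch under assumptions: exon coordinates are taken as 1-based inclusive ranges (the same convention as the cds:10-858 style annotations used in the exercises), the standard genetic code is used, and the function names and the toy gene are hypothetical.

```python
# Gene -> mRNA -> protein with the standard genetic code (illustrative sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(gene, exons):
    """Join exon ranges given as 1-based inclusive (start, end) pairs."""
    return "".join(gene[start - 1:end] for start, end in exons)


def transcribe(dna):
    """Coding-strand DNA to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first base until a stop codon or the end."""
    seq = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino = CODON_TABLE.get(seq[i:i + 3], "X")   # X for ambiguous codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    # Toy gene: exon 1 = bases 3-5, intron = 6-14, exon 2 = 15-20.
    gene = "CC" + "ATG" + "GTAAGTCAG" + "GCCTAA" + "CC"
    mrna = transcribe(splice(gene, [(3, 5), (15, 20)]))
    print(mrna, translate(mrna))
```

The reverse directions (Protein-mRNA-Gene, mRNA-Protein-Gene) cannot be done by simple lookup because the genetic code is degenerate, which is where the homology searches listed above come in.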