DNA sequence analysis

DNA sequence analysis School B&I TCD Bioinformatics May 2010

A, T/U, C, G • Simple code, lots of sequence • Sequence analysis • Computer intensive • BLAST homology searching • Gene/exon prediction • Multiple sequence alignment • Alignments in general • “Trivial”

Trivial • Could be done by hand • Computers • Quicker • More reliable • Examples • Translate DNA • Restriction sites • Synonymous codon usage

Sequence formats
• Fasta format
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
• Phylip format
4 131
IXI_234   TSPASIRPPA GPSSRPAMVS SRRTRPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_235   TSPASIRPPA GPSSR----- ----RPSPPG PRRPTGRPCC SAAPRRPQAT
IXI_236   TSPASIRPPA GPSSRPAMVS SR--RPSPPP PRRPPGRPCC SAAPPRPQAT
IXI_237   TSPASLRPPA GPSSRPAMVS SRR-RPSPPG PRRPT----C SAAPRRPQAT
• CLUSTAL W(1.4) multiple sequence alignment
IXI_234   TSPASIRPPAGPSSRPAMVSSRRTRPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_235   TSPASIRPPAGPSSR---------RPSPPGPRRPTGRPCCSAAPRRPQAT
IXI_236   TSPASIRPPAGPSSRPAMVSSR--RPSPPPPRRPPGRPCCSAAPPRPQAT
IXI_237   TSPASLRPPAGPSSRPAMVSSRR-RPSPPGPRRPT----CSAAPRRPQAT
• Interconvert: http://thr.cit.nih.gov/molbio/readseq/
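Reading Fasta is one of the "could be done by hand" jobs. A minimal sketch in Python, assuming plain Fasta text with ">" header lines; the function name parse_fasta and the two example records are made up for illustration and are not part of any package mentioned here.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example, not real database entries.
example = """>seq1 hypothetical example record
ATGCCCGCA
TTTGAATAA
>seq2
GGATCC
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

For real work a format converter such as readseq (link above) or a tested library is the safer choice; this only shows how little there is to the format.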

DNA sequence analysis • Google EMBOSS • A suite of programs with the same look and feel • Does pretty much everything you need • Can be installed locally

Translation
• DNA is anti-parallel.
• One strand 5' - 3' pairs with the complementary strand 3' - 5'
• Translation and transcription always run 5' - 3'
• Six possible translations, three on each strand
• Frame 1: ATG CCC GCA TTT GAA [TAA]
• Frame 2: A TGC CCG CAT TTG AAT AA
• Frame 3: AT GCC CGC ATT [TGA] ATA A
• Stop codons are shown in brackets (underlined on the original slide)
• Frameshift errors
• Frameshift mutations
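A six-frame translation of the example sequence can be sketched in a few lines of Python. This assumes the standard ("universal") code, builds the codon table from the usual TCAG ordering, and writes unknown codons as X; the function names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame=0):
    """Translate complete codons starting at offset frame (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


for label, protein in six_frames("ATGCCCGCATTTGAATAA").items():
    print(label, protein)
```

The three forward frames come out as MPAFE*, CPHLN and ARI*I, with * marking a stop codon. A one-base insertion or deletion moves the reader from one of these frames to another, which is why frameshift errors are so damaging to a predicted protein.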

Genetic code
The “Universal” Genetic Code.
Phe UUU   Ser UCU   Tyr UAU   Cys UGU
    UUC       UCC       UAC       UGC
Leu UUA       UCA   ter UAA   ter UGA
    UUG       UCG   ter UAG   Trp UGG
Leu CUU   Pro CCU   His CAU   Arg CGU
    CUC       CCC       CAC       CGC
    CUA       CCA   Gln CAA       CGA
    CUG       CCG       CAG       CGG
Ile AUU   Thr ACU   Asn AAU   Ser AGU
    AUC       ACC       AAC       AGC
    AUA       ACA   Lys AAA   Arg AGA
Met AUG       ACG       AAG       AGG
Val GUU   Ala GCU   Asp GAU   Gly GGU
    GUC       GCC       GAC       GGC
    GUA       GCA   Glu GAA       GGA
    GUG       GCG       GAG       GGG
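Because most amino acids have several codons, the table is also the starting point for synonymous codon usage. A rough Python sketch, assuming an in-frame coding sequence and the standard code; codon_usage is a hypothetical helper name and the input is a toy sequence, not real data.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Count codons in an in-frame coding sequence, grouped by amino acid."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA spelling
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    usage = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)
        if aa is not None:  # ignore codons containing ambiguous bases
            usage[aa][codon] = n
    return dict(usage)


# Toy example: Met, Leu, Leu, Leu, stop.
print(codon_usage("ATGCTGCTGTTATAA"))
```

Here leucine is encoded twice by CTG and once by TTA; on a whole genome the same counts show which synonymous codons an organism prefers.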

Exceptions to the code
• #1: Yeast Mitochondrial Code: CUN=T AUA=M UGA=W
• #2: Mitochondrial Code of Vertebrates: AGR=* AUA=M UGA=W
• #3: Mitochondrial Code of Filamentous fungi: UGA=W
• #4: Mitochondrial Code of Insects and platyhelminths: AUA=M UGA=W AGR=S
• #5: Nuclear Code of Candida cylindracea: CUG=S (*)
• #6: Nuclear Code of Ciliata: UAR=Q
• #7: Nuclear Code of Euplotes: UGA=C
• #8: Mitochondrial Code of Echinoderms: UGA=W AGR=S AAA=N
• #9: Mitochondrial Code of Ascidaceae: UGA=W AGR=G AUA=M
• #10: Mitochondrial Code of Platyhelminthes: UGA=W AGR=S UAA=Y AAA=N
• #11: Nuclear Code of Blepharisma: UAG=Q (*)
(*) see Nature 341:164
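In software, each exception is usually handled as a small set of overrides on top of the standard table. A sketch in Python, assuming only the ambiguity letters used in the list above (R = A or G, Y = C or T, N = any base); make_code and expand are illustrative names, not functions from EMBOSS or any other package.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Ambiguity letters needed for the list above; U is treated as T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

# Overrides copied from entries #1 and #2 of the list above.
YEAST_MITO = {"CUN": "T", "AUA": "M", "UGA": "W"}
VERTEBRATE_MITO = {"AGR": "*", "AUA": "M", "UGA": "W"}


def expand(pattern):
    """Expand a pattern such as AGR into the concrete codons AGA and AGG."""
    return ["".join(p) for p in product(*(IUPAC[ch] for ch in pattern))]


def make_code(overrides):
    """Return a copy of the standard table with the given overrides applied."""
    table = dict(STANDARD)
    for pattern, aa in overrides.items():
        for codon in expand(pattern):
            table[codon] = aa
    return table


def translate(dna, table):
    """Translate frame 1 of dna with the supplied codon table."""
    dna = dna.upper().replace("U", "T")
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


seq = "ATGATATGAAGA"  # toy sequence: ATG ATA TGA AGA
print(translate(seq, STANDARD))                    # MI*R
print(translate(seq, make_code(VERTEBRATE_MITO)))  # MMW*
```

The same twelve bases give a stop after two residues with the standard code, but read through TGA and stop at AGA with the vertebrate mitochondrial code, so choosing the wrong table silently changes the predicted protein.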

Start codons • ATG is the “universal” start codon … but • 10% of E. coli genes start with GTG • 1% start with TTG • Bioinformaticians only make predictions • Molecular biologists verify
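The effect of allowing alternative start codons can be seen in a toy ORF scan. This is a sketch under simplifying assumptions (forward strand only, first start codon in each frame wins, no scoring); find_orfs and its parameters are hypothetical names, and anything it reports is a prediction to be verified at the bench.

```python
START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAA", "TAG", "TGA")


def find_orfs(dna, starts=START_CODONS, min_codons=2):
    """Return (start, end, start_codon) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in starts:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:start + 3]))
                start = None  # close the ORF and look for the next one
    return orfs


seq = "CCGTGAAACCCTAGTT"  # toy sequence containing GTG AAA CCC TAG
print(find_orfs(seq))                   # [(2, 14, 'GTG')]
print(find_orfs(seq, starts=("ATG",)))  # []
```

With ATG as the only permitted start the short ORF is missed completely, which is the practical cost of treating ATG as truly universal.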

Restriction sites
• Essential for the construction of plasmids
• A key tool for molecular biology
• Hundreds available commercially
• Need to decide which to order
• Costs range from $3.80 to $500 per 1000 units
• http://tools.neb.com/NEBcutter2/index.php
• Usually need an enzyme that cuts once
• Recognition sites (^ marks the cut):
EcoRI   5' G^AATTC 3'
        3' CTTAA^G 5'
BamHI   5' G^GATCC 3'
        3' CCTAG^G 5'
AluI    5' AG^CT 3'   (blunt end)
        3' TC^GA 5'
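Finding an enzyme that cuts once is a plain string search. A minimal Python sketch using the EcoRI, BamHI and AluI recognition sequences listed above; it treats the sequence as linear (a circular plasmid would also need a check across the join), and the test sequence is invented for illustration.

```python
# Recognition sequences from the slide. All three are palindromic,
# so searching one strand is enough.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "AluI": "AGCT"}


def find_sites(dna, site):
    """Return 0-based start positions of every (possibly overlapping) match."""
    dna = dna.upper()
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)
    return positions


def single_cutters(dna, sites=SITES):
    """Return {enzyme: position} for enzymes whose site occurs exactly once."""
    hits = {name: find_sites(dna, site) for name, site in sites.items()}
    return {name: pos[0] for name, pos in hits.items() if len(pos) == 1}


# Invented sequence: one EcoRI site, two AluI sites, two BamHI sites.
plasmid = "TTGAATTCAAAGCTTTAGCTGGGATCCAAGGATCCTT"
print(single_cutters(plasmid))  # {'EcoRI': 2}
```

Tools such as NEBcutter do the same search against the full enzyme catalogue and also report where within the site each enzyme cuts.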

Promoter Prediction • Aim: find the start of the transcript (97% of the human genome is non-coding) • False positive rate too high • Promoters are predicted about 1 per kb, but gene density is about 1 per 100 kb • RNA pol II transcribes DNA to RNA • Needs general transcription factors (GTFs) • Also specific (species, tissue, developmental stage) TFs • TF binding sites are short and “fuzzy” • 7% of vertebrate genes code for TFs

Promoters 2
NF-AT4 matrix (3 known sites) and consensus:
A   0 0 3 3 3 0 0 1
C   1 2 0 0 0 0 0 2
G   0 0 0 0 0 1 1 0
T   2 1 0 0 0 2 2 0
    T C A A A T T C
Consensus YYAAAKKM = [CT](2)AAA[GT](2)[AC]
Predicts five sites in 3 kb upstream of human IL-11:
Bp 007 TTAAAGGC
Bp 248 ACAAATTC
Bp1959 GAGTTTGA
Bp2154 TCAAAGGA
Bp2181 GACTTTTA
Ask whether a TF site relevant to your cell type is present.
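How the consensus falls out of the count matrix, and how it can be used to scan a sequence, can be sketched in Python. The counts are the NF-AT4 matrix above; the function names and the scanned test sequence are invented for illustration, and the scan covers the forward strand with exact consensus matches only.

```python
import re

# NF-AT4 count matrix from the slide: one list per base, one entry per position.
COUNTS = {
    "A": [0, 0, 3, 3, 3, 0, 0, 1],
    "C": [1, 2, 0, 0, 0, 0, 0, 2],
    "G": [0, 0, 0, 0, 0, 1, 1, 0],
    "T": [2, 1, 0, 0, 0, 2, 2, 0],
}

# Ambiguity codes for the observed base sets, and the matching regex classes.
CODE_FOR = {"A": "A", "C": "C", "G": "G", "T": "T", "CT": "Y", "GT": "K",
            "AC": "M", "AG": "R", "CG": "S", "AT": "W"}
REGEX_FOR = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "K": "[GT]",
             "M": "[AC]", "R": "[AG]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]"}


def consensus(counts):
    """One ambiguity code per column, covering every base seen at least once."""
    out = []
    for i in range(len(counts["A"])):
        seen = "".join(b for b in "ACGT" if counts[b][i] > 0)
        out.append(CODE_FOR.get(seen, "N"))
    return "".join(out)


def most_frequent(counts):
    """The single most common base in each column."""
    n = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(n))


def scan(dna, cons):
    """Return (1-based position, matched site) for every consensus match."""
    body = "".join(REGEX_FOR[c] for c in cons)
    pattern = re.compile("(?=(" + body + "))")  # lookahead allows overlaps
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(dna.upper())]


print(consensus(COUNTS))      # YYAAAKKM
print(most_frequent(COUNTS))  # TCAAATTC
print(scan("GGTTAAAGGCAATCAAAGGATT", consensus(COUNTS)))
```

Exact consensus matching is stricter than scoring against the matrix: ACAAATTC from the list above, for instance, fails the Y at position 1. With only three known sites behind the matrix, either approach will report many hits that mean nothing in a given cell type.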

Primer design • You will be asked to design primers for sequencing, PCR etc. • Manual pages cover this • Computationally trivial, so there are many websites to choose from

Not trivial • Nucleic acid (NA) secondary structure • EMBOSS einverted for short palindromes • mFOLD • Huge database of 16S rRNA structures • miRNA sites

Secondary Structure • DNA (and RNA) can form base-pairs. • Not all of these are with complementary strands. • [Figure: two panels, "Closer to reality" and "Bioinformatic view = a cartoon"]
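The simplest case of a strand pairing with itself is a hairpin: a short stem whose reverse complement appears a few bases downstream. A naive Python sketch of that search, assuming perfect Watson-Crick stems of fixed length and no energy model; it is not how einverted or mFOLD work, and find_hairpins and its parameters are illustrative names.

```python
def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_hairpins(dna, stem=4, min_loop=3, max_loop=8):
    """Return (stem_start, stem_seq, loop_length, partner_start) tuples.

    A hit means the stem at stem_start could base-pair with the stretch
    at partner_start, leaving loop_length unpaired bases in between.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2 * stem - min_loop + 1):
        left = dna[i:i + stem]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if dna[j:j + stem] == target:
                hits.append((i, left, loop, j))
    return hits


# Toy sequence: GGAC, a four-base loop, then GTCC (reverse complement of GGAC).
print(find_hairpins("TTGGACAAAAGTCCTT"))  # [(2, 'GGAC', 4, 10)]
```

Real structure prediction also has to allow mismatches, G-U pairs in RNA, and competing alternative structures, which is what makes the problem non-trivial.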

16S rRNA • [Figure: 16S rRNA secondary structures, panels labelled Gram -ve and Gram +ve] • Evolutionary consequences? • Coordinated/dependent mutational change

RDP • Ribosomal Database Project-II Release 9 Notes • RDP Release 9.42 (Release 9, update 42) consists of 262,030 aligned and annotated 16S rRNA sequences, along with five online analysis tools.
