RAJESH KUMAR, Ph.D. 1st yr, Dairy Microbiology Division, N.D.R.I.

Genome & Protein "Sequence Analysis Programs": application in establishing epidemiology and variability

Introduction
• Bioinformatics / Computational Biology
• Proteomics: large-scale study of proteins.
• Genomics: study of an organism's genome and the use of its genes.
• Comparative Genomics: comparison of genomes.
• Structural Genomics: determination of the three-dimensional structure of all proteins of a given organism.

Major research efforts of bioinformatics:
• Sequence analysis / alignment
• Gene finding
• Genome assembly
• Protein structure alignment
• Protein structure prediction
• Prediction of gene expression and protein-protein interactions
• Modeling of evolution

Sequence Analysis
Encompasses the use of various bioinformatic methods to determine the biological function and structure of genes and the proteins they encode.
DNA sequences → decoded → stored in electronic databases → analysis → phylogenetic tree, comparative genomics

Shotgun Sequencing
Used in genetics for sequencing long DNA strands.
DNA → small segments → sequenced → computer programs reassemble the overlapping reads

Sequence Alignment: arrangement of two or more sequences so that their similarity is highlighted.

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt
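The alignment shown above can be summarised numerically. The following is a minimal sketch, assuming the two rows are already aligned and padded with "-" for gaps; the function name `alignment_summary` and the choice of identity as matches divided by alignment columns are illustrative assumptions (other conventions for percent identity exist).

```python
def alignment_summary(top: str, bottom: str, gap: str = "-"):
    """Summarise a gapped pairwise alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    match_line = []
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            gaps += 1                 # column with a gap in either row
            match_line.append(" ")
        elif x == y:
            matches += 1              # identical residues
            match_line.append("|")
        else:
            mismatches += 1           # substitution
            match_line.append(" ")
    identity = matches / len(top)     # assumption: identity over all columns
    return "".join(match_line), matches, mismatches, gaps, identity


top = "tcctctgcctctgccatcat---caaccccaaagt"
bottom = "tcctgtgcatctgcaatcatgggcaaccccaaagt"
line, matches, mismatches, gaps, identity = alignment_summary(top, bottom)
print(top)
print(line)
print(bottom)
print(f"matches={matches} mismatches={mismatches} gaps={gaps} identity={identity:.2f}")
```

For the two sequences on the slide this reports 29 matching columns, 3 mismatches, and 3 gap columns.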

Structural Alignment
• More reliable over long evolutionary distances.
• Useful in identifying structurally conserved regions.
Multiple Alignment
• Extension of pairwise alignment to incorporate more than two sequences into an alignment.
• Helps in the identification of common regions between the sequences.
• Programs: Clustal, which is used in cladistics to build phylogenetic trees.

Framesearch
• An extension of Smith-Waterman for pairwise alignment between a protein sequence and a nucleotide sequence.
• It dynamically considers every possible single-nucleotide insertion or deletion to generate the translation that best matches the protein sequence.
Software: SSEARCH
• Smith-Waterman remains the gold standard for protein-protein or nucleotide-nucleotide pairwise alignment.
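To make the Smith-Waterman idea concrete, here is a minimal sketch of the local alignment score computed by dynamic programming. It covers only the plain sequence-to-sequence case, not the frameshift-aware extension used by Framesearch. The scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration; real searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between a and b."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap         # gap in b
            left = cur[j - 1] + gap    # gap in a
            # The floor at 0 is what makes the alignment local:
            # a poor prefix is simply dropped.
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # shared core "ACGT"
print(smith_waterman_score("ACGT", "ACT"))           # one gap is opened
```

Only two rows of the matrix are kept, which is enough for the score; recovering the alignment itself would require the full matrix and a traceback.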

BLAST
• An algorithm for comparing biological sequences.
• One of the most widely used tools for searching protein and DNA databases for sequence similarities.
• It helps answer questions such as:
• Which bacterial species have a protein that is related in lineage to a certain protein whose amino-acid sequence I know?
• Where does the DNA that I've just sequenced come from?
• What other genes encode proteins that exhibit structures or motifs such as the one I've just determined?

To run, BLAST requires two inputs:
• a query sequence (also called the target sequence)
• a sequence database
BLAST searches for high-scoring sequence alignments in three stages:
• 1st stage: BLAST searches for exact matches of a small fixed length W between the query and the sequences in the database. Each such match is a seed.
• 2nd stage: BLAST tries to extend the match in both directions, starting at the seed.
• If a high-scoring ungapped alignment is found, the database sequence is passed on to the 3rd stage.
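The first two stages can be sketched in a few lines. This is an illustrative toy, not BLAST's actual implementation: the function names, the +1/-1 scoring, and the `x_drop` cut-off (stop extending once the score falls more than this far below the best seen) are assumptions made for the example.

```python
def find_seeds(query: str, subject: str, w: int):
    """Stage 1: exact word matches of length w, as (query_pos, subject_pos)."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, length kept)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break                      # score dropped too far, give up
        qi += step
        sj += step
    return best, best_len


def extend_ungapped(query, subject, qi, sj, w,
                    match=1, mismatch=-1, x_drop=2):
    """Stage 2: ungapped extension of a seed in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    right, right_len = _extend(query, subject, qi + w, sj + w, 1,
                               match, mismatch, x_drop)
    left, left_len = _extend(query, subject, qi - 1, sj - 1, -1,
                             match, mismatch, x_drop)
    return w * match + left + right, qi - left_len, qi + w + right_len


query, subject = "GATTACAGG", "CCGATTACATT"
for qi, sj in find_seeds(query, subject, w=4):
    print((qi, sj), extend_ungapped(query, subject, qi, sj, w=4))
```

In this example every seed extends to the same ungapped segment, `GATTACA`, which is the kind of high-scoring hit that would be handed on for gapped alignment.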

• 3rd stage: BLAST performs a gapped alignment between the query sequence and the database sequence.
• An alternative to BLAST is BLAT (BLAST-Like Alignment Tool).
FASTA
• Slower but more sensitive than BLAST.
• A DNA and protein sequence alignment software package.
• The original FASTP program was designed for protein sequence similarity searching.
• FASTA added a more sophisticated shuffling program for evaluating statistical significance.

Programs in the FASTA package
• "FASTA" is pronounced "FAST-Aye" and stands for "FAST-All".
• It extends the earlier "FAST-P" (protein) and "FAST-N" (nucleotide) alignment programs.
The current FASTA package contains programs for:
• protein:protein
• DNA:DNA
• protein:translated DNA
• ordered or unordered peptide searches
Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors when comparing nucleotide to protein sequence data.

Clustal
A widely used multiple alignment computer program, available as:
i) ClustalW
ii) ClustalX

Sequence Analysis Programs: EMBOSS
• The European Molecular Biology Open Software Suite (EMBOSS) is a program suite for nucleic acid and protein sequence analysis.
• EMBOSS programs manipulate, analyze, and display nucleic acid and protein sequences.
• Similar in functionality to the commercial GCG Wisconsin software.

PhyloGibbs
• Designed to identify where regulatory molecules (such as transcription factors) bind to DNA.
• Compares DNA from multiple species to identify areas in which the genetic code is statistically similar, and filters out the segments most likely to be of interest to scientists.
AutoEditor: automated correction of sequencing and basecaller errors
• A tool for correcting sequencing and basecaller errors using sequence alignment and chromatogram data.
• On average, AutoEditor corrects 80% of erroneous base calls.
• It also greatly improves the ability to discover SNPs between closely related strains and isolates of the same species.

MUMmer
• System for aligning whole genome sequences.
• Using an efficient data structure called a suffix tree, the system can rapidly align sequences containing millions of nucleotides.
MUMmer 3.0
• Open source.
• Improved efficiency.
• Ability to find non-unique, repetitive matches as well as unique matches.
• New graphical output modules.
Applications:
• MUMmer 1.0 was used to detect numerous large-scale inversions in bacterial genomes.
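The anchors that MUMmer builds its alignments on are maximal unique matches (MUMs): substrings that occur exactly once in each sequence and cannot be extended in either direction. The sketch below finds them by brute force to show the definition only; it does not use a suffix tree, so it is far too slow for real genomes. The function names and the `min_len` default are assumptions for the example.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def maximal_unique_matches(a: str, b: str, min_len: int = 4):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0                      # extend to the right as far as possible
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


print(maximal_unique_matches("TTGACCGTAA", "CCGACCGTGG"))
```

A suffix tree lets the same matches be found in time roughly linear in the sequence lengths, which is what makes whole-genome comparison practical.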

• MUMmer 2.1 was used to align all human chromosomes to one another and to detect numerous large-scale.
• PROmer was used to compare the human and mouse malaria parasites P. falciparum and P. yoelii.
Current use of MUMmer 3.0:
1) Identifying SNPs and other mutations in a large collection of Bacillus anthracis strains.
2) Comparing different assemblies of the same genome at different stages of sequencing and finishing.

• E. coli K12 vs. E. coli O157:H7
• S. cerevisiae vs. S. pombe
• A. fumigatus vs. A. nidulans
• P. falciparum vs. P. yoelii

PSORT WWW Server
PSORT is a computer program for the prediction of protein localization sites in cells. Available versions:
• WoLF PSORT
• PSORT II (recommended for animal/yeast sequences)
• PSORT (old version; for bacterial/plant sequences)
• PSORT-B (recommended for, and applicable to, the sequences of Gram-negative bacteria)

PSORT Prediction (query form)
• Source of input sequence: Gram-positive bacterium, Gram-negative bacterium, yeast, animal, or plant.
• Sequence ID (default is MYSEQ).
• Amino acid sequence, entered by copy and paste; characters other than the standard 20 codes are removed.
• The query is submitted with the Submit button.

PHIRE
• This Visual Basic program performs an algorithmic string-based search on bacteriophage genome sequences.
• It discovers and extracts blocks displaying sequence similarity, without any prior experimental or predictive knowledge.
MB Advanced DNA Analysis
• MB is a relatively small and easy-to-use program.
• Main features of MB:
• restriction analysis
• amino acid analysis
• multiple sequence alignment tool
• dot plot
• calculation of molecular weights and chemical properties of proteins
• prediction of 3D structures for short amino acid sequences

UniPro DPview
• A tool for finding and analyzing matches between genomes.
SEQtools
• Program package for routine handling and analysis of DNA and protein sequences.
• Includes general facilities for sequence and contig editing, restriction enzyme mapping, translation, and repeat identification.
DNA Club
• DNA analysis software.
• Features: remove vector sequence, find ORF, sequence editing, translate to protein sequence, protein sequence editing, RE map, RE map with translation, PCR primer selection, primer or probe evaluation.

ZCURVE
• A new, highly accurate system for recognizing protein-coding genes in bacterial and archaeal genomes, based on the Z curve theory of DNA sequence.
DNA for Windows
• A compact, easy-to-use DNA analysis program, ideal for small-scale sequencing projects.
Webcutter
• A free online tool to help restriction map nucleotide sequences.
• Features:
• a simple, customizable interface
• worldwide platform-independent accessibility via the web
• seamless interfaces to NCBI's GenBank DNA sequence database and to a restriction enzyme database
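At its core, restriction mapping of the kind Webcutter offers is a search for enzyme recognition sites in a sequence. The following is a minimal sketch and not Webcutter's own code: the function name and the three-enzyme table are assumptions for the example (EcoRI, BamHI, and HindIII with their standard recognition sequences), and positions are 0-based starts of the recognition site rather than exact cut positions. All three sites are palindromic, so scanning one strand is sufficient here.

```python
# Small illustrative enzyme table; a real tool would load a full database.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def restriction_map(sequence: str, enzymes=ENZYMES):
    """Return {enzyme: [0-based start positions of its recognition site]}."""
    sequence = sequence.upper()
    sites = {}
    for name, site in enzymes.items():
        positions = []
        start = sequence.find(site)
        while start != -1:
            positions.append(start)
            start = sequence.find(site, start + 1)  # allow overlapping hits
        sites[name] = positions
    return sites


print(restriction_map("AAGAATTCGGGGATCCTTGAATTC"))
```

Enzymes with non-palindromic recognition sequences would also need a scan of the reverse complement.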

Multilocus sequence typing (MLST)
• Compares sequence variation in numerous housekeeping gene targets.
• Developed for Neisseria gonorrhoeae, Streptococcus pneumoniae, and S. aureus.
• Based on the classic multilocus enzyme electrophoresis (MLEE) method used to study the genetic variability of a species.
• Drawbacks: labor-intensive, time-consuming, and costly.
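The bookkeeping behind MLST can be sketched as follows: each distinct sequence at a locus gets an allele number, and each distinct combination of allele numbers (the allelic profile) gets a sequence type. This is a toy illustration under stated assumptions: the locus names `locA` and `locB`, the isolate names, and the sequences are hypothetical, and numbers are assigned in order of first appearance, whereas real schemes take allele and sequence-type numbers from a curated database.

```python
def allelic_profiles(isolates: dict, loci: list) -> dict:
    """Map each isolate to a tuple of allele numbers, one per locus."""
    allele_ids = {locus: {} for locus in loci}   # per locus: sequence -> number
    profiles = {}
    for name, seqs in isolates.items():
        profile = []
        for locus in loci:
            table = allele_ids[locus]
            # A new sequence gets the next free allele number at this locus.
            profile.append(table.setdefault(seqs[locus], len(table) + 1))
        profiles[name] = tuple(profile)
    return profiles


def sequence_types(profiles: dict) -> dict:
    """Give isolates with identical allelic profiles the same sequence type."""
    st_ids = {}
    return {name: st_ids.setdefault(profile, len(st_ids) + 1)
            for name, profile in profiles.items()}


# Hypothetical data: two loci, three isolates.
isolates = {
    "iso1": {"locA": "ACGT", "locB": "GGCC"},
    "iso2": {"locA": "ACGT", "locB": "GGCC"},
    "iso3": {"locA": "ACGA", "locB": "GGCC"},
}
profiles = allelic_profiles(isolates, ["locA", "locB"])
print(profiles)
print(sequence_types(profiles))
```

Isolates sharing a sequence type are indistinguishable at the typed loci, which is why the method needs several loci to resolve strains.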

Single-locus sequence typing (SLST)
• Compares sequence variation of a single target.
• Provides an inexpensive, rapid, objective, and portable genotyping method to subspeciate bacteria.
• Using a single target depends on finding a region for sequencing that is sufficiently polymorphic to provide useful strain resolution.
• Loci with short sequence repeat (SSR) regions may have suitable variability for discriminating outbreaks.

Two S. aureus genes conserved within the species, protein A (spa) and coagulase (coa), have variable SSR regions constructed from closely related 24- and 81-bp tandem repeat units, respectively. The genetic alterations in SSR regions include both point mutations and intragenic recombination that arise by slipped-strand mispairing during chromosomal replication and that result in a high degree of polymorphism.
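One way to compare isolates at such a locus is to cut the SSR region into its tandem repeat units and record the order of unit variants as a profile. The sketch below assumes the region has already been extracted and trimmed to a whole number of units; the function name and the numbering of variants by first appearance are assumptions for illustration (published typing schemes use standardized repeat codes). A 4-bp unit is used in the example only to keep it readable; for spa the unit length would be 24.

```python
def repeat_profile(region: str, unit_len: int) -> list:
    """Split an SSR region into fixed-length units and number the variants."""
    if unit_len <= 0 or len(region) % unit_len != 0:
        raise ValueError("region must be a whole number of repeat units")
    codes = {}        # unit sequence -> variant number (order of first appearance)
    profile = []
    for start in range(0, len(region), unit_len):
        unit = region[start:start + unit_len]
        profile.append(codes.setdefault(unit, len(codes) + 1))
    return profile


# Toy region: four 4-bp units, the third carrying a point mutation.
print(repeat_profile("AAAGAAAGAAACAAAG", 4))
```

Differences between isolates then show up either as a changed variant number (a point mutation within a unit) or as a change in the number of units (gain or loss of repeats).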
