Sequence Analysis

Sequence Analysis - Topics • Comparison of gene sequences for similarities and defining homologies from phylogenetic analysis • Identification of gene structure, including reading frames, exon-intron distribution and regulatory elements • Prediction of protein structural elements • Genome mapping (linear arrangement of genes on chromosomes and its assessment within the context of metabolic pathways)

Sequence Analysis • Helps understand evolution of life • Expresses Relationship between DNA sequences of different proteins and organisms • Facilitates the collection, storage, organization and annotation of raw data and construction of secondary and tertiary databases • Necessary to achieve the goal of bioinformatics

Goal of Bioinformatics • Organization of sequence databases with bibliographic and biological annotations • Support via software for the alignment of sequences • Identification of genes • Translation of DNA sequences into amino acid sequences • Search for homologs (evolutionarily related sequences)
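The translation goal listed above can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a forward-strand DNA string; the function name `translate` and the `frame` parameter are hypothetical names chosen for this example.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate DNA into one-letter amino acids; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only, starting at the chosen frame offset.
    for i in range(frame, len(dna) - 2, 3):
        # 'X' stands in for any codon containing a non-ACGT symbol.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

print(translate("ATGGCCTAA"))
```

Calling `translate` with `frame=1` or `frame=2` reads the same strand in the other two reading frames, which is the basis of reading-frame searches.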

History of Seq Analysis • Fifteen years ago – People read DNA or Amino acid Sequences over telephone • Caused an estimated mutation rate that far exceeded that of natural DNA replication or transcription process

Computational tools for Sequence Analysis • Extremely easy • Fast • Virtually error free

Database Submissions • Information is submitted to NCBI, EBI or DDBJ • GenBank staff scientists assign accession numbers for immediate release to the public • Daily exchanges between GenBank, EBI and DDBJ ensure information is non-redundant (submitted only once)

Database submissions • Authors can update original information • Specialized submission procedures include ESTs (expressed sequence tags), STSs (sequence tagged sites) and GSSs (genome survey sequences)

EST (Expressed sequence tags) • ESTs are short sequences of 300-500 bp and represent genes that are actually expressed. • These are markers that are helpful in locating (mapping) genes on chromosomes • EST submissions therefore include both sequence and mapping information

STSs (Sequence tagged sites) • Provide unique identifiers within a given genome, identifiable by PCR • Similar to ESTs in length and in number of submitted sequences per batch • STS sequences will soon outnumber ESTs because they also cover the non-coding regions of genomes

Processing of submissions • Submissions are processed on a daily basis, and sequences can be submitted before they are complete • The processing at NCBI includes 3 phases: 1) unfinished, unordered; 2) unfinished, ordered; 3) high-quality finished sequences with no gaps

Annotation • Annotation of sequences is important: it helps in predicting structures, drug discovery, establishing phylogenetic relationships etc. • Erroneous annotation results in erroneous interpretation and conclusions and reduces the reliability of data • NCBI’s staff continuously screen biomedical journals for published sequence and structure data and use it for annotation purposes

Data Retrieval • The volume of DNA and protein sequence data is enormous; searching it is dubbed “biological data mining”. • Sequences are retrieved based on specific criteria (similarity or identity between sequences)

Search Engines • Perform simple string searches for information retrieval of stored data (GenBank nucleotides and proteins; PubMed’s MEDLINE; and the 3-D structure, genome and taxonomy databases) • Perform similarity searches (e.g., BLAST) to retrieve, align and compare sequences or structures

Steps in Retrieval • First step includes retrieving sequences based on specific criteria (similarity or identity between sequences) • If no sequence is known or available, the NCBI search engine can be queried at the nucleotide or protein level by typing in a keyword: the name of the protein, the author or the accession number

Results of Data retrieval • The level of reported similarity indicates potential biological relationships across species and taxonomic divisions • The significance of each hit is reported as an E-value, the number of equally good hits expected by chance in a database of that size • E-values of zero or close to zero indicate hits that are unlikely to be random, while values of one or more indicate a hit that could easily have arisen by chance
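To make the E-value concrete, here is a minimal sketch of the Karlin-Altschul relation, E = K · m · n · exp(-λS), which underlies BLAST statistics. The function name `expected_hits` is hypothetical, and the values of `lam` and `k` below are illustrative placeholders only; the real parameters depend on the scoring matrix and gap penalties in use.

```python
import math

def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance hits with at least this raw score.

    Implements E = K * m * n * exp(-lambda * S), where m is the query
    length and n is the total database length.
    """
    return k * query_len * db_len * math.exp(-lam * score)

# Illustrative placeholder parameters (assumptions, not values for any
# particular BLAST configuration).
LAM, K = 0.3, 0.1

for raw_score in (20, 60, 100):
    e = expected_hits(raw_score, query_len=300, db_len=1_000_000, lam=LAM, k=K)
    print(raw_score, e)
```

The sketch shows two properties worth remembering: E falls exponentially as the score rises, and it grows with database size, so the same alignment is less significant when found in a larger database.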

Sequence Alignment • Pair-wise comparison of sequences • First step in assessing the property of a newly sequenced gene • Finding homologs in other organisms • Identifying new sequences as novel • BLAST 2 – Compare two sequences • ClustalW – Multiple sequence alignment
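A pair-wise comparison of the kind listed above can be sketched with the classic Needleman-Wunsch global alignment. This is a teaching sketch, not the heuristic BLAST itself uses; the function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment: returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

print(global_align("ACGT", "ACT"))
```

Dynamic programming guarantees an optimal alignment under the chosen scores, but its cost grows with the product of the two lengths, which is why database searches rely on faster heuristics.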

Results of Sequence Alignment • Several sequences can be submitted and different output settings can be selected • Identities from pair-wise alignments are shown • Sequence pairs are also listed in order from most to least identical • Phylogenetic trees (graphical description) are also included
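The identity figures and the most-to-least-identical ordering can be illustrated as follows. This sketch assumes the sequences are already aligned (equal length, `-` for gaps) and counts identity over columns where neither sequence has a gap; tools differ in which denominator they use, so this is one convention among several. The names `percent_identity` and `rank_pairs` are hypothetical.

```python
from itertools import combinations

def percent_identity(x, y):
    """Percent identity over alignment columns where neither sequence has a gap."""
    pairs = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not pairs:
        return 0.0
    matches = sum(p == q for p, q in pairs)
    return 100.0 * matches / len(pairs)

def rank_pairs(aligned):
    """Return (identity, name1, name2) tuples, most identical pair first."""
    scored = [
        (percent_identity(aligned[m], aligned[n]), m, n)
        for m, n in combinations(sorted(aligned), 2)
    ]
    return sorted(scored, key=lambda t: -t[0])

# Toy alignment used purely for illustration.
alignment = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "AGGT-C"}
for identity, m, n in rank_pairs(alignment):
    print(f"{m} vs {n}: {identity:.1f}%")
```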

What Sequence Reveals • The Biological function of a Gene • Related sequences in database • Structure prediction / comparison with X-ray structure • ORF (open reading frame) if function is unknown • Domain structure

What Sequence Reveals • Transmembrane segments • Signal sequence • Alternate nomenclature • Genetic information – regulatory sequences • Translation • 2-D gels, pI (charge), molecular weight • Bibliography

Identification of Gene • Software identifies ORFs (open reading frames) or URFs (unidentified reading frames) • Searches for long stretches of sequence between a start and a stop codon • The length of the ORF is directly related to the size or molecular weight of the coded protein • The comparison of the similarity of two or more sequences is a good indicator of the biological function of a gene
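The ORF search described above can be sketched as a scan for ATG-to-stop stretches in each of the three forward reading frames. This is a simplified illustration: it ignores the reverse strand and introns, and the mass estimate uses the common rule of thumb of roughly 110 Da per amino acid. The names `find_orfs`, `approx_protein_mass_kda` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `end` is the index just past the stop codon; `min_codons` counts coding
    codons, including the ATG but excluding the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

def approx_protein_mass_kda(start, end):
    """Rough protein mass from ORF length, assuming ~110 Da per residue."""
    residues = (end - start) // 3 - 1          # drop the stop codon
    return residues * 110 / 1000

for start, end, frame in find_orfs("CCATGAAATTTGGGTAACC"):
    print(start, end, frame, approx_protein_mass_kda(start, end))
```

A fuller scan would also run the same search on the reverse complement, giving six reading frames in total.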

Redundancy • Scientists work independently, which results in repetitive naming of identical genes and proteins • Similar to having one name listed as 3 entries in a telephone book: under first, middle and last name • Redundancy is useful as an unintentional quality control

Human Genome Project • The ultimate physical map of the human genome is the complete DNA sequence: the determination of all base pairs on each chromosome. The completed map will provide biologists with a Rosetta stone for studying human biology and enable medical researchers to begin to unravel the mechanisms of inherited diseases. • A major focus of the Human Genome Project is the development of automated sequencing technology that can accurately sequence 100,000 or more bases per day at a cost of less than $0.50 per base. Specific goals include the development of sequencing and detection schemes that are faster and more sensitive, accurate, and economical.

Human Genome Project • Second-generation (interim) sequencing technologies will enable speed and accuracy to increase by an order of magnitude (i.e., 10 times greater) while lowering the cost per base. Some important disease genes will be sequenced with such technologies as • (1) high-voltage capillary and ultra thin electrophoresis to increase fragment separation rate and • (2) use of resonance ionization spectroscopy to detect stable isotope labels.

Human Genome Project • Third-generation gel-less sequencing technologies, which aim to increase efficiency by several orders of magnitude, are expected to be used for sequencing most of the human genome. These developing technologies include • (1) enhanced fluorescence detection of individual labeled bases in flow cytometry, • (2) direct reading of the base sequence on a DNA strand with the use of scanning tunneling or atomic force microscopies, • (3) enhanced mass spectrometric analysis of DNA sequence, and • (4) sequencing by hybridization to short panels of nucleotides of known sequence.
