
Sequence Analysis


fritz

Presentation Transcript


  1. Sequence Analysis

  2. Sequence Analysis - Topics • Comparison of gene sequences for similarities and defining homologies from phylogenetic analysis • Identification of gene structure, including reading frames, exon-intron distribution and regulatory elements • Prediction of protein structural elements • Genome mapping (linear arrangement of genes on chromosomes) and its assessment within the context of metabolic pathways

  3. Sequence Analysis • Helps understand the evolution of life • Expresses relationships between the DNA sequences of different proteins and organisms • Facilitates the collection, storage, organization and annotation of raw data and the construction of secondary and tertiary databases • Necessary to achieve the goal of bioinformatics

  4. Goal of Bioinformatics • Organization of sequence databases with bibliographic and biological annotations • Support via software for the alignment of sequences • Identification of genes • Translation of DNA sequences into amino acid sequences • Search for homologs (evolutionarily related sequences)
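The "translation of DNA sequences into amino acid sequences" goal can be illustrated with a minimal sketch. It assumes the standard genetic code and a plain DNA string; the names `CODON_TABLE` and `translate` are hypothetical and not taken from the slides or from any particular software package.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, starting at offset `frame`.

    Stop codons are written as '*'; codons containing characters other
    than A, C, G, T are written as 'X'. Trailing bases that do not fill
    a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Calling `translate` with `frame` set to 0, 1 and 2 gives the three forward reading frames; the reverse strand would need a reverse-complement step that this sketch leaves out.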

  5. History of Sequence Analysis • Fifteen years ago, people read DNA or amino acid sequences to each other over the telephone • This caused an estimated "mutation" rate that far exceeded that of natural DNA replication or transcription

  6. Computational tools for Sequence Analysis • Extremely easy • Fast • Virtually error free

  7. Database Submissions • Information is submitted to NCBI, EBI or DDBJ • GenBank staff scientists assign accession numbers for immediate release to the public • Daily exchanges between GenBank, EBI and DDBJ ensure information is non-redundant (submitted only once)

  8. Database Submissions • Authors can update the original information • Specialized submission procedures include ESTs (expressed sequence tags), STSs (sequence-tagged sites) and GSSs (genome survey sequences)

  9. ESTs (Expressed Sequence Tags) • ESTs are short sequences of 300-500 bp that represent genes that are actually expressed • They are markers that help in locating (mapping) genes on chromosomes • EST submissions therefore include both sequence and mapping information

  10. STSs (Sequence-Tagged Sites) • Provide unique identifiers within a given genome that are identifiable by PCR • Similar to ESTs in length and in the number of submitted sequences per batch • STS sequences will soon outnumber ESTs because they also cover the non-coding regions of genomes

  11. Processing of Submissions • Submissions are processed on a daily basis, and sequences can be submitted before they are completed • Processing at NCBI includes 3 phases: 1) unfinished, unordered; 2) unfinished, ordered; 3) high-quality finished sequences with no gaps

  12. Annotation • Annotation of sequences is important: it helps in predicting structures, drug discovery, establishing phylogenetic relationships, etc. • Erroneous annotation results in erroneous interpretation and conclusions and reduces the reliability of the data • NCBI’s staff continuously screen biomedical journals for published sequence and structure data and use it for annotation purposes

  13. Data Retrieval • The volume of DNA and protein sequence data is enormous; searching it is dubbed “biological data mining” • Sequences are retrieved based on specific criteria (similarity or identity between sequences)

  14. Search Engines • Perform simple string searches for information retrieval of stored data (GenBank: nucleotides and proteins; and PubMed’s MEDLINE: 3-D structures, genomes and taxonomy databases) • Perform similarity searches (e.g., BLAST) to retrieve, align and compare sequences or structures

  15. Steps in Retrieval • The first step is retrieving sequences based on specific criteria (similarity or identity between sequences) • If no sequence is known or available, NCBI’s search engine can be queried at the nucleotide or protein level by typing in a keyword: the name of the protein, the author or the accession number

  16. Results of Data Retrieval • The level of reported similarity indicates potential biological relationships across species and taxonomic divisions • Each hit is reported with an E-value: the number of hits with an equal or better score that would be expected by chance in a database of that size • Values at or close to zero are unlikely to be random hits, while values of one or more indicate that the hit could easily have arisen by chance
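To make the meaning of an E-value more concrete, the sketch below uses the Karlin-Altschul relation E = K · m · n · exp(−λ · S), where S is the raw alignment score, m the query length and n the database length. The result is an expected count of chance hits; the probability of seeing at least one such hit, 1 − exp(−E), is the quantity that lies between zero and one. The default `k` and `lam` values are illustrative assumptions only (real values depend on the scoring matrix and gap penalties), and the function names are hypothetical.

```python
import math


def expected_chance_hits(score, query_len, db_len, k=0.134, lam=0.318):
    """Karlin-Altschul E-value: expected number of chance hits scoring >= score.

    k and lam are ASSUMED illustrative parameters, not values taken
    from the slides; search tools derive them from the scoring system.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def probability_of_chance_hit(e_value):
    """Probability of at least one chance hit, given an E-value (0 to 1)."""
    return -math.expm1(-e_value)


# Hypothetical search: a 300-residue query against 10 million database residues.
for raw_score in (40, 80, 120):
    e = expected_chance_hits(raw_score, 300, 10_000_000)
    print(raw_score, e, probability_of_chance_hit(e))
```

Higher scores drive the E-value toward zero, which is why hits with very small E-values are the ones treated as biologically meaningful.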

  17. Sequence Alignment • Pair-wise comparison of sequences • First step in assessing the property of a newly sequenced gene • Finding homologs in other organisms • Identifying new sequences as novel • BLAST 2 – Compare two sequences • ClustalW – Multiple sequence alignment
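A pair-wise comparison can be sketched with a small global-alignment scorer in the style of Needleman-Wunsch dynamic programming. This is not how BLAST 2 or ClustalW work internally; it is a minimal illustration, and the scoring values (match +1, mismatch −1, gap −1) and the function name are assumptions chosen for readability.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of strings a and b.

    Needleman-Wunsch style dynamic programming with a linear gap
    penalty, keeping only the previous row of the score matrix.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

The sketch returns only the score; recovering the alignment itself would require keeping the full matrix and tracing back through it.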

  18. Results of Sequence Alignment • Several sequences can be submitted and different output settings can be selected • Identities from pairwise alignments are shown • Sequence pairs are also listed in order from most identical to least identical • Phylogenetic trees (graphical description) are also included
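The ranking of sequence pairs by identity can be sketched as follows, assuming the sequences have already been aligned to equal length with `-` as the gap character. Definitions of percent identity differ between tools; this sketch divides identical positions by the number of columns where at least one sequence has a residue, which is an assumption. The function names and the example data are hypothetical.

```python
from itertools import combinations


def percent_identity(x, y):
    """Percent identity of two equal-length aligned sequences ('-' = gap)."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both sequences have a gap.
    columns = [(p, q) for p, q in zip(x, y) if not (p == "-" and q == "-")]
    if not columns:
        return 0.0
    identical = sum(1 for p, q in columns if p == q)
    return 100.0 * identical / len(columns)


def rank_pairs(aligned):
    """Return ((name1, name2), identity) tuples, most identical pair first."""
    scored = [
        ((n1, n2), percent_identity(aligned[n1], aligned[n2]))
        for n1, n2 in combinations(aligned, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


example = {"a": "ACGT", "b": "ACGA", "c": "TTGA"}  # made-up aligned sequences
for pair, identity in rank_pairs(example):
    print(pair, identity)
```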

  19. What Sequence Reveals • The Biological function of a Gene • Related sequences in database • Structure prediction / comparison with X-ray structure • ORF (open reading frame) if function is unknown • Domain structure

  20. What Sequence Reveals • Transmembrane segments • Signal sequence • Alternate nomenclature • Genetic information – regulatory sequences • Translation • 2-D gels, pI (charge), molecular weight • Bibliography

  21. Identification of Genes • Software identifies ORFs (open reading frames) or URFs (unidentified reading frames) • It searches for long stretches of sequence between a start and a stop codon • The length of the ORF is directly related to the size or molecular weight of the encoded protein • Comparing the similarity of two or more sequences is a good indicator of the biological function of a gene
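The search for stretches between a start and a stop codon can be sketched as a scan of the three forward reading frames. This is a minimal illustration under stated assumptions: ATG as the only start codon, the standard stop codons, no reverse strand and no introns. The mass estimate uses the common rule of thumb of roughly 110 Da per amino acid residue, which is an approximation. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
AVERAGE_RESIDUE_MASS_DA = 110  # rough rule-of-thumb average, an assumption


def find_orfs(dna, min_codons=1):
    """Find ATG...stop stretches in the three forward reading frames.

    Returns (start, end, frame) tuples; `end` is exclusive and includes
    the stop codon. `min_codons` counts coding codons, excluding the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def approximate_protein_mass_da(orf):
    """Estimate protein mass from ORF length (coding codons x ~110 Da)."""
    start, end, _frame = orf
    coding_codons = (end - start) // 3 - 1  # drop the stop codon
    return coding_codons * AVERAGE_RESIDUE_MASS_DA


for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf, approximate_protein_mass_da(orf))
```

Raising `min_codons` filters out short chance ORFs, which matters because short ATG-to-stop stretches occur frequently in random sequence.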

  22. Redundancy • Scientists work independently, which results in repetitive naming of identical genes and proteins • This is similar to having one person listed as 3 entries in a telephone book: under the first, middle and last name • Redundancy is useful as an unintentional quality control

  23. Human Genome Project • The ultimate physical map of the human genome is the complete DNA sequence, that is, the determination of all base pairs on each chromosome. The completed map will provide biologists with a Rosetta stone for studying human biology and enable medical researchers to begin to unravel the mechanisms of inherited diseases. • A major focus of the Human Genome Project is the development of automated sequencing technology that can accurately sequence 100,000 or more bases per day at a cost of less than $0.50 per base. Specific goals include the development of sequencing and detection schemes that are faster and more sensitive, accurate, and economical.

  24. Human Genome Project • Second-generation (interim) sequencing technologies will enable speed and accuracy to increase by an order of magnitude (i.e., 10 times greater) while lowering the cost per base. Some important disease genes will be sequenced with such technologies as • (1) high-voltage capillary and ultrathin electrophoresis to increase the fragment separation rate and • (2) resonance ionization spectroscopy to detect stable isotope labels.

  25. Human Genome Project • Third-generation gel-less sequencing technologies, which aim to increase efficiency by several orders of magnitude, are expected to be used for sequencing most of the human genome. These developing technologies include • (1) enhanced fluorescence detection of individual labeled bases in flow cytometry, • (2) direct reading of the base sequence on a DNA strand with the use of scanning tunneling or atomic force microscopies, • (3) enhanced mass spectrometric analysis of DNA sequence, and • (4) sequencing by hybridization to short panels of nucleotides of known sequence.
