Understanding Bioinformatics: From DNA to Proteins and Multiple Sequence Alignments

BIOINFORMATICS REFERENCES Your name and the names of the people who have contributed to this presentation go here.The names and addresses of the associated institutions go here. OPTIONALLOGO HERE OPTIONALLOGO HERE BIOLOGICAL SEQUENCE Multiple-Sequence Alignments (MSAs) • Turning DNA into Proteins:The Genetic Code • DNA gets transcribed into RNA using nucleotide complimentarily. • RNA gets translated into proteins using the genetic code: • UCU UAU GCG UAA • SER-TYR-ALA-STOP • BIOINFORMATICS is about searching biological databases, comparing sequences, looking at protein structures, and (more generally) asking biological and biomedical questions with a computer. It is the computational branch of molecular biology. • ANALYZING PROTEIN SEQUENCE • Steak eating familiarizes you with protein • Proteins are found in both fish and vegetables • They are made up of the same basic building blocks known as AminoAcids – these are complex organic molecules, called carbon, hydrogen, oxygen, nitrogen, and sulfur atoms. • Proteins • Proteins are like small machines in the cell. Proteins carry out most of the work in a cell. • Proteins are synthesized from RNA sequences. Proteins are like small machines in the cell. • Proteins carry out most of the work in a cell. Proteins are synthesized from RNA sequences. • Amino Acids • Proteins are made of 20 amino acids. • Each amino acid is small molecule made up of fewer than 100 atoms. • The 20 amino acids have similar terminations; they can be chained to one another like Lego bricks. • Protein Sequences • Proteins are made of amino acids chained by peptide bonds. • Protein sequences are written from the N to the C-terminus. • Your average protein is 400 amino acids long. • The longest protein is 30,000 amino acids long. • Proteins have well-defined 3-dimensional structures. • Hydrophobic amino acids are in the protein’s core. • Hydrophilic amino acids are on the protein’s surface. • Protein Structures • Proteins have well-defined 3-dimensional structures. • Hydrophobic amino acids are in the protein’s core. • Hydrophilic amino acids are on the protein’s surface. • DNA: DeoxyriboNucleic Acid • Genomes and genes are made of DNA • DNA is the main support of heredity • DNA Sequences • DNA sequences are made of 4 nucleotides • Adenine A • Guanine G • Cytosine C • Thymine T • DNA Sequences can be very long • Human chromosomes contain hundreds of millions of nucleotides • Nucleotides • Nucleotides have similar terminations. • Nucleotides are meant to be chained like Lego bricks. • Nucleotides can interact with each other: • Adenine with thymine (A with T) • Guanine with cytosine (G with C) • A tiny bacterium can contain a genome of several million nucleotides • Double-strand DNA • DNA sequences always come in two strands. • The strands are complementary and opposite in orientation. • By convention, biologists write only the 5’ and 3’ strands. • Database-search programs search both strands automatically . • RNA: Ribonucleic Acid • RNA is a close relative of DNA • RNA has many functions • Provides coding for proteins • Helps synthesize proteins • Helps many basic processes in the cell • RNA is not very stable • RNA is synthesized and very often degraded • DNA, by contrast, is very stable • The RNA Sequence • RNA contains 4 nucleotides: • A, G, C, U • U is Uracil • RNA does not contain Thymine (T) • Uracil replaces Thymine in RNA • RNA is single-stranded • RNA Secondary Structures • RNA can make secondary structures • RNA can make 1 strand with itself as a secondary structure • Secondary structures are made of stems and loops Multiple alignments reveal common features between sequences Multiple alignments are useful for :- Comparing very different sequences, Making phylogenetic trees, Making structure predictions Multiple-sequence alignments are abbreviated as MSAs Making an MSA with M-Coffee Open www.tcoffee.org Click MCoffee::Regular Cut and paste your sequences Submit your MSA Making Sense of Your MSA Positions are marked: Completely conserved = asterisk ( * ) Highly conserved = colon (:) Conserved = period (.) Look for highly conserved blocks: The red box on this slide shows a highly conserved block. These blocks are often functionally important positions. PubMed/Medline • PubMed is a database containing all the recent scientific publications in biology • PubMed is free • You can search PubMed using any keyword you are interested in. • Openwww.ncbi.nlm.nih.gov/pubmed • Type your favorite keywords • Press Return or Enter • Click the Limits tab • Check the boxes you are interested in, such as • Review • English • AIDS • Restrict the search with fields • [AU] Author • [SO] Source (journal) • [TI] Title • [AD] Address • [MH] Keywords • The words will be searched only in the corresponding fields • Medline contains only papers published after 1965 • Use no more than 10 names for papers before 1995 typical Prokaryotic Genome PROKARYOTIC ORGANISMS - are organisms lacking a true nucleus. EUKSRYOTIC ORGANISMS - are organisms having a true nucleus. GENE – is defined as the contiguous genome segment encompassing all the nucleotide-sequence information necessary to bring about its successful expression – that is, the production of protein or RNA. The 3 most basic classes of living organism are the - PROKARYOTES – such as bacteria, ARCHAEA – these are bacteria-like organisms living in extreme conditions), and THE EUKARYOTES – going from microscopic yeast to humans, animals, and plants. FOR BIOINFORMATICS – Prokaryotes and Achaea are very much the same – with few exceptions. Prokaryotes vs. Eukaryotes • Prokaryotes • Genome=one large circular chromosome + a few small circular chromosomes (plasmides) • 0.5 to 8 Mb / chromosome • Genes in one piece • 70% of the genome is coding • 1 gene / Kb • Eukaryotes • Genome= many large linear chromosomes • 10 to 700 Mb / chromosome • Genes split • 5% of the genome is coding • 1 gene/ 100 Kb (Human) Typical Prokaryotic Protein - Coding Gene • The gene has an uninterrupted sequence • Prokaryotic mRNA contains • The Ribosome Binding Site (RBS) • The Open Reading Frame (ORF) in one piece • In operons, the RNA can contain several ORFs • Eukaryotes can be small (yeast) or big (whales) • Genomes are made of linear pieces of DNA called chromosomes • One chromosome: 10 to 700 Mb • The Human Genome • Contains 22+1 chromosomes • Is 3 Gb long • One gene every 100 Kb (human) • 5 % of the genome is coding for proteins Retrieving Protein Sequences in Swiss-Prot • Swiss-Prot is a database containing all the proteins with known functions • Swiss-Prot is available from the ExPAsy server atwww.expasy.ch/sprot/ • ExPASy: Expert Protein Analysis System • ExPASy contains many useful online tools • Each Swiss-Prot entry is dedicated to a protein • A Swiss-Prot entry summarizes everything that is known about a given protein • The entry contains functional information and links to other databases mentioning this protein • Looking for DNA Sequences • There are many types of DNA sequences • The most common are • Regulatory regions, often before genes • Untranslated regions, often around the genes • Protein-coding regions • Intergenic regions (between the genes) • All these sequences can be found in GenBank GenBank • Housed by the National Center for Biotechnologies (NCBI) • GenBank is the memory of biological science • Contains EVERY DNA sequence ever published • GenBank is the original information source for most biological databases • GenBank is more complicated to use than gene-centric databases Reading a Prokaryotic GenBank Entry • ACCESSION is the accession number • Unique to each entry • Permanent • LOCUS contains information on gene size • ORGANISM Defines the organism containing the gene • REFERENCE indicates who produced the sequence • FEATURES lists some functional features of the gene • GenBank entries can contain more than one gene Fetching a DNA Sequence at the NCBI Exploring the Human Genome with ENSEMBL • Navigate to www.ncbi.nlm.nih.gov/Genbank/ • Type in a keyword. • Press Return or Enter. You get a list of entries matching your keyword. • Point, click, and explore… • Accessible at www.ensembl.org • ENSEMBL is a database of eukaryotic genomes • Annotated entries • Wide range of examples: human, mouse, dog, and so on • ENSEMBL annotation is mostly automated • ENSEMBL contains tools to • Browse the complete genome • Search the complete genome with BLAST • Visualize the position of a gene • Visualize all experimental information on this gene (transcripts) • By pointing on a chromosome region you can zoom inside the chromosome • All genes are cross-indexed with databases so you can find all related experimental information

Understanding Bioinformatics: From DNA to Proteins and Multiple Sequence Alignments

Understanding Bioinformatics: From DNA to Proteins and Multiple Sequence Alignments

Presentation Transcript

REFERENCES

References

References

References

References

References

References

REFERENCES

References

References

References

References

References

References

References

References

References

References

References

References

References: