Explore the interdisciplinary field of bioinformatics where biology and computer science merge, with a focus on molecular biology fundamentals, protein structures, and nucleic acids. Learn key skills, tools, and techniques in this cutting-edge discipline.
CIS 667 Bioinformatics Cleveland State University Department of Computer and Information Science Fall 2003
What is Bioinformatics? • Field of science in which biology, computer science, and information technology merge to form a single discipline • Historically, the creation and maintenance of biological sequence databases was important • Biology is being transformed from a purely lab-based science to an information science as well
What is Bioinformatics? • Three important sub-disciplines • Development of new algorithms and statistical methods to analyze relationships among members of large data sets • Analysis and interpretation of various types of data (nucleotide and amino acid sequences, protein structures, etc.) • Development/implementation of tools for efficient access to and management of various types of data
Why now? • Recent advances in molecular biology and genomic technologies have led to explosive growth in the amount of biological information generated • This requires computerized databases to store/organize/index data and specialized tools to view and analyze data
What skills should a Bioinformatician have? • Deep background in some area of molecular biology • Understand the central dogma of molecular biology • Substantial experience with at least one or two major packages • Experience working in a command-line computing environment • Experience with both high-level and scripting languages
Others… • Molecular Evolution • Physical chemistry • Statistics and probability • Database design • Algorithm development • Molecular biology lab methods
What will we learn? • Central dogma of molecular biology + other necessary biology background • Working in a Unix command-line environment • Programming in Perl • Algorithms for molecular biology • Hands-on experience with bioinformatics tools
Molecular Biology • Primarily concerned with two basic molecules of all living things: • Proteins • Structural proteins are tissue building blocks while enzymes catalyze chemical reactions • Proteins are chains of amino acids
Example Amino Acid • [Figure: an alpha carbon (C) bonded to an amino group (H2N), a carboxy group (COOH), a hydrogen (H), and a side chain (CH3)]
Amino Acids • There are 20 naturally occurring amino acids • Amino acids can be identified by a 3-letter code (and sometimes by 1-letter code) • In a protein, amino acids are joined by peptide bonds (C from carboxy group binds to N from amino group) • A water molecule is liberated so we speak of residues in the chain
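The two naming schemes for residues can be related with a simple lookup table. The following is a minimal sketch in Python (the course itself uses Perl; Python is used here only for illustration). The table holds the standard three-letter and one-letter codes; the function name `to_one_letter` is a hypothetical helper, not part of any library.

```python
# Standard three-letter -> one-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def to_one_letter(residues):
    """Convert a list of three-letter residue codes to a one-letter string."""
    return "".join(THREE_TO_ONE[r] for r in residues)

# A short, made-up chain written from the amino end to the carboxy end.
print(to_one_letter(["Met", "Ala", "Gly", "Trp"]))  # MAGW
```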
Proteins • A typical protein contains about 300 residues • Chains have an amino group at one end and a carboxy group at the other, giving the chain an orientation (start to end) • The sequence of residues in the chain is called the protein’s primary structure
Proteins • Proteins fold in three dimensions, resulting in secondary, tertiary, and quaternary structures • The two most common secondary structures are the α-helix and the β-sheet
Secondary Structure • Only a small number of patterns are common • The patterns are formed by regular intramolecular hydrogen bonding
Proteins • The specific shape that a protein folds into determines its unique function • Different shapes mean the protein can bind to different molecules • Proteins are produced in a cell structure called a ribosome • Amino acids are added one after the other in the sequence coded by a messenger ribonucleic acid (mRNA) molecule
Ribosomes • [Figure: a ribosome, showing the large subunit and the small subunit]
Nucleic Acid • Two types of nucleic acids • Ribonucleic acid (RNA) • Deoxyribonucleic acid (DNA) • DNA, like protein, is a chain of simpler molecules, but double stranded • Each strand consists of a chain of nucleotides
Nucleic Acids • Each nucleotide consists of • A sugar molecule • A phosphate residue • A base • The sugar molecule has five carbon atoms labeled 1’ - 5’ • The 3’ carbon of one nucleotide is bound to the 5’ carbon of the next nucleotide in the chain giving an orientation to the chain • 5’ is the start and 3’ is the end
Nucleic Acids • The chain of sugar/phosphate groups forms the backbone of a strand of DNA • Attached to each 1’ carbon in the backbone is a molecule called a base • There are four different bases • Adenine (A) • Guanine (G) • Cytosine (C) • Thymine (T)
DNA • DNA molecules are double strands • The strands form a double helix • The strands are held in the helix form by bonds between complementary bases in the two strands • A and T are complements • G and C are complements • We refer to the paired bases as base pairs (bp) and use base pairs as the unit of length of DNA molecules
DNA • DNA can be considered as a string of letters from the set {A, T, C, G} • 5’ … TACTGAA … 3’ • The other strand, paired with this one, is antiparallel and complementary • 3’ … ATGACTT … 5’ • Note that the orientations of the two strands are opposite
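The pairing rules above translate directly into a string operation. A minimal sketch in Python (chosen for illustration; the function names are hypothetical): `complement` gives the partner strand read 3’ to 5’, and `reverse_complement` gives the same strand written in the conventional 5’ to 3’ direction.

```python
# Watson-Crick pairing: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Base-by-base complement, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Complementary strand written 5' to 3'."""
    return complement(strand)[::-1]

top = "TACTGAA"                 # 5' ... TACTGAA ... 3'
print(complement(top))          # ATGACTT (read 3' to 5')
print(reverse_complement(top))  # TTCAGTA (read 5' to 3')
```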
DNA • Given one of the strands, we can infer the other strand • One of the strands can act as a template for the construction of the other • This property allows for cell division and replication with each new cell containing a copy of the DNA from the original cell • Complementary base pairs are held together by hydrogen bonds
DNA • In higher organisms, DNA is found inside the cell nucleus • Also in cell organelles called mitochondria (plants and animals) and chloroplasts (plants only) • The DNA is found in a few very long DNA molecules called chromosomes
RNA • RNA molecules are similar to DNA, but • Have a different sugar • Have the base uracil (U) instead of thymine (T) • U binds with A, as does T • RNA does not form a double helix • Hybrid DNA-RNA helices may form • Parts of an RNA molecule may bind to other parts of the same molecule by complementarity • Three-dimensional structure is variable (compare Protein)
Central Dogma of Molecular Biology • Information stored in DNA is used to make a transient RNA • This process is called transcription and is carried out by the enzyme RNA polymerase • The RNA is used to make proteins • This process is called translation and is performed by ribosomes
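At the level of sequence strings, transcription can be sketched very simply. The Python snippet below is an illustration only and makes one simplifying assumption: the input is the coding (non-template) strand, so the RNA has the same sequence with U in place of T. The function name `transcribe` is hypothetical.

```python
def transcribe(coding_strand):
    """Return the RNA sequence for a DNA coding strand (T replaced by U).

    Assumption: the input is the coding strand written 5' to 3'. RNA
    polymerase reads the complementary template strand, so the RNA it
    produces matches the coding strand apart from U replacing T.
    """
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```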
Genes and the Genetic Code • Each protein in an organism is specified by a contiguous stretch of DNA called a gene • Remember that the DNA is contained in a small number of molecules called chromosomes • Not all of the DNA specifies some protein • Some genes code for RNA products
Gene Expression • Gene expression is the process of using the information stored in DNA to make an RNA molecule and then a protein • RNA polymerases must • determine the start of genes • determine whether the protein coded by a gene is needed at the present moment • The start of a gene is marked by a 13-nucleotide promoter sequence (why 13 and not, e.g., 1?)
Gene Expression • How does the RNA polymerase then tell if a protein should now be produced? • Specific regulatory genes produce proteins capable of binding to a cell’s DNA near the promoter sequence of a gene they control in some circumstances • Positive regulation when binding makes RNA polymerase initiation of transcription easier, negative regulation when harder
Genetic Code • A gene codes the sequence of amino acids needed to form a protein • There are 20 amino acids but only 4 bases, so more than one base is needed to specify an amino acid • 4² = 16 < 20, but 4³ = 64 > 20, so 3 bases suffice • Each sequence of 3 bases (a codon) codes for an amino acid (with 3 exceptions) • Three codons cause translation to end and are called stop codons
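The counting argument can be checked by enumerating every possible word of a given length over the four bases. A minimal Python sketch (illustrative only; `words` is a hypothetical helper name):

```python
from itertools import product

BASES = "ACGU"        # the RNA alphabet
AMINO_ACID_COUNT = 20

def words(length):
    """All base sequences of the given length."""
    return ["".join(p) for p in product(BASES, repeat=length)]

# Report how many distinct words each length gives and whether
# that is enough to give every amino acid its own word.
for length in (1, 2, 3):
    count = len(words(length))
    print(length, count, count >= AMINO_ACID_COUNT)
```

Lengths 1 and 2 give 4 and 16 words, too few for 20 amino acids; length 3 gives 64.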
Genetic Code • Since 64 > 20, more than one codon must code for some amino acid(s) • In fact, 18 of the 20 amino acids are coded for by more than one codon • The genetic code is therefore a degenerate code • Errors in transcription may not cause the wrong aa to be produced (especially if the error is in the 3rd nucleotide) • Even if the wrong aa is produced due to a single error, a similar aa is likely to be produced
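The degeneracy of the code can be verified by counting codons per amino acid. The sketch below, in Python for illustration, builds the standard genetic code from its usual compact form (bases in the order T, C, A, G, written with the DNA letter T rather than U) and counts how many codons map to each amino acid. The variable names are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in compact form: codons enumerated with the
# first base varying slowest, bases ordered T, C, A, G; '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

counts = Counter(CODON_TABLE.values())
stop_count = counts.pop("*")  # remove stops so only amino acids remain

single = sorted(aa for aa, n in counts.items() if n == 1)
multiple = sorted(aa for aa, n in counts.items() if n > 1)

print("stop codons:", stop_count)
print("amino acids with one codon:", single)
print("amino acids with several codons:", len(multiple))

# Third-position redundancy: all four GCx codons give alanine (A).
print({CODON_TABLE["GC" + b] for b in BASES})
```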
Open Reading Frames • One special start codon (AUG) marks the spot where translation begins • A sequence of codons is called a reading frame • A sequence of codons which begins with a start codon and has no stop codons is called an open reading frame (orf)
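The definition above suggests a straightforward scan. The following Python sketch is one possible reading of it, under stated assumptions: it looks only at the three forward reading frames of a single RNA string, starts an ORF at the first AUG it meets, and reports the ORF only if it is later terminated by a stop codon. The function name `find_orfs` and the example sequence are made up for illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
START_CODON = "AUG"

def find_orfs(rna):
    """Return ORFs (start codon up to, not including, the stop codon)."""
    orfs = []
    for frame in range(3):
        current = None  # codons of the ORF being read, or None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if current is None:
                if codon == START_CODON:
                    current = [codon]
            elif codon in STOP_CODONS:
                orfs.append("".join(current))
                current = None
            else:
                current.append(codon)
        # Assumption: an ORF still open at the end of the string is dropped.
    return orfs

print(find_orfs("GGAUGGCCUUUUAAGG"))  # ['AUGGCCUUU']
```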
Prokaryotes and Eukaryotes • Living organisms may be classified as either prokaryotes (bacteria) or eukaryotes (higher organisms like yeast, plants, people) • The cells of eukaryotes have a nucleus while those of prokaryotes don’t • DNA is linear in eukaryotes and circular in prokaryotes
Introns and Exons • In prokaryotes, the mRNA copies of the genes correspond directly to the DNA sequence in the genome (with U substituted for T) • In eukaryotes, the mRNA is carried outside the nucleus before translation • The mRNA is modified by splicing out sequences called introns and rejoining the exons that flank them
Introns and Exons • Splicing is controlled by enzyme complexes called spliceosomes • Incorrect splicing leads to frame shifts or premature stop codons which make the resulting protein useless • The position of introns is signalled by several specific sequences of nucleotides • Since there is more than one sequence we can have alternative splicing resulting in different proteins being produced in different circumstances.
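Splicing and the effect of a misplaced splice site can be illustrated on strings. This Python sketch assumes the intron positions are already known and given as (start, end) index pairs; the sequence and coordinates are invented for the example, and `splice` is a hypothetical helper. The example intron begins with GU and ends with AG, as most introns do.

```python
def splice(pre_mrna, introns):
    """Remove introns given as half-open (start, end) index pairs."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)

# Made-up transcript: exon 1 + intron + exon 2.
pre = "AUGGCC" + "GUAAGUCCCAG" + "UUUUAA"

print(splice(pre, [(6, 17)]))  # AUGGCCUUUUAA: codons AUG GCC UUU UAA
print(splice(pre, [(6, 16)]))  # AUGGCCGUUUUAA: one extra base shifts the frame
```

In the second call the splice site is off by one base, so every codon after the junction is read in the wrong frame and the original stop codon is no longer in frame.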
Molecular Biology Tools • A small set of laboratory techniques is used by molecular biologists to identify the information content of organisms so that it can be processed using bioinformatics methods
Restriction Enzyme Digests • Restriction enzymes can be used to cut DNA molecules wherever a particular sequence occurs • Digesting a DNA molecule and observing how many fragments result gives some insight into the organization and sequence of that DNA • This is called restriction mapping • It made it possible, for the first time, to isolate and experiment on individual genes
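A digest can be simulated on a sequence string. The Python sketch below is a simplification: it models one strand of a linear molecule and uses the enzyme EcoRI, which recognizes GAATTC and cuts after the G. The function name `digest`, its parameters, and the example sequence are hypothetical.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut a linear DNA string at every occurrence of a recognition site.

    cut_offset is the number of bases into the site at which the cut
    falls (1 for EcoRI, which cuts between G and AATTC).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])  # remainder after the last cut
    return fragments

dna = "AAGAATTCCCGGGAATTCTT"  # made-up sequence with two EcoRI sites
fragments = digest(dna)
print(fragments)                    # ['AAG', 'AATTCCCGGG', 'AATTCTT']
print([len(f) for f in fragments])  # [3, 10, 7]
```

Two sites in a linear molecule give three fragments; the fragment lengths are the kind of information a restriction map is built from.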
Gel Electrophoresis • We can separate the fragments of DNA obtained by restriction enzymes with gel electrophoresis • DNA fragments are pulled through a gel towards an electrical charge • Larger fragments do not move as quickly, so this provides a way to separate the fragments by size
Blotting and Hybridization • To study a single fragment, DNA is transferred from the gel to a piece of paper or cloth (blotting) • The DNA fragments are then permanently attached to the membrane using (e.g.) UV light • A specially prepared labeled fragment of DNA (a probe) is allowed to base pair with the fragments to try to find a specific fragment
Blotting and Hybridization • The probe is tagged using (e.g.) a fluorescent dye • Base pairing between the probe and a fragment is called hybridization • Then determine where on the membrane base pairing has occurred • DNA chip or microarray techniques are similar • Thousands of nucleotide sequences are affixed to portions of a small silica chip • A large number of probes are washed over the chip and a laser is used to find which probes bind to which sequences
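The idea of a probe finding its target by base pairing can be sketched as a string search. The Python example below assumes single-stranded fragments and exact complementarity (real hybridization can tolerate some mismatches, depending on conditions). A probe binds wherever a fragment contains the probe's reverse complement. All names and sequences are made up for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Complementary strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def hybridization_sites(fragment, probe):
    """Start positions in fragment where the probe can base pair exactly."""
    target = reverse_complement(probe)
    width = len(target)
    return [
        i for i in range(len(fragment) - width + 1)
        if fragment[i:i + width] == target
    ]

# Hypothetical fragments fixed to a membrane, and one labeled probe.
fragments = {"frag1": "TTGACCGTAA", "frag2": "GGGGCCCCAA"}
probe = "ACGGTC"

for name, sequence in fragments.items():
    print(name, hybridization_sites(sequence, probe))
# frag1 [2]
# frag2 []
```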