Introductory Biological Sequence Analysis Through Spreadsheets
Introductory Biological Sequence Analysis Through Spreadsheets Stephen J. Merrill Sandra E. Merrill Marquette University Milwaukee, WI ICTCM 2000
Teaching Mathematics to Students of Biology
• The math taught in these courses needs to match the math actually needed in the discipline
• The most important “math” needed is statistics
• The molecular biology revolution presents data in a form (sequences of letters) on which calculus has little impact
The Nature of Biological Sequence Data
• The primary structures of DNA, RNA, and proteins are sequences of letters: 4 letters for DNA (ATGC) and for RNA (AUGC), and 20 letters representing the amino acids that make up a protein
• Secondary and tertiary structure (bending, folding, and twisting) determines function; hints of it can be seen in the primary structure
Use of Spreadsheets in This Setting
• Commonly found in biological labs, where they are used for data acquisition, storage and organization, and data analysis
• Commonly present on student computers and in computer labs
• Unlike calculators, able to handle data sets typical of “real world” applications
• R.F. Murphy at CMU has developed a set of worksheets for sequence analysis
Meaningful Questions & Problems
1. Measuring the similarity between two strings (“alignment” or “homology”)
2. Finding instances of a pattern in a string
3. Describing the composition and properties of a string
4. Graphing the evolutionary process and constructing phylogenetic trees
Measuring the Similarity between Strings
• Given a gene, suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
• Simple homology assigns a “1” for agreement and a “0” for disagreement at each site, then sums over all sites
• Homology is the resulting score as a fraction of the highest possible score, expressed in %
Spreadsheet #1 Simple Homology
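The simple homology score can also be written out in a few lines of code. The following is a minimal Python sketch, not the spreadsheet from the presentation: the function name `simple_homology` and the example sequences are made up for illustration, and the two sequences are assumed to be already aligned and of equal length.

```python
def simple_homology(seq_a: str, seq_b: str) -> float:
    """Percent of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 for disagreement at each site, then sum.
    score = sum(1 if a == b else 0 for a, b in zip(seq_a, seq_b))
    # The highest possible score is one point per site.
    return 100.0 * score / len(seq_a)


# Made-up example sequences (not taken from the presentation).
print(simple_homology("ATGCATGC", "ATGAATGG"))  # 6 of 8 sites agree -> 75.0
```

In a spreadsheet the same idea is one comparison per column followed by a sum.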
Finding Instances of a Particular Pattern in a String
• Locating genes involves finding regions of a DNA sequence that contain patterns resembling those of known genes
• Identifying sites where a restriction enzyme can cleave the DNA; the sizes of the resulting fragments are also of interest
• Identifying regions of RNA that correspond to particular features (e.g. loops) and may be splice sites
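As an illustration of the restriction-site problem, here is a minimal Python sketch that finds every occurrence of a recognition pattern and reports the fragment sizes after cutting. The function names, the example sequence, and the choice of EcoRI (recognition site GAATTC, cut after the first G) are assumptions made for the example; the sequence is treated as linear DNA and only one strand is searched.

```python
def find_sites(seq: str, pattern: str) -> list[int]:
    """Return the 0-based start of every occurrence, overlaps included."""
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one letter later so overlapping matches are not missed.
        start = seq.find(pattern, start + 1)
    return positions


def fragment_sizes(seq: str, pattern: str, cut_offset: int) -> list[int]:
    """Sizes of the pieces left after cutting a linear sequence at each site.

    cut_offset is the number of letters into the pattern where the cut falls.
    """
    cuts = [p + cut_offset for p in find_sites(seq, pattern)]
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Made-up sequence containing two EcoRI sites (GAATTC, cut after the G).
dna = "AAGAATTCCCGGGAATTCTT"
print(find_sites(dna, "GAATTC"))         # [2, 12]
print(fragment_sizes(dna, "GAATTC", 1))  # [3, 10, 7]
```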
Describing the Composition and Properties of a String
• Counts or frequencies of particular letters, chosen for their properties (e.g. regions rich in G&C or A&T in DNA)
• Properties of proteins (e.g. charge or hydrophobicity) that depend on the nature and frequencies of their amino acids
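Letter counts of this kind are straightforward to compute. The sketch below, in Python, counts each letter and reports the G&C percentage of a DNA string; the function names and the example sequence are illustrative assumptions, not part of the presentation.

```python
from collections import Counter


def letter_counts(seq: str) -> Counter:
    """Count how many times each letter occurs in the sequence."""
    return Counter(seq)


def gc_percent(seq: str) -> float:
    """Percent of letters in a DNA sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    # Counter returns 0 for letters that never occur.
    return 100.0 * (counts["G"] + counts["C"]) / len(seq)


# Made-up example sequence.
dna = "ATGCGCAT"
print(letter_counts(dna)["G"])  # 2
print(gc_percent(dna))          # 50.0
```

Applying `gc_percent` to successive slices of a long sequence would show which regions are rich in G&C.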
Spreadsheet #2 Hydropathy Plot
Spreadsheet #2 (Cont.)
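A hydropathy plot is commonly built by averaging a per-residue hydrophobicity value over a sliding window along the protein. The transcript does not say which scale or window length the spreadsheet uses, so the Python sketch below assumes the Kyte-Doolittle scale and a caller-chosen window; the function name and the example peptide are also made up. It returns the values to be plotted rather than drawing the plot.

```python
# Kyte-Doolittle hydropathy values (assumed scale; higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein: str, window: int = 7) -> list[float]:
    """Average hydropathy over each window of consecutive residues."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    # One average per window position, sliding one residue at a time.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up peptide: a hydrophobic stretch followed by charged residues.
for position, value in enumerate(hydropathy_profile("ILVFAKRDE", window=3)):
    print(position, round(value, 2))
```

Plotting the returned values against window position gives the hydropathy plot; sustained positive stretches suggest hydrophobic regions.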
Graphing Evolution and Phylogenetic Trees
• Evolutionary distance between two DNA sequences is used to study how the sequences change over time (e.g. the evolution of HIV or the flu viruses)
• Trees are constructed to express the relationships among related sequences; distance in the tree is a monotone function of homology
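One simple distance that is a monotone (decreasing) function of homology is the fraction of sites at which two aligned sequences differ. The transcript does not specify the distance used, so this choice is an assumption, as are the function names and the example sequences in the Python sketch below. It computes only the pairwise distances that a tree-building method would start from; it does not build the tree.

```python
from itertools import combinations


def site_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of sites that differ, i.e. 1 minus the homology fraction."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_table(sequences: dict[str, str]) -> dict[tuple[str, str], float]:
    """Distance for every unordered pair of named sequences."""
    return {
        (name_a, name_b): site_distance(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sequences, 2)
    }


# Made-up aligned sequences.
table = distance_table({
    "s1": "ATGCATGC",
    "s2": "ATGCATGG",
    "s3": "TTGAATGG",
})
for pair, distance in table.items():
    print(pair, distance)
```

Here `s1` and `s2` are the closest pair, so a tree built from this table would join them first.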
Spreadsheet #3 Mutation & Evolution
Spreadsheet #3 (Cont.)
To study the evolution of a sequence, we randomly pick a site for mutation, then change its letter
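The mutation step described on this slide can be sketched in Python as follows. The slide says only that a site is picked at random and its letter changed; choosing the site uniformly and choosing the new letter uniformly among the other three DNA letters are assumptions of this sketch, as are the function names, the seed, and the starting sequence.

```python
import random


def mutate_once(seq: str, rng: random.Random, alphabet: str = "ATGC") -> str:
    """Pick one site at random and replace its letter with a different one."""
    site = rng.randrange(len(seq))
    # Exclude the current letter so every mutation really changes the site.
    choices = [letter for letter in alphabet if letter != seq[site]]
    return seq[:site] + rng.choice(choices) + seq[site + 1:]


def evolve(seq: str, n_mutations: int, seed=None) -> list[str]:
    """Return the starting sequence followed by each mutated descendant."""
    rng = random.Random(seed)
    history = [seq]
    for _ in range(n_mutations):
        seq = mutate_once(seq, rng)
        history.append(seq)
    return history


# Made-up starting sequence; a fixed seed makes the run repeatable.
for generation, sequence in enumerate(evolve("ATGCATGCAT", 5, seed=2000)):
    print(generation, sequence)
```

Comparing each descendant with the original, for example by simple homology, shows how similarity decays as mutations accumulate.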
Conclusion
• Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
• Spreadsheets also make it possible to use real-world data and present the computational tool in a meaningful context
• The importance of these topics to all educated individuals suggests that they be included in many liberal arts math courses