Kevin C. O'Kane Department of Computer Science The University of Northern Iowa

The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a Vector Space Approach

Kevin C. O'Kane, Department of Computer Science, The University of Northern Iowa, Cedar Falls, Iowa 50613

The area of natural language text indexing and retrieval has been studied since the mid-1950s. In text retrieval, the problem is to locate documents related to a natural language query. To this end, natural language text indexing programs have employed many techniques to identify the terms in a document most likely to be content descriptors, as opposed to terms that are poor content descriptors. By eliminating poor descriptors and pre-indexing documents by descriptors more likely to be good discriminators, the speed of selection and the precision of document relevance ranking can be improved. The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace in which documents and queries are represented as vectors.

Overview

In text retrieval, the problem is to locate documents related to a natural language query. Natural language text indexing programs identify terms in a document most likely to be content descriptors. The goal of these experiments is to apply text indexing techniques to genomic data bases.

Natural Language Indexing

Natural language text indexing and retrieval has been studied since the mid-1950s. In text retrieval, the problem is to locate documents related to a natural language query. Natural language text indexing programs employ techniques to identify terms in a document most likely to be content descriptors. By eliminating poor descriptors and pre-indexing documents by descriptors likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved. The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace of documents and queries.

Document Hyperspace

Hyperspace Queries

Clustering Objects by Feature

Cosine Similarity Coefficient
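The cosine coefficient is conventionally defined as the dot product of a document vector and a query vector divided by the product of their lengths, so that vectors pointing in the same direction score 1 and orthogonal vectors score 0. A minimal Python sketch of that standard definition follows; the function name and the choice to return 0 for an all-zero vector are assumptions made for illustration and are not taken from the presentation.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(doc) != len(query):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0.0 or norm_query == 0.0:
        return 0.0  # convention assumed here for an all-zero vector
    return dot / (norm_doc * norm_query)
```

Because the coefficient depends only on direction, a document vector and a scaled copy of it receive the same score against any query.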

Genomic Data Bases

* EMBL (http://www.embl.org)
* SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)
* PROSITE (http://www.expasy.org/prosite/)
* PIR (http://pir.georgetown.edu/home.shtml)
* NCBI/NLM GenBank (http://www.ncbi.nih.gov/)
* MGD: The Mouse Genome Database (http://www.informatics.jax.org/)
* OMIM - Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)

“nt” Sequence Data Base

NCBI “nt” data base: ~12 billion bytes in length comprising 2,584,440 sequences in FASTA format (Sept 2004).

Example sequence:

>gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone
CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATGCATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTCTGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACTGGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCGACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCCTAGCGCTACGGGCTGCTTAGCAACCGAATTC

GenBank

LOCUS       AAB2MCG1     289 bp    DNA     linear   PRI 23-AUG-2002
DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
ACCESSION   AF032092
VERSION     AF032092.1  GI:3265027
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      Aotus azarai (Azara's night monkey)
  ORGANISM  Aotus azarai
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
REFERENCE   1  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
            Cavanez,C.
  TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
  JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
  MEDLINE   98298008
  PUBMED    9634477
REFERENCE   2  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
            Fairchild Building Campus West Dr. Room D-100, Stanford, CA
            94305-5126, USA
FEATURES             Location/Qualifiers
     source          1..289
                     /organism="Aotus azarai"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:30591"
     sig_peptide     134..193
     exon            <134..200
                     /number=1
     intron          201..>289
                     /number=1
ORIGIN
        1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
       61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
      121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
      181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc

Sequence Matching

* Current access to sequence data bases is mainly by heuristic-assisted pattern matching on flat or nearly flat files using programs such as BLAST and FASTA.
* The underlying data bases are growing rapidly, with consequent deterioration of search times even on large, multiprocessor systems as current software tools reach their design limits.
* BLAST systems index data base sequences according to short code letter words (usually 3 letters for amino acid data bases and 11 for nucleotide data bases) and use scoring matrices.
* Queries are also decomposed into similar short code words.
* The data base is scanned, and sequences with words in common with the query are processed to extend the initial code word match.
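The decomposition of a sequence into short overlapping code words can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's implementation; the function name `words` and the upper-casing step are assumptions.

```python
def words(sequence, k=11):
    """Return every overlapping k-letter word in the sequence, in order."""
    sequence = sequence.upper()
    # A sequence of length L contains L - k + 1 overlapping words (none if L < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

With a word size of 11, a 289-letter query yields 289 - 11 + 1 = 279 words.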

Example BLAST Output

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value
emb|BX015832.1|CNS08KDO Single read from an extremity of a ...   918   0.0
emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...   902   0.0
emb|BX065445.1|CNS09MNT Single read from an extremity of a ...   894   0.0
emb|BX052703.1|CNS09CTV Single read from an extremity of a ...   894   0.0
emb|BX030708.1|CNS08VUW Single read from an extremity of a ...   894   0.0
emb|BX030663.1|CNS08VTN Single read from an extremity of a ...   894   0.0
.............................................................................

>emb|BX015832.1|CNS08KDO Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 3-PRIME end of clone FK0AAA23DA12
Length = 866
Score = 918 bits (463), Expect = 0.0
Identities = 535/559 (95%)
Strand = Plus / Plus

Query: 1   tctttactattattggggaatttcgaggaacatttgttccccttacaatgcatttctata 60
           ||||||||||||||||  |||||||| |||||||||||||||||||||||||||||||||
Sbjct: 1   tctttactattattggtcaatttcgatgaacatttgttccccttacaatgcatttctata 60

Query: 61  acctacacctggagtaggtggttccggttcagccacttcagtgggaggaacttccgtttc 120
           | |||||||||||||||||||||||||||||||||||||||||||||| |||||||||||
Sbjct: 61  aactacacctggagtaggtggttccggttcagccacttcagtgggaggcacttccgtttc 120

Developing a Vector Space Approach to Sequence Indexing

This work explores natural language indexing techniques applied to genomic data bases through:

* Weight-based indexing of k-tuples derived from the NCBI “nt” sequence data base.
* Text terms used in genomic sequence data banks and the literature.

Both applications are implemented for Linux and written in Mumps and MDH, a Mumps-related C++ toolkit that can index data sets of up to 256 terabytes using a B-tree-based multidimensional data model and that includes many retrieval and sequence matching functions.

Inverse Document Frequency Wgt.

The IDF weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread. Thus, words of broad context are weighted lower than words of narrow context. Words of low weight are hypothesized to be poor indexing terms, while words with high weights are hypothesized to be good indexing terms. The bulk of the words, as is the case in natural language text, reside in the middle range.

Natural Language Example

Word          Freq(i,j)  TotFreq  DocFreq   Wgt1    Wgt2   Wgt3       MCA

[1] Death of a cult. (Apple Computer needs to alter its strategy) (column)
apple              4       261      112    1.716   9.757    17    -1.1625
computer           4       706      358    2.028   5.109    10   -19.4405
mac                2       146       71    0.973   6.290     6    -0.0256
macintosh          4       210      107    2.038   9.940    20    -0.5855
strategy           2        79       67    1.696   6.406    11    -0.0592

[3] WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.
edit               2       111       77    1.387   6.128     8    -0.0961
frame              2         9        7    1.556  10.924    17     0.0131
import             2        29       19    1.310   8.927    12     0.0998
macintosh          3       210      107    1.529   7.705    12    -0.5855
macro              3        38       24    1.895  12.189    23     0.1075
outstand           1        10        9    0.900   5.711     5     0.0168
user               4       861      435    2.021   4.330     9   -26.8094
wordperfect        8        24        8    2.667  39.627   106     0.1747

[4] Radius Pivot for Built-In Video an Radius Color Pivot. (Hardware Review) (new Mac monitors)(includes related article on design of
built-in           3        35       29    2.486  11.621    29     0.0678
color              3        81       47    1.741  10.173    18     0.0809
mac                2       146       71    0.973   6.290     6    -0.0256
monitor            6        88       52    3.545  18.739    66     0.0946
resolution         2        50       32    1.280   7.884    10     0.0288
screen             2        92       62    1.348   6.561     9     0.0199
video              4       106       61    2.302  12.188    28     0.0187

Indexing Experiment

Sequences from the NCBI "nt" (non-redundant nucleotide) data base were used. The “nt” data base is approximately 12 billion bytes in length, comprising 2,584,440 sequences in FASTA format (Sept 2004). A “word” size of 11 was used throughout. A total of 4,194,299 distinct words were identified, slightly fewer than the theoretical maximum of 4^11 = 4,194,304.
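The theoretical maximum follows from the four-letter nucleotide alphabet: there are 4^11 = 4,194,304 possible 11-letter words. One way to give each word a compact numeric identity is to read it as a base-4 number. The Python sketch below illustrates this; the A/C/G/T digit assignment, the names, and the decision to skip words containing other letters are assumptions, not details taken from the presentation.

```python
WORD_SIZE = 11
MAX_WORDS = 4 ** WORD_SIZE  # 4,194,304 possible words over A, C, G, T

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed base-to-digit mapping


def word_to_index(word):
    """Read a word over A/C/G/T as a base-4 integer; None for other letters."""
    index = 0
    for letter in word.upper():
        code = CODES.get(letter)
        if code is None:
            return None  # ambiguity codes such as N are skipped in this sketch
        index = index * 4 + code
    return index
```

Under this numbering every 11-letter word maps to a distinct integer from 0 to MAX_WORDS - 1, so a word can serve directly as an array subscript.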

Calculating the IDF Weight

The overall frequencies of occurrence of all possible 11-character words from each sequence were determined, along with the number of sequences in which each unique word was found. A weight Wgt_i for each word i was calculated by taking the Log10 of the total number of sequences (N) divided by the number of sequences in which the word occurred (DocFreq_i), multiplying by 10, and truncating the result to an integer:

\[ \mathrm{Wgt}_i = \left\lfloor 10 \, \log_{10}\!\left( \frac{N}{\mathrm{DocFreq}_i} \right) \right\rfloor \]

In natural language indexing, this is referred to as the inverse document frequency (IDF) weight.
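As a hedged illustration of the weight formula, the calculation can be written as a small Python function. The function and parameter names are hypothetical, and the input check is an addition made for this sketch.

```python
import math


def idf_weight(total_sequences, doc_freq):
    """Wgt = int(10 * log10(N / DocFreq)), with the fraction truncated."""
    if doc_freq <= 0 or doc_freq > total_sequences:
        raise ValueError("doc_freq must be between 1 and total_sequences")
    return int(10 * math.log10(total_sequences / doc_freq))
```

A word found in every sequence gets weight 0, and the weight grows as the word becomes rarer; for example, with N = 1000 a word found in 3 sequences gets int(25.2...) = 25.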

File Sizes

* Initial file analysis produces about 110 intermediate files of about 440 million bytes each from the input data base (12 GB).
* out.table is a large (40 billion byte) word-sequence file.
* freq.bin (53 million bytes) contains the inverse document frequency weight for each word.
* index (76 million bytes) gives, for each word, the eight-byte offset of the word's entry in out.table.
* index and freq.bin are merged into ITABLE (112 million bytes), which contains for each word its weight, its offset, and a pointer to a list of aliases (not used with the “nt” data base).

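Data Base

Vector of M weights:

\[ W = ( w_1, w_2, w_3, \ldots, w_M ) \]

Word-sequence matrix:

\[ F = \begin{pmatrix}
f_{1,1} & f_{1,2} & f_{1,3} & \cdots & f_{1,N} \\
f_{2,1} & f_{2,2} & f_{2,3} & \cdots & f_{2,N} \\
f_{3,1} & f_{3,2} & f_{3,3} & \cdots & f_{3,N} \\
\vdots  & \vdots  & \vdots  &        & \vdots  \\
f_{M,1} & f_{M,2} & f_{M,3} & \cdots & f_{M,N}
\end{pmatrix} \]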

Number of Words at Each Weight

for i = 1 to 120
    z_i ← 0
    for j = 1 to M
        if w_j = i then z_i ← z_i + 1
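The same tally can be sketched in Python. Instead of scanning all M words once per weight, this version makes a single pass over the weight vector, which gives the same counts; the function name and the list layout with an unused slot 0 are choices made for this sketch.

```python
def words_per_weight(weights, max_weight=120):
    """Return z where z[i] is the number of words whose weight equals i."""
    z = [0] * (max_weight + 1)  # slot 0 is unused so that z[i] lines up with z_i
    for w in weights:
        if 1 <= w <= max_weight:  # weights outside 1..max_weight are not counted
            z[w] += 1
    return z
```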

Number of Words at Each IDF Wgt.

Sum of all Instances of Each Weight

for i = 1 to 120                // for each weight
    x_i ← 0
    for j = 1 to M              // for each word
        for k = 1 to N          // for each sequence
            if f_{j,k} = i then x_i ← x_i + 1

Number of Occurrences at Each IDF Level

Sequence Retrieval

For retrieval, a query sequence is read and decomposed into 11-character words. These words are reduced to a numeric equivalent, which is used as an index into the word-sequence table. If the weight of a word lies within a specified range, the entries in a master vector corresponding to the sequences in which the word occurs are incremented by the weight of the word. When all words have been processed, entries in the master sequence vector are normalized according to the length of the underlying sequence relative to the length of the query. Finally, the master sequence vector is sorted and the top-scoring entries are either printed directly or submitted to a Smith-Waterman alignment, sorted again, and then printed. Optionally, the Smith-Waterman alignments themselves can be printed, and the selected sequences can be extracted from the “nt” data base and stored in a separate output file for additional processing. FASTA post-processing is an option.
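The scoring pass described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mumps/MDH implementation: the dictionaries `index`, `weights`, and `seq_lengths` are hypothetical in-memory stand-ins for the word-sequence table and weight files, and the normalization rule is an assumption, since the exact rule is not specified here.

```python
from collections import defaultdict


def score_sequences(query, index, weights, seq_lengths, k=11, lo=1, hi=120):
    """Rank sequence ids by the summed weights of the query words they contain.

    index       -- hypothetical dict: word -> set of sequence ids containing it
    weights     -- hypothetical dict: word -> integer IDF weight
    seq_lengths -- hypothetical dict: sequence id -> length in letters
    """
    scores = defaultdict(float)  # the "master vector", one entry per sequence
    query = query.upper()
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        weight = weights.get(word)
        if weight is None or not lo <= weight <= hi:
            continue  # unknown word, or weight outside the selected range
        for seq_id in index.get(word, ()):
            scores[seq_id] += weight
    # Assumed normalization: damp sequences that are longer than the query.
    for seq_id in scores:
        scores[seq_id] *= min(1.0, len(query) / seq_lengths[seq_id])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Raising `lo` restricts scoring to rarer, higher-weight words, which is the effect a weight range such as 65-120 is meant to produce.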

Unweighted Result for 500 Random Queries

Result for 500 Random Queries Weight Range 65-120

Overall Results for 500 Random Queries

Index Scoring Results

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71
Query string has 289 letters
Searching ...

68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds
28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
.............................................................................................

Total fetch time used: 13
Total number of accessions found: 501
Total primary query word count: 279
Total alias count: 0
Total number of sequences searched: 2584440
Max Indx 1300023

Smith-Waterman Result Scoring

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71
Query string has 289 letters
top= >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
    ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::: :::
  1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
    :::::::::::::::::: ::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::
 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
    ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454
    :::::::::::::::::::::::::::::::::::::::::::::::::
241 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 289

score=566
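For readers unfamiliar with how a Smith-Waterman score arises, the following Python sketch computes the best local alignment score of two sequences by dynamic programming. The match, mismatch, and gap values are placeholder assumptions, and a simple linear gap penalty is used; the parameters behind the score of 566 reported above are not given in this presentation, so this sketch is not expected to reproduce it.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment here
                previous[j - 1] + step,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows of the matrix are kept because the score alone is needed; recovering the alignment itself, as printed above, would require keeping the full matrix for traceback.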

S-W Scores

566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds
495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds
495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
494 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
493 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
...

Total fetch time used: 26
Total number of accessions found: 31
Total primary query word count: 279
Total alias count: 0
Total number of sequences searched: 2584440
Max Indx 1300023

Larger Sequences

On larger query sequences (5,000 to 6,000 letters), the IDF method performed slightly better than BLAST. On 25 randomly generated sequences, the IDF method ranked the original sequence first 24 times and third once. BLAST, on the other hand, ranked the original sequence first 21 times, while the remaining 4 were ranked 2, 2, 3 and 4. The average time per query was 47.4 seconds for the IDF method and 122.8 seconds for BLAST.

The Next Step

Future work:

* Weighted term vectors.
* Other weighting schemes, such as the Modified Centroid Algorithm.
* Sequence-sequence and term-term correlations.
* Sequence clustering.

References

Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-10.
O'Kane, K.C.; and Lockner, M.J. (2004) Indexing genomic sequence libraries. Information Processing and Management 41:265-274.
O'Kane, K.C. (2004) The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, submitted.
Pearson, W.R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.
Salton, G. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York.
Smith, T.F.; and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.

Hierarchical Data Base

Bioinformatics

Sloan Report on Bioinformatics from June 2004.

Number of graduates:
* There were only 26 new PhDs produced...
* 102 masters degrees awarded...
* Only 17 Bachelor's degrees produced...
* The data is for January 2002 until March 2003.

"... in the next few years the number of graduates is expected to increase by two or three times."

Average program enrollment:
* 103 = Bachelors
* 435 = Masters
* 296 = PhD

B.S. in Bioinformatics at UNI

Mathematics: 800:060; 800:061; 800:064; 800:152; 800:164 (17 hours)
Computer Science: 810:061; 810:062; 810:065; 810:066; 810:080; 810:114; 810:115; 810:180 (24 hours)
Biology: 840:051; 840:052; 840:130; 840:140; 840:153 (19 hours)
Chemistry: 860:070* or both 860:044 and 860:048; 860:063 (9-12 hours)
Elective: One course from the following (3 hours)
    Computer Science: 810:143; 810:147; 810:153; 810:155; 810:161; 810:172; 810:181
Total: 73-75 hours

Courses

800:060. Calculus I. The derivatives and integrals of elementary functions and their applications.
800:061. Calculus II. Continuation of 800:060.
800:064. Elementary Probability and Statistics for Bioinformatics. Descriptive statistics, basic probability concepts, confidence intervals, hypothesis testing, correlation and regression, elementary concepts of survival analysis.
800:152. Introduction to Probability. Axioms of probability, sample spaces having equally likely outcomes, conditional probability and independence, random variables, expectation, moment generating functions, jointly distributed random variables, weak law of large numbers, central limit theorem.

800:164. Statistical Methods in Bioinformatics. Analysis of a DNA sequence, analysis of multiple DNA and protein sequences, BLAST.
810:061. Computer Science I. Introduction to computer programming in the context of a modern object-oriented programming language. Emphasis on good programming techniques, object-oriented design, and style through extensive practice in designing, coding, and debugging programs.
810:062. Computer Science II. Intermediate programming in an object-oriented environment. Topics include object-oriented design, implementation of classes and methods, dynamic polymorphism, frameworks, patterns, software reuse, limitations, exceptions, and threads.

810:065. Computing for Bioinformatics I. Intermediate programming with emphasis on bioinformatics. Includes file handling, memory management, multi-threading, B-trees, introduction to dynamic programming including Needleman-Wunsch and Smith-Waterman algorithms for optimal alignments, exploration of BLAST, FASTA and gapped alignment, substitution matrices.
810:066. Computing for Bioinformatics II. Advanced bioinformatics computing: Perl and CGI programming; data base facilities for bioinformatics; pattern matching with regular expressions; advanced dynamic programming: optimal versus local alignment, multiple alignments; data base mining tools, Entrez, SRS, BLAST, FASTA, CLUSTAL; graphical 3-D representation of proteins; phylogenetic trees.

810:080. Discrete Structures. Topics include propositional and first-order logic; proofs and inference; mathematical induction; sets, relations, and functions; and graphs, lattices, and Boolean algebra, all in the context of computer science.
810:114. Database Systems. Storage of, and access to, physical databases; data models, query languages, transaction processing, and recovery techniques; object-oriented and distributed database systems; and database design.
810:115. Information Storage and Retrieval. Natural language processing; analysis of textual material by statistical, syntactic, and logical methods; retrieval systems models, dictionary construction, query processing, file structures, content analysis; automatic retrieval systems and question-answering systems; and evaluation of retrieval effectiveness.

810:180. Undergraduate Research in Computer Science.
840:051. General Biology: Organismal Diversity. Study of organismic biology emphasizing evolutionary patterns and diversity of organisms and interdependency of structure and function in living systems.
840:052. General Biology: Cell Structure and Function. Study of cells, genetics, and DNA technology emphasizing the chemical basis of life and flow of information.
840:130. Molecular Biology of the Cell. Introduction to the molecular, biochemical, and cellular structure and function of cells, DNA structure and functions, and the translation of genetic information into functional structures of living cells. DNA replication, transcription of genes, and synthesis and processing of proteins will be emphasized.

840:140. Genetics. Analytical approach to classical, molecular, and population genetics.
840:153. Recombinant DNA Techniques. Study of techniques for manipulating and analyzing DNA, including genomic library construction, polymerase chain reaction, oligonucleotide synthesis, genomic analysis with computers, and DNA and RNA isolation.
860:070. General Chemistry I-II. Accelerated course for well-prepared students. Content similar to 860:044 and 860:048 but covered in one semester. Completion satisfies General Chemistry requirement of any chemistry major.

860:063. Applied Organic and Biochemistry. Basic concepts in organic chemistry and biochemistry, including nomenclature, functional groups, reactivity, and macromolecules.

Elective from:

810:143(g). Operating Systems. History and evolution of operating systems; process and processor management; primary and auxiliary storage management; performance evaluation, security, and distributed systems issues; and case studies of modern operating systems.

810:147. Networking. Network architectures and communication protocol standards. Topics include communication of digital data, data-link protocols, local-area networks, network-layer protocols, transport-layer protocols, applications, network security, and management.
810:153. Design and Analysis of Algorithms. Algorithm design techniques such as dynamic programming and greedy algorithms; complexity analysis of algorithms; efficient algorithms for classical problems; intractable problems and techniques for addressing them; and algorithms for parallel machines.
810:155. Translation of Programming Languages. Introduction to analysis of programming languages and construction of translators.

810:161. Artificial Intelligence. Models of intelligent behavior and problem solving; knowledge representation and search methods; learning; topics such as knowledge-based systems, language understanding, and vision; optional 1-hour lab in symbolic programming techniques: heuristic programming; symbolic representations and algorithms; and applications to search, parsing, and high-level problem-solving tasks.
810:172. Software Engineering. Study of software life cycle models and their phases: planning, requirements, specifications, design, implementation, testing, and maintenance. Emphasis on tools, documentation, and applications.
810:181. Theory of Computation. Topics include regular languages and grammars; finite state automata; context-free languages and grammars; language recognition and parsing; and Turing computability and undecidability.
