BioInformatics Databases

Special Topics BSC4933/5936: An Introduction to Bioinformatics. Florida State University, The Department of Biological Science, www.bio.fsu.edu

BioInformatics Databases Steven M. Thompson Florida State University School of Computational Science (SCS)

So many Databases ???? NCBI’s Entrez

But first, some of my definitions, with lots of overlap:
• Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.
• Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.
• Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.
• Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.
• Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

One way to think about the field — • The Reverse Biochemistry Analogy. • Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, now scientists can amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! • The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases & CPU power:

Year    Base pairs     Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418

Doubling time: ~ one year.
Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
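The doubling-time claim can be checked against the table itself. The sketch below is only an illustration: it assumes steady exponential growth between the first and last rows (1982 and 2003) and ignores the year-to-year variation in between. The function name is made up for this example.

```python
import math

# (year, base pairs) taken from the first and last rows of the GenBank table
FIRST = (1982, 680_338)
LAST = (2003, 36_553_368_485)


def doubling_time(first, last):
    """Average doubling time in years, assuming steady exponential growth."""
    (year0, count0), (year1, count1) = first, last
    doublings = math.log2(count1 / count0)  # how many times the data doubled
    return (year1 - year0) / doublings


if __name__ == "__main__":
    print(f"{doubling_time(FIRST, LAST):.2f} years per doubling")
```

Averaged over the whole 21-year span this comes out at roughly 1.3 years per doubling, in line with the "~ one year" rule of thumb; individual years (1999 to 2000, for instance) grew much faster.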

Database Growth (cont.): The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. As of December 2004, almost 240 complete genomes are publicly available for analysis, not counting all the virus and viroid genomes available. The International Human Genome Sequencing Consortium announced the completion of the "Working Draft" of the human genome in June 2000; independently, that same month, the private company Celera Genomics announced that it had completed the first “Assembly” of the human genome. Both groups published their articles in mid-February 2001, the consortium's in Nature and Celera's in Science.

Some neat stuff from the papers: We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:
• Traditional textbook estimates of the number of genes were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!
• The protein-coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping’ ‘selfish DNA,’ much of which may be involved in regulation and control.
• Some 100-200 genes were reported to have been transferred from an ancestral bacterial genome to an ancestral vertebrate genome! (More extensive analyses later showed this to be untrue; the pattern is due to gene loss rather than transfer.)

What are sequence databases?
• These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format. An ‘alphabet soup’ of three major database organizations around the world is responsible for maintaining most of this data. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:
• North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown University’s National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) & NRL_3D (Naval Research Lab sequences of known three-dimensional structure).
• Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database and the SWISS-PROT & TrEMBL amino acid sequence databases.
• Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).

A little history — • Developments that affect software and the end user — • The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequences. Databases have long-since outgrown a hardbound atlas. They have become huge and have evolved through many changes with many more yet to come. • Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change it throws a wrench in the works. NCBI’s ASN.1 format and its Entrez interface attempt to circumvent some of these frustrations. However, database format is much debated as many bioinformaticians argue for relational or object-oriented standards. Unfortunately, until all biologists and computer scientists worldwide agree on one standard and all software is (re)written to that standard, neither of which is likely to happen very quickly, format issues will remain probably the most confusing and troubling aspect of working with primary sequence data.

So what are these databases like? Just what are primary sequences? (Central Dogma: DNA —> RNA —> protein) Primary refers to one dimension — all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide. The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes. Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA translations in a DNA database! However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.
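To make the idea of one-letter symbols and ambiguity codes concrete, here is a small sketch using the standard IUPAC nucleotide codes. The function names are made up for this illustration, and the amino acid alphabet is left out to keep it short.

```python
from itertools import product

# IUPAC nucleotide codes: each symbol maps to the bases it can stand for.
IUPAC_NUCLEOTIDE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def is_valid_nucleotide(seq):
    """True if every symbol is an IUPAC nucleotide code (case-insensitive)."""
    return all(symbol in IUPAC_NUCLEOTIDE for symbol in seq.upper())


def expand(seq):
    """List every unambiguous sequence that an ambiguous sequence can match."""
    choices = (IUPAC_NUCLEOTIDE[symbol] for symbol in seq.upper())
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(is_valid_nucleotide("acgtn"))  # lower case is accepted
    print(expand("AY"))                  # Y stands for C or T
```

Note that `expand` grows exponentially with the number of ambiguous positions, so it is only sensible for short stretches such as primers or restriction sites.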

Content & Organization — Sequence database installations are commonly a complex ASCII/Binary mix, usually not relational or Object Oriented (but proprietary ones often are). They’ll contain several very long text files each containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions. Software is usually required to successfully interact with these databases and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise.
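The indexing role of those binary files can be sketched in a few lines. This is not how any particular installation does it; it assumes a flat file whose records start with a LOCUS or ID line and end with "//", and the two tiny records (AAA1, BBB2) are invented for the demonstration.

```python
import io


def build_index(handle):
    """Map each entry name to the byte offset where its record starts.

    Assumes a record begins with a line starting 'LOCUS' or 'ID ' and that
    the entry name is the second whitespace-separated token on that line.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith((b"LOCUS", b"ID ")):
            index[line.split()[1].decode()] = offset
    return index


def fetch(handle, index, name):
    """Jump straight to one record and read it up to its '//' terminator."""
    handle.seek(index[name])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines).decode()


# Two invented records standing in for a very long flat file.
flat = io.BytesIO(
    b"LOCUS       AAA1   10 bp\nORIGIN\n        1 acgtacgtac\n//\n"
    b"LOCUS       BBB2   4 bp\nORIGIN\n        1 ttga\n//\n"
)
idx = build_index(flat)
```

With the index in hand, `fetch(flat, idx, "BBB2")` seeks directly to the second record instead of scanning the file from the top, which is the whole point of keeping an index alongside the text.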

More organization stuff:

Nucleic acid DBs: GenBank/EMBL/DDBJ
  all taxonomic categories + HTCs, HTGs, & STSs
  “Tags”: ESTs, GSSs
Amino acid DBs: SWISS-PROT, TrEMBL, PIR (PIR1, PIR2, PIR3, PIR4), NRL_3D, GenPept

Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.

Parts and problems:
• All sequence databases contain these elements:
  • Name: LOCUS, ENTRY, and ID are all unique identifiers.
  • Definition: a brief, one-line, textual sequence description.
  • Accession number: a constant data identifier.
  • Source and taxonomy information.
  • Complete literature references.
  • Comments and keywords.
  • The all-important FEATURE table!
  • A summary or checksum line.
  • The sequence itself.
• But: each major database, as well as each major suite of software tools that you are likely to use, has its own distinct format requirements. This can be a huge problem and an enormous time sink, even with helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming familiar with some of the common formats is a big help. Look for key features of each type of entry:

GenBank and GenPept format:

LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//

Look for “LOCUS,” “FEATURES,” “ORIGIN,” the sequence itself, and then “//.”
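Those landmarks are enough to pull the basics out of a GenBank entry. The following is a minimal sketch, not a full parser: it ignores the FEATURES table and multi-line fields, and the sample text is an excerpt of the entry above containing only its first sequence line, so the parsed sequence is 60 bases rather than 1506.

```python
def parse_genbank(text):
    """Extract locus name, accession, definition, and sequence from one entry."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of ten bases.
            record["sequence"] += "".join(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            record["definition"] = line[len("DEFINITION"):].strip()
    return record


# Excerpt of the HSEF1AR entry (first sequence line only).
SAMPLE = """
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
//
"""

rec = parse_genbank(SAMPLE)
```

For real work an established library parser is the safer route; the point here is only that the keywords on the slide are what a program keys on.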

EMBL and SWISS-PROT format:

ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//

Look for “ID,” “FT,” “SQ,” the sequence, and then “//.”
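Because every EMBL or SWISS-PROT line starts with a two-letter code, a reader can simply group lines by that code. The sketch below does just that and nothing more; the helper names are invented for this example, and the sample is an excerpt of the EF11_HUMAN entry with only the first sequence line, so the parsed sequence is 60 residues rather than 462.

```python
from collections import defaultdict


def parse_swissprot(text):
    """Group annotation lines by two-letter code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):           # end-of-entry marker
            break
        if in_sequence:
            sequence.append(line.replace(" ", ""))
        elif line.startswith("SQ"):
            in_sequence = True              # sequence lines follow the SQ line
        elif len(line) > 2:
            code, _, value = line.partition(" ")
            fields[code].append(value.strip())
    return dict(fields), "".join(sequence)


def accessions(fields):
    """Split the AC lines into individual accession numbers."""
    joined = " ".join(fields.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


# Excerpt of the EF11_HUMAN entry (first sequence line only).
SAMPLE = """
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
//
"""

fields, seq = parse_swissprot(SAMPLE)
```

The first accession number in the AC line is the primary one, which is why `accessions(fields)[0]` is usually what you want to cite.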

PIR CODATA and NBRF formats:

PIR CODATA format:

ENTRY      EFHU1 #type complete iProClass View of EFHU1
TITLE      translation elongation factor eEF-1 alpha-1 chain - human
           (Annotation abridged here)
FEATURE
   1-223              #domain eEF-1 alpha domain I, GTP-binding #status
                      predicted #label EF1\
   8-156              #domain translation elongation factor Tu homology
                      #label ETU\
   14-21              #region nucleotide-binding motif A (P-loop)\
   153-156            #region GTP-binding NKXD motif\
   245-330            #domain eEF-1 alpha domain II, tRNA-binding
                      #status predicted #label EF2\
   332-462            #domain eEF-1 alpha domain III, tRNA-binding
                      #status predicted #label EF3\
   36,55,79,165,318   #modified_site N6,N6,N6-trimethyllysine (Lys)
                      #status predicted\
   301,374            #binding_site glycerylphosphorylethanolamine
                      (Glu) (covalent) #status predicted
SUMMARY    #length 462 #molecular_weight 50141
SEQUENCE
            5        10        15        20        25        30
  1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
 31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
 61 D K L K A E R E R …... Q K A Q K A K

NBRF protein format:

>P1;EFHU1
pir1:efhu1 => EFHU1
MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG
KGSFKYAWVL DKLKAERERG ITIDISLWKF ETSKYYVTII DAPGHRDFIK
NMITGTSQAD CAVLIVAAGV GEFEAGISKN GQTREHALLA YTLGVKQLIV
GVNKMDSTEP PYSQKRYEEI VKEVSTYIKK IGYNPDTVAF VPISGWNGDN
MLEPSANMPW FKGWKVTRKD GNASGTTLLE ALDCILPPTR PTDKPLRLPL
QDVYKIGGIG TVPVGRVETG VLKPGMVVTF APVNVTTEVK SVEMHHEALS
EALPGDNVGF NVKNVSVKDV RRGNVAGDSK NDPPMEAAGF TAQVIILNHP
GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
VTKSAQKAQK AK*
C;P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
C;N;Alternate names: translation elongation factor Tu
C;Species: Homo sapiens (man)
C;Date: 30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change 19-Jan-2001
C;Accession: B24977; A25409; A29946; A32863; I37339
C;R;Rao, T.R.; Slobin, L.I. . . .

For CODATA, look for “ENTRY” and “SEQUENCE” with numbers. For NBRF protein format, look for “>P1;” and the name, then the definition line, then the sequence, then the “C;” annotation lines.

Pearson FastA format:

>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK

Look for “>” and the name at the start of the definition line. Only one annotation line is allowed!

GCG single sequence format:

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif

EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308 ..

  1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK

Look for the “!!” sequence type, then annotation, then the sequence identifier name on the checksum line, then the sequence itself.
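The "Check:" value on the GCG header line is a simple position-weighted sum of the sequence characters. The sketch below follows the form of that checksum as it is commonly reimplemented in open-source toolkits; treat the details as an assumption, since it has not been run here against the full EFHU1 sequence to confirm the 5308 shown above.

```python
def gcg_checksum(seq):
    """Position-weighted checksum in the style of GCG 'Check:' values.

    Each character is upper-cased, multiplied by a weight that cycles
    from 1 to 57, and the total is reduced modulo 10000.
    """
    total = 0
    for position, symbol in enumerate(seq):
        weight = position % 57 + 1
        total += weight * ord(symbol.upper())
    return total % 10000


if __name__ == "__main__":
    # Case does not matter, but any change to the residues alters the value.
    print(gcg_checksum("MGKEKTHINI"), gcg_checksum("mgkekthini"))
```

A checksum like this only detects accidental corruption or editing of a sequence file; it is not a unique fingerprint, since the result has just four digits.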

GCG MSF & RSF format: the other GCG formats, but these hold more than one sequence at a time.

MSF:

!!AA_MULTIPLE_ALIGNMENT 1.0

 small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check:  111  Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//
……………

RSF (this is SeqLab’s native format):

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments …………….
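The "look for" hints on the format slides amount to a format-guessing rule that can be written down directly. The sketch below checks only the first non-blank line, using the landmarks named on the slides; the function name and the returned labels are made up for this example, and real files (GCG files without a "!!" line, NBRF entries of other types than P1) will need more care.

```python
def guess_format(text):
    """Guess a sequence file format from its first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip leading blank lines
        if line.startswith("!!"):
            return "GCG"
        if line.startswith("LOCUS"):
            return "GenBank/GenPept"
        if line.startswith("ID "):
            return "EMBL/SWISS-PROT"
        if line.startswith("ENTRY"):
            return "PIR CODATA"
        if line.startswith(">P1;"):       # must be tested before plain ">"
            return "NBRF"
        if line.startswith(">"):
            return "FastA"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    for first_line in (">EFHU1 PIR1 release 71.01", ">P1;EFHU1",
                       "LOCUS       HSEF1AR", "!!AA_SEQUENCE 1.0"):
        print(f"{first_line!r}: {guess_format(first_line)}")
```

The order of the tests matters: NBRF and FastA both begin with ">", so the more specific pattern has to come first.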

Specialized ‘sequence’-type DBs:
• Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .
• Databases that contain multiple sequence entries aligned, e.g. RDP and ALN.
• Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.
• Databases of species-specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.
• And on and on . . . . See Amos Bairoch’s excellent links page, http://us.expasy.org/alinks.html, and the wonderful Ensembl genome project at http://www.ensembl.org/ that tries to tie it all together.

What about other types of biological databases?
• Three-dimensional structure databases: the Protein Data Bank and the Rutgers Nucleic Acid Database.
• These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution are clearly indicated.
• Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.
• Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own. See Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/ .

Other types of Biological DB’s — • Still more; these can be considered ‘non-molecular’: • Genomic linkage mapping databases for most large genome projects (w/ pointers to sequences) — H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, . . . . • Reference Databases (also w/ pointers to sequences): e.g. • OMIM — Online Mendelian Inheritance in Man • PubMed/MedLine — over 11 million citations from more than 4 thousand bio/medical scientific journals. • Phylogenetic Tree Databases: e.g. the Tree of Life. • Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). • Population studies data — which strains, where, etc. • And then databases that many biocomputing people don’t even usually consider: • e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how do you access and manipulate all this data? Often on the Internet over the World Wide Web:

Site                         URL (Uniform Resource Locator)               Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/                 databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/              protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/                database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/             database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                     databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/               databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                        databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                     databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                        databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                     3D mol' structure database
Molecules to Go              http://molbio.info.nih.gov/cgi-bin/pdb/      3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                          The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/              various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                         esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                     HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html        overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/index.jsp             databases/analysis/software
PUMA2 at Argonne             http://compbio.mcs.anl.gov/puma2/cgi-bin/    metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                    nice bioinformatics links list

With a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS.
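Entrez can also be reached from a program rather than a browser. The sketch below assumes NCBI's E-utilities web service, which these slides do not mention, and only builds the request URL for the GenBank entry X03558 shown earlier; the helper names are invented for the example, and `fetch_record` needs a network connection, so it is defined but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities (an assumption: not named in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession, db="nucleotide", rettype="gb", retmode="text"):
    """Build an efetch URL asking for one record in GenBank flat-file format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


def fetch_record(accession, **options):
    """Download the record as text (requires network access)."""
    with urlopen(efetch_url(accession, **options)) as response:
        return response.read().decode()


if __name__ == "__main__":
    print(efetch_url("X03558"))
```

Scripted access inherits the same trade-offs as the browser: the data is always current, but you depend on the network and on the service's usage policies, which should be checked before sending many requests.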

Advantage: Can access the very latest updates. It’s fun and very fast. It can be very powerful and efficient, if you know what you’re doing.
Disadvantage: Can be very inefficient, if you don’t know what you’re doing. Also format hassles, and . . . very easy to get lost and/or distracted in cyberspace! Additionally, problems sometimes arise with the Net, like bad connections.
So what are some of the alternatives . . . ?
• Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,
• commercial software packages are available, e.g. Sequencher, MacVector, DNAsis, DNAStar, etc.,
• but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, server-based solutions — we’re talking UNIX server computers here. • Again public domain programs exist. But now a VERY cooperative systems manager needs to install, configure, and maintain the system. Therefore a commercial package, e.g. the Wisconsin Package, is often used to simplify matters. • One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere! • Within the GCG suite, LookUp is an SRS derivative used to find a sequence of interest from local GCG server databases. • Advantage: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to reformat — all GCG. • Disadvantage: DB’s only as new as administrator maintains them.

The Genetics Computer Group — • the Wisconsin Package for Sequence Analysis. • Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc. • The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results. • Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and, the output from many programs can be used as input for other programs. • Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

To answer the always perplexing GCG question — “What sequence(s)? . . . .” Specifying sequences, GCG style; in order of increasing power and complexity: • The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs) • The sequence is in a local GCG database in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression and they are case insensitive. • The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard — {*}. • Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, one can supply attribute information within list files to specify something special about the sequence.
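The four ways of naming sequences differ only in punctuation, so they can be told apart mechanically. This sketch is an illustration of the rules just listed, not GCG's own logic; the function name and the returned labels are invented, and the examples are the specifications used in the list file slide that follows.

```python
import re


def classify_spec(spec):
    """Classify a GCG-style sequence specification.

    Returns (kind, container, selector), where kind is one of
    'list', 'multiple', 'database', or 'file'.
    """
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list", spec[1:], None)            # @listfile
    braces = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if braces:
        # MSF or RSF file name followed by {sequence specification}
        return ("multiple", braces.group(1), braces.group(2))
    if ":" in spec:
        logical, _, name = spec.partition(":")
        # Logical names are case insensitive, so normalize to upper case.
        return ("database", logical.upper(), name)
    return ("file", spec, None)                    # plain single-sequence file


if __name__ == "__main__":
    for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                    "Ef1a-Tu.msf{*}", "@another.list"):
        print(example, classify_spec(example))
```

Checking for "@" and braces before the colon keeps a specification such as a list file or an MSF selector from being mistaken for a database reference.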

Logical terms for the Wisconsin Package:

Sequence databases, nucleic acids:

Logical name(s)          Contents
GENBANKPLUS, GBP         all of GenBank plus EST and GSS subdivisions
GENBANK, GB              all of GenBank except EST and GSS subdivisions
BACTERIAL, BA            GenBank bacterial subdivision
EST                      GenBank EST (Expressed Sequence Tags) subdivision
GSS                      GenBank GSS (Genome Survey Sequences) subdivision
HTC                      GenBank High Throughput cDNA
HTG                      GenBank High Throughput Genomic
INVERTEBRATE, IN         GenBank invertebrate subdivision
OTHERMAMM, OM            GenBank other mammalian subdivision
OTHERVERT, OV            GenBank other vertebrate subdivision
PATENT, PAT              GenBank patent subdivision
PHAGE, PH                GenBank phage subdivision
PLANT, PL                GenBank plant subdivision
PRIMATE, PR              GenBank primate subdivision
RODENT, RO               GenBank rodent subdivision
STS                      GenBank STS (sequence tagged sites) subdivision
SYNTHETIC, SY            GenBank synthetic subdivision
TAGS                     GenBank EST and GSS subdivisions
UNANNOTATED, UN          GenBank unannotated subdivision
VIRAL, VI                GenBank viral subdivision

Sequence databases, amino acids:

Logical name(s)          Contents
GENPEPT, GP              GenBank CDS translations
SWISSPROTPLUS, SWP       all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW            all of Swiss-Prot (fully annotated)
SPTREMBL, SPT            Swiss-Prot preliminary EMBL translations
PIR, P                   all of PIR Protein
PROTEIN, PIR1            PIR fully annotated subdivision
PIR2                     PIR preliminary subdivision
PIR3                     PIR unverified subdivision
PIR4                     PIR unencoded subdivision
NRL_3D, NRL              PDB 3D protein sequences

General data files:

Logical name             Contents
GENMOREDATA              path to GCG optional data files
GENRUNDATA               path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

The List File Format:

An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The ‘way’ SeqLab works!
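Reading a list file comes down to two steps: split documentation from data at the two periods, then read one sequence specification per line with any attributes after it. The sketch below assumes that the ".." divider ends a line and that attributes are written as name:value tokens, as in the example above; it is not GCG's own reader, and nested list files are returned as-is rather than followed.

```python
def read_list_file(text):
    """Return (documentation, entries) from the text of a GCG-style list file.

    Each entry is (specification, attributes), with attributes as a dict.
    """
    doc_lines, entries = [], []
    in_data = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_data:
            if stripped.endswith(".."):
                # The two periods close the documentation section.
                doc_lines.append(stripped[:-2].rstrip())
                in_data = True
            else:
                doc_lines.append(stripped)
            continue
        if not stripped:
            continue                                  # skip blank data lines
        spec, *tokens = stripped.split()
        attributes = dict(t.split(":", 1) for t in tokens if ":" in t)
        entries.append((spec, attributes))
    documentation = "\n".join(d for d in doc_lines if d)
    return documentation, entries


SAMPLE = """
An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data. ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""

documentation, entries = read_list_file(SAMPLE)
```

Here `entries[0]` carries the begin and end attributes that restrict `my-special.pep` to residues 24 through 134, while the last entry, `@another.list`, would itself have to be read the same way to expand the nested list.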

Conclusions — There’s a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner. A comprehensive sequence analysis software suite, such as the Wisconsin Package, expedites the chore, putting a large assortment of tools all under one organizational model with one user interface. FOR EVEN MORE INFO... Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.
