Rachel Adams, Jerry Choate, Nathan Harrelson, Divya Mistry, and Whitney Smith

BINF 4360, Fall 2007

Overview
• Goals
• Implementation
• Interface
• Images
• Final product
• Conclusions

Goals
• Create a dynamic map of the Shewanella oneidensis MR-1 genome
• Populate a local database with relevant information from web-based databases
• Provide an efficient search algorithm for key terms
• Implement user-friendly navigation and readability

Implementation
• SQL Schema
• Parsing
• Databases

Parsing
• XPath
  • XPath was used to quickly parse the XML documents generated from NCBI’s SOAP interface.

      use XML::XPath;

      my $xp = XML::XPath->new(filename => $file);
      # gets the gene name and the locus tag from each Gene-ref element
      foreach $var ($xp->find('//Gene-ref')->get_nodelist) {
          $name  = $var->find('Gene-ref_locus')->string_value;
          $locus = $var->find('Gene-ref_locus-tag')->string_value;
      }

• LWP::Simple
  • LWP::Simple was used to grab content from a URL so it could be easily written to an XML file.
• Regular Expressions
  • Regular expressions were used to parse HTML files, match specific string patterns, and manipulate text.
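The regular-expression step can be illustrated with a minimal sketch. It assumes, hypothetically, that the HTML being parsed contains image-map `<area>` tags; the slides do not show the actual pages or patterns the project used. The HTML fragment, the hrefs, and the helper names `attr` and `parse_areas` are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment; the real pages parsed by the project
# are not shown in the slides.
my $html = <<'HTML';
<map name="pathway">
<area shape="rect" coords="10,20,110,60" href="/gene/example_a" title="Example gene A">
<area shape="rect" coords="150,20,250,60" href="/gene/example_b" title="Example gene B">
</map>
HTML

# Pull one double-quoted attribute value out of a single tag's text.
sub attr {
    my ($tag, $name) = @_;
    return $tag =~ /\b\Q$name\E\s*=\s*"([^"]*)"/i ? $1 : undef;
}

# Collect coords, href, and title for every <area> tag in the text.
sub parse_areas {
    my ($text) = @_;
    my @areas;
    while ($text =~ /<area\b([^>]*)>/gi) {
        my $tag = $1;    # save the capture before attr() runs another match
        push @areas, {
            coords => attr($tag, 'coords'),
            href   => attr($tag, 'href'),
            title  => attr($tag, 'title'),
        };
    }
    return @areas;
}

my @areas = parse_areas($html);
printf "%s -> %s (%s)\n", $_->{title}, $_->{href}, $_->{coords} for @areas;
```

A pattern like this is only adequate for regular, machine-generated markup; attributes in single quotes or spanning unusual whitespace would need a real HTML parser.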

Schema
• Tables: imgplacement, pdb, ncbi_genes, area, kegg, img, ncbi_proteins
• Sequences: img_img_id_seq, area_area_id_seq
• Sequence columns (identical for both sequences): sequence_name name, last_value bigint, increment_by bigint, max_value bigint, min_value bigint, cache_value bigint, log_cnt bigint, is_cycled boolean, is_called boolean
• Table columns, in the order they appear in the diagram:
  • img_id integer, tilex integer, tiley integer
  • locus_tag text, date date, defintion text, description text, gene text
  • img_id integer, map varchar(5)
  • area_id integer, href text, title text, target text, coords text, img_id integer
  • id text, pdb text
  • id integer, name text, locus_tag text, month integer, day integer, year integer, location text, description text, function text, cog_id text, gi text, img_id text, gene_id text, kegg_id text

Databases
• NCBI
  • Local databases were populated using information retrieved from the web-based gene, protein, and 3D domain databases.
• COG
  • Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages.
• IMG
  • The goal of the Integrated Microbial Genomes (IMG) system is to facilitate the visualization and exploration of genomes from a functional and evolutionary perspective.
• KEGG
  • KEGG, the Kyoto Encyclopedia of Genes and Genomes, stores knowledge-based methods for uncovering higher-order systemic behaviors of the cell and the organism from genomic information.

More Databases
• MIST
  • The Microbial Signal Transduction database contains the signal transduction proteins for 591 complete bacterial and archaeal organisms.
• ORNL
  • The Genome Analysis and System Modeling Group of the Life Sciences Division of ORNL provides bioinformatics and analytic services and resources to collaborators, predicts prospective gene and protein models for analysis, and provides user services for the general community.
• PDB
  • The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease.
• ShewCyc
  • ShewCyc, part of BioCyc (a collection of 371 Pathway/Genome Databases), describes the genome and metabolic pathways of Shewanella oneidensis MR-1.

Interface
• Functions provided by the Google Maps API were used to display pathways of the Shewanella genome.
• A small overview map gives a bird’s-eye view of the entire image, with the current view indicated by a translucent box.
• The user can view the pathways at five zoom levels.
• Text balloons show information relevant to the user’s selected target.
• A search bar offers quick targeting of a user’s query of interest.
• The user can either pan over the images and click on areas of interest or enter a query in the search bar to find specific information.
• If the user submits a search term, relevant targets are indicated on the map with colored pins.
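The search-to-pins flow can be sketched roughly as follows. This is not the project's actual search code, which the slides do not show; it only illustrates a case-insensitive substring match over records that each carry a map position. The sample records, their coordinates, and the function name `search_targets` are made up.

```perl
use strict;
use warnings;

# Made-up sample records standing in for rows of the local database.
my @targets = (
    { title => 'glycogen synthase',      coords => '120,340' },
    { title => 'ATP synthase subunit a', coords => '900,75'  },
    { title => 'glycogen phosphorylase', coords => '410,515' },
);

# Return every record whose title contains the term, ignoring case.
sub search_targets {
    my ($term, @records) = @_;
    return grep { index(lc($_->{title}), lc($term)) >= 0 } @records;
}

# Each hit would become one pin on the map.
my @hits = search_targets('Glycogen', @targets);
print "pin at $_->{coords}: $_->{title}\n" for @hits;
```

In a database-backed version the same match would more likely be pushed into the SQL query than done in application code.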

Images
• ImageMagick is a free software suite for creating, editing, and composing bitmap images.
• The main functions we took advantage of were resizing, sharpening, padding, and stitching together images.
• We also created a composite image by combining 212 separate images.
• Placing the images within a 16384 by 16384 pixel canvas took strategic manipulation and tedious offset calculation.
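The kind of offset arithmetic involved can be sketched as below. The 16384-pixel canvas and the five zoom levels come from the slides; the 256-pixel tile size, the assumption that each zoom level halves the image edge, and the function names `edge_at_level` and `locate` are illustrative assumptions, not details confirmed by the project.

```perl
use strict;
use warnings;

my $TILE   = 256;      # assumed tile edge in pixels (not stated in the slides)
my $FULL   = 16384;    # full-resolution image edge, from the Images slide
my $LEVELS = 5;        # number of zoom levels, from the Interface slide

# Image edge in pixels at a level, where level 0 is the most zoomed-out
# and each deeper level doubles the edge (an assumed image pyramid).
sub edge_at_level {
    my ($level) = @_;
    return $FULL >> ($LEVELS - 1 - $level);
}

# Map a full-resolution pixel to (tile x, tile y, offset x, offset y)
# at the given level.
sub locate {
    my ($x, $y, $level) = @_;
    my $shift = $LEVELS - 1 - $level;
    my ($sx, $sy) = ($x >> $shift, $y >> $shift);    # scale down to this level
    return (int($sx / $TILE), int($sy / $TILE), $sx % $TILE, $sy % $TILE);
}

for my $level (0 .. $LEVELS - 1) {
    my $edge = edge_at_level($level);
    printf "level %d: %5d px per side, %2d x %2d tiles\n",
        $level, $edge, $edge / $TILE, $edge / $TILE;
}

my @where = locate(10000, 5000, 4);
print "pixel (10000, 5000) at level 4: tile ($where[0], $where[1]), ",
      "offset ($where[2], $where[3])\n";
```

Under these assumptions the deepest level is a 64 by 64 grid of tiles, which is why a single misplaced source image shifts many tiles at once.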

Final Product • Zoomed image

Final Product • Query for glycogen

Final Product • Query for ATP

Conclusions
• Using Google Maps, we were able to create a searchable map of pathways in the Shewanella genome.
• Efficient parsing methods made collecting and querying data far simpler.
• With more time, additional improvements could be implemented to increase the usability of this application.
• Currently we offer links to images, but it would be better to have thumbnails of the pictures themselves readily viewable.
• Google Web Toolkit has several functions that would make more information available to the user. Tabs on text balloons could separate data into topical subgroups. Overlaying a transparent map on top of the current map could be a useful tool for comparing two pathways.
• Additionally, the overall scope of the project would be enhanced by even more in-depth zoom levels, so that the user could see the actual amino acid and nucleotide sequences.
