This document outlines a database schema for managing gene and protein information, functional annotations, biological processes, cellular components, and molecular functions. Key tables include GeneIDTable for gene details, GeneFuncTable for gene functions, ProteinFuncTable for protein functions, and PathwayTable for gene-pathway associations. Steps to populate the database from NCBI resources are provided, guiding users in fetching and parsing genomic data to create a structured dataset. The schema supports biological research by linking genes to their functions and pathways.
Database Schema • GeneIDTable • Information about “gene” and corresponding “protein” • gene_id, gene_name, gene_seq, protein_id, protein_name, protein_seq, gene_type • gene_id – primary key (type varchar(255)) • gene_type type varchar(255) • All other entries are of type longtext
Database Schema • GeneFuncTable • Information about “gene functions” • gene_id, gene_fun, comment • gene_id – foreign key • All entries are of type longtext
Database Schema • ProteinFuncTable • Information about “protein functions” • protein_id, protein_fun, comment • All entries are of type longtext
Database Schema • PathwayFuncTable • Information about “pathway functions” • pathway_id, pathway_name, pathway_fun, pathway_loc, comment • All entries are of type longtext
Database Schema • PathwayTable • Information about “gene pathway association” • gene_id, pathway_id • gene_id type varchar(255) • pathway_id type longtext
Database Schema • BiologicalProcessTable • Gene Ontology related table • Information about “biological processes” of a particular gene • gene_id, GO_num, biological_process • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
Database Schema • CellularComponentTable • Gene Ontology related table • Information about “cellular component” • gene_id, GO_num, cellular_component • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
Database Schema • MolecularFunctionTable • Gene Ontology related table • Information about “molecular functions” • gene_id, GO_num, molecular_function • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
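The eight tables above can be sketched as DDL. This is a minimal sketch, not the course's actual setup: it runs the statements against an in-memory SQLite database through Python's `sqlite3` module so it can be tried without a server, whereas the `longtext` type suggests the original target was probably MySQL. Column names and types follow the slides. Two things are assumptions: `gene_id` in GeneFuncTable is given type varchar(255) so that it matches the key it references, and each foreign key is taken to reference GeneIDTable.gene_id.

```python
import sqlite3

# DDL transcribed from the slides. SQLite accepts the type names
# varchar(255) and longtext as written.
SCHEMA = """
CREATE TABLE GeneIDTable (
    gene_id      varchar(255) PRIMARY KEY,
    gene_name    longtext,
    gene_seq     longtext,
    protein_id   longtext,
    protein_name longtext,
    protein_seq  longtext,
    gene_type    varchar(255)
);
CREATE TABLE GeneFuncTable (
    -- assumption: varchar(255) so the type matches GeneIDTable.gene_id
    gene_id  varchar(255) REFERENCES GeneIDTable(gene_id),
    gene_fun longtext,
    comment  longtext
);
CREATE TABLE ProteinFuncTable (
    protein_id  longtext,
    protein_fun longtext,
    comment     longtext
);
CREATE TABLE PathwayFuncTable (
    pathway_id   longtext,
    pathway_name longtext,
    pathway_fun  longtext,
    pathway_loc  longtext,
    comment      longtext
);
CREATE TABLE PathwayTable (
    gene_id    varchar(255),
    pathway_id longtext
);
CREATE TABLE BiologicalProcessTable (
    gene_id            varchar(255) REFERENCES GeneIDTable(gene_id),
    GO_num             longtext,
    biological_process longtext
);
CREATE TABLE CellularComponentTable (
    gene_id            varchar(255) REFERENCES GeneIDTable(gene_id),
    GO_num             longtext,
    cellular_component longtext
);
CREATE TABLE MolecularFunctionTable (
    gene_id            varchar(255) REFERENCES GeneIDTable(gene_id),
    GO_num             longtext,
    molecular_function longtext
);
"""


def create_schema(conn):
    """Create all eight tables on the given connection."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

With foreign keys enabled, a row in one of the Gene Ontology tables is rejected unless its gene_id already exists in GeneIDTable, so GeneIDTable should be filled first.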
Steps to Follow – Step 1 • Get the RefSeq accession number of your species from the NCBI Genome database • e.g. NC_000913 for Escherichia coli K12
Steps to Follow – Step 2 • Download the required files from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov) • genomes/Bacteria/[species name]/[RefSeq #].gbk (main information for genes, proteins, and GO functions) • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.gbk • genomes/Bacteria/[species name]/[RefSeq #].ffn (gene sequences) • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.ffn
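The two download paths follow one pattern, which can be sketched as a small helper. The function name `genome_file_urls` is hypothetical, and the directory layout is taken from the slide as written; the layout on the server may differ today, so treat the result as a starting point to check rather than a guaranteed location.

```python
FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov"


def genome_file_urls(species_dir, refseq_accession):
    """Build the .gbk and .ffn URLs for one genome, following the slide's pattern."""
    base = f"{FTP_ROOT}/genomes/Bacteria/{species_dir}/{refseq_accession}"
    return {ext: f"{base}.{ext}" for ext in ("gbk", "ffn")}


urls = genome_file_urls("Escherichia_coli_k12", "NC_000913")
for ext, url in urls.items():
    print(ext, url)
```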
Steps to Follow – Step 3 • Go to the KEGG list of selected organisms (http://www.genome.jp/kegg/catalog/org_list.html) • Find your species and click its organism code in the second column (e.g. eco for E. coli) • Go to “pathway maps” to get the pathway information to put into the PathwayFuncTable
Steps to Follow – Step 4 • Use the NCBI Entrez E-utilities (eutils) to get the file that contains the gene pathway associations (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/) • Use esearch to search for your species in the gene database • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=database&term=query&usehistory=y • Use efetch to fetch the result file • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=database&WebEnv=WebEnvString&query_key=key
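The esearch-then-efetch sequence can be sketched as follows. With `usehistory=y`, the esearch response carries a `WebEnv` string and a `QueryKey`, which are then passed to efetch. The sketch only builds URLs and parses a response; it makes no network call. The helper names are hypothetical, and the XML string is a made-up stand-in for a real esearch response, trimmed to the fields used here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """URL for an esearch call that stores its result on the history server."""
    query = urlencode({"db": db, "term": term, "usehistory": "y"})
    return f"{EUTILS}/esearch.fcgi?{query}"


def history_tokens(esearch_xml):
    """Extract (WebEnv, QueryKey) from the XML text returned by esearch."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("WebEnv"), root.findtext("QueryKey")


def efetch_url(db, web_env, query_key):
    """URL for an efetch call that retrieves the stored search result."""
    query = urlencode({"db": db, "WebEnv": web_env, "query_key": query_key})
    return f"{EUTILS}/efetch.fcgi?{query}"


# Made-up response for illustration only; a real one has more fields.
sample_xml = (
    "<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>"
    "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>"
)

search = esearch_url("gene", "Escherichia coli K12[orgn]")
web_env, query_key = history_tokens(sample_xml)
fetch = efetch_url("gene", web_env, query_key)
print(search)
print(fetch)
```

Using `urlencode` keeps the spaces and square brackets in the search term from breaking the URL.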
Steps to Follow – Step 5 • Edit the .gbk file to remove its beginning and end parts • Parse the .gbk and .ffn files to fill all the tables except the PathwayFuncTable and the PathwayTable • Sample parser file: Parse.java
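For the .ffn half of this step, the file is FASTA-formatted: each record is a header line starting with `>` followed by one or more sequence lines. The following is a minimal sketch of reading such a file into a header-to-sequence mapping. It is not the logic of Parse.java, which is not shown here; the function name and the sample headers are hypothetical, and real .ffn headers are longer.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up records for illustration only.
sample_ffn = """>example_gene_1
ATGAAACGC
ATTAGC
>example_gene_2
ATGCGT
"""

genes = parse_fasta(sample_ffn)
for name, seq in genes.items():
    print(name, len(seq))
```

Sequence lines are joined without separators, so a gene that is wrapped over several lines in the file ends up as one string suitable for the gene_seq column.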
Steps to Follow – Step 6 • Parse the file returned by eutils to get the gene pathway associations • Sample parser file: ParsePath.java
Database Name Format • Example species Escherichia Coli K12 • Species name: Escherichia_Coli_K12 • Database name: escherichia_coli_k12
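The naming rule in this example can be sketched as two small functions: spaces become underscores for the species name, and the database name is the same string in lower case. The function names are hypothetical, and the rule is inferred from the single example on the slide.

```python
def species_name(display_name):
    """Join the words of a species name with underscores."""
    return "_".join(display_name.split())


def database_name(display_name):
    """Lower-case form of the species name, used as the database name."""
    return species_name(display_name).lower()


print(species_name("Escherichia Coli K12"))
print(database_name("Escherichia Coli K12"))
```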
Sample Output File • outputFile.txt (output file after parsing .gbk and .ffn files) • outputPath.txt (output file after parsing gene pathway association file) • PathwayFunc.txt (output file after analyzing KEGG pathways)
To Find the Number of Genes • Search for your species in the NCBI gene database • e.g. Escherichia coli K12 [orgn] • Compare the number of genes in your parsed output with the number of genes this search returns
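If the search is run through esearch rather than the website, the total appears in the `Count` element of the XML response, and the comparison can be sketched as below. The function names are hypothetical, and both the XML string and the gene IDs are made-up stand-ins for a real response and a real parsed output.

```python
import xml.etree.ElementTree as ET


def esearch_count(esearch_xml):
    """Read the total number of hits from an esearch XML response."""
    return int(ET.fromstring(esearch_xml).findtext("Count"))


def compare_gene_count(parsed_gene_ids, esearch_xml):
    """Return (matches, found, expected) for parsed genes vs. the search total."""
    expected = esearch_count(esearch_xml)
    found = len(set(parsed_gene_ids))  # count each gene_id once
    return found == expected, found, expected


# Made-up inputs for illustration only.
sample_xml = "<eSearchResult><Count>3</Count></eSearchResult>"
parsed_ids = ["gene_a", "gene_b", "gene_c"]

result = compare_gene_count(parsed_ids, sample_xml)
print(result)
```

A mismatch does not say which side is wrong; it only flags that the parsed output and the search result should be inspected.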
Submit your project (the 3 output files, plus the parsers if you changed them) to: • vgummulu@cise.ufl.edu • For any questions, contact: • yizhang@cise.ufl.edu • anupamd@ufl.edu