
Wrapper
What is a wrapper? There is no formal definition. I define it this way: a short script that calls an existing program (executable), parses the result(s), and then saves the final result in a file. Wrappers are especially associated with Perl because this kind of task is more awkward in many other languages.
Why do we need a wrapper? So that we do not have to reinvent the wheel.
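The three steps in this definition (call, parse, save) can be sketched as below. This is only a minimal illustration, not the script used in the lecture: the sub names run_program, parse_result and save_result, the "name<TAB>score" output format, the cutoff of 10, and the file name wrapper_demo.out are all assumptions, and Perl itself stands in for the external executable so that the sketch runs anywhere.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Step 1: run an external command and return its output lines.
# The list form of a pipe open does not pass the arguments through a shell.
sub run_program {
    my @cmd = @_;
    open(my $pipe, '-|', @cmd) or die "Cannot run $cmd[0]: $!\n";
    my @lines = <$pipe>;
    close($pipe) or die "$cmd[0] exited with status $?\n";
    chomp @lines;
    return @lines;
}

# Step 2: parse the result. Here we keep "name<TAB>score" lines whose
# score is numeric and at least $cutoff (a hypothetical output format).
sub parse_result {
    my ($cutoff, @lines) = @_;
    my @kept;
    for my $line (@lines) {
        my ($name, $score) = split /\t/, $line;
        next unless defined $score && $score =~ /^-?\d+(?:\.\d+)?$/;
        push @kept, "$name\t$score" if $score >= $cutoff;
    }
    return @kept;
}

# Step 3: save the final result in a file.
sub save_result {
    my ($file, @rows) = @_;
    open(my $out, '>', $file) or die "Cannot open output file -- $file\n";
    print {$out} "$_\n" for @rows;
    close($out);
}

# Stand-in for a real executable: Perl printing three fake result lines.
my @cmd    = ($^X, '-e', 'print "hitA\t42\n", "hitB\t7\n", "hitC\t19\n"');
my @parsed = parse_result(10, run_program(@cmd));
save_result('wrapper_demo.out', @parsed);
print "$_\n" for @parsed;    # prints hitA and hitC with their scores
```

To wrap a real tool, replace @cmd with that program and its arguments and adapt parse_result to its actual output format.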

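How to run BLAST, parse the BLAST output, and then save the result in an array

# Backticks capture the output of the whole pipeline; in list context,
# each output line of blast-parser.pl becomes one element of @blastparsed.
@blastparsed = `echo $oligo | /usr/local/biobin/blastall -p blastn -d $subjectf -F F -W 6 -g F -a 4 | /usr/local/biobin.dev/blast-parser.pl`;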

#! c:/Perl/perl.exe -w
use strict;

my (@line, @parsed, $temp);
my $infile = shift;
my $output = shift;
my $usage  = "Usage:\n$0 <input_file> <output_file>\n";

# Both file names are required.
unless ($infile && $output) {
    print $usage;
    exit;
}

# Open the output file for writing (">") and the input file for reading.
open(OUT, ">$output") || die "Cannot open output file -- $output\n";
open(IN, "$infile") || die "Cannot open input file -- $infile\n";
(... continued)

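# NOTE: $oligo (temporary sequence file) and $opt_t (prefix of the mfold
# output file) are not defined on these slides; under "use strict" they
# must be declared and set before this loop.
while (<IN>) {
    s/\r\n?/\n/;    # normalize Windows/Mac line endings
    chomp;
    @line = split /\t/, $_;    # column 0: sequence name, column 1: sequence

    # Write the current sequence to the temporary file in FASTA format.
    unlink $oligo;    # portable replacement for qx(rm $oligo)
    open(TMP, ">$oligo") || die "Can't open tmp file";
    print TMP ">$line[0]\n$line[1]\n";
    close(TMP);

    # Run mfold on the temporary file, then parse its .out file.
    qx(mfold SEQ=$oligo NA=DNA T=43 NA_CONC=0.6 W=2 MAX=30);
    $temp   = $opt_t . "." . "out";
    @parsed = qx(perl parse_mfold_result.pl -i $temp);
    print OUT "@parsed";
}
close(IN);
close(OUT);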

The use of Perl for gene annotation
With high-throughput sequencing technologies (e.g. Solexa, 454), a single lab can now produce a few terabytes of sequence data per day. The exponentially growing amount of genomic sequence from various species needs to be annotated. Bioinformatics solutions are increasingly required to develop automatic annotation techniques that support and complement the manual curation process.

The generic structure of an automatic genome annotation pipeline and delivery system (Cited from Haili Ping)

Automation of gene and genome annotation pipelines
• Primary goal: to deliver highly accurate and reliable gene and genome annotations using the widest range of evidence from the existing literature and databases.
• Essence: pipelines should contain suites of bioinformatics software tools that can interact with multiple databases and integrate the various related information for a given gene or genome.
• Trend: consensus-based approaches, which combine the results of gene predictors and similarity search methods, are used.
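One very simple way to picture a consensus-based approach is position-wise voting: keep only the regions supported by at least a given number of evidence sources. The sketch below assumes each source reports intervals as [start, end] pairs (1-based, inclusive); the sub name consensus_intervals, the source names and the coordinates are invented for illustration. Real pipelines weigh evidence in more sophisticated ways.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the intervals covered by at least $min_support of the given
# prediction sets. Each set is an array ref of [start, end] pairs.
sub consensus_intervals {
    my ($min_support, @sets) = @_;
    my %votes;    # position => number of sets covering it
    for my $set (@sets) {
        my %seen;    # count each position once per set, even if intervals overlap
        for my $interval (@$set) {
            my ($start, $end) = @$interval;
            $seen{$_} = 1 for $start .. $end;
        }
        $votes{$_}++ for keys %seen;
    }

    # Merge consecutive supported positions back into intervals.
    my @supported = sort { $a <=> $b }
                    grep { $votes{$_} >= $min_support } keys %votes;
    my @merged;
    for my $pos (@supported) {
        if (@merged && $pos == $merged[-1][1] + 1) {
            $merged[-1][1] = $pos;
        } else {
            push @merged, [$pos, $pos];
        }
    }
    return @merged;
}

# Hypothetical evidence: two gene predictors and one similarity-search hit.
my %prediction = (
    predictor_a => [[100, 200], [400, 500]],
    predictor_b => [[120, 210], [405, 480]],
    blast_hit   => [[110, 190]],
);

my @consensus = consensus_intervals(2, values %prediction);
print join("\t", @$_), "\n" for @consensus;
# prints:
# 110	200
# 405	480
```

Counting votes per position is easy to follow but scales with sequence length; for whole chromosomes, a sweep over sorted interval endpoints would be the more economical design.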

Automated annotation pipelines
EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/
NCBI Human Genome Browser: http://proxy.library.uiuc.edu:3367/genome/guide/human/
The Oak Ridge National Laboratories Genome Channel: http://compbio.ornl.gov/channel/
Celera Discovery System: http://cds.celera.com/
Incyte Genomics - Genomics Knowledge Platform: http://www.incyte.com/incyte_science/technology/gkp/
Paracel GeneMatcher2 System: http://www.paracel.com/products/gm2.html

Human genome browsers
UCSC Human Genome Browser: http://genome.cse.ucsc.edu/cgi-bin/hgGateway/
Softberry Genome Explorer: http://www.softberry.com/berry.phtml?topic=genomexp
Viaken Enterprise Ensembl Solution: http://www.viaken.com/ns/solutions/ensembl.html
LabBook Inc. Genomic Explorer Suite: http://www.labbook.com/products/ExplorerSuite.asp
University of Tokyo Gene Resource Locator Browser: http://grl.gi.k.u-tokyo.ac.jp/

Other useful sites
The Institute for Genomic Research (TIGR): http://www.tigr.org/
Human Genome Central: http://www.ensembl.org/genome/central/ and http://proxy.library.uiuc.edu:3528/genome/central/

From raw sequence to gene predictions
• Raw sequence pre-processing
  • Masking known repeats and low-complexity sequences using RepeatMasker
  • Identifying homology matches using BLAST
  • Scanning for other features, such as sequence tagged site (STS) markers and CpG islands
• Gene prediction
  • Predictions based on protein matches
  • Predictions based on DNA sequence
  • Ab initio gene prediction programs
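As an illustration of the feature-scanning step, the sketch below applies a widely used CpG island criterion (GC content above 50% and an observed/expected CpG ratio above 0.6, normally over windows of at least 200 bp). The slides do not say which program or thresholds the pipeline uses, so the criterion, the sub names cpg_stats and scan_cpg, and the toy sequence with its 20 bp window are all assumptions chosen to keep the example short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# GC fraction and observed/expected CpG ratio for one stretch of sequence.
# Obs/Exp = (number of CpG * length) / (number of C * number of G)
sub cpg_stats {
    my ($seq) = @_;
    $seq = uc $seq;
    my $n   = length($seq);
    my $c   = ($seq =~ tr/C//);       # tr with an empty replacement only counts
    my $g   = ($seq =~ tr/G//);
    my $cpg = () = $seq =~ /CG/g;     # number of CG dinucleotides
    my $gc_fraction = $n ? ($c + $g) / $n : 0;
    my $obs_exp     = ($c && $g) ? ($cpg * $n) / ($c * $g) : 0;
    return ($gc_fraction, $obs_exp);
}

# Slide a fixed window along the sequence and report the windows that
# pass both thresholds as [start, end, gc_fraction, obs_exp] (1-based).
sub scan_cpg {
    my ($seq, $window, $step) = @_;
    my @hits;
    for (my $start = 0; $start + $window <= length($seq); $start += $step) {
        my ($gc, $oe) = cpg_stats(substr($seq, $start, $window));
        push @hits, [$start + 1, $start + $window, $gc, $oe]
            if $gc > 0.5 && $oe > 0.6;
    }
    return @hits;
}

# Toy sequence: an AT-rich flank, a CpG-rich core, another AT-rich flank.
my $sequence = ('AT' x 10) . ('CG' x 10) . ('AT' x 10);
my @islands  = scan_cpg($sequence, 20, 20);
printf "%d\t%d\t%.2f\t%.2f\n", @$_ for @islands;
# prints: 21	40	1.00	2.00
```

In practice the window would be 200 bp or longer and overlapping hits would be merged into a single island.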

A simplified schematic of algorithmic gene prediction

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins.

Gene Function Characterization
• Mapping to known genes
  • RefSeq and SWISS-PROT
  • Human Genome Organization (HUGO) (NCBI, UCSC and Ensembl)
• Protein domain annotation
  • Pfam, PRINTS, PROSITE, ProDom, BLOCKS and SMART
  • InterPro project: creates a unique characterization (signature) for a given protein family, domain or functional site. Domains in protein sequences can then be identified using this signature method. InterPro provides the least redundant and most extensive annotation currently available.
• Gene ontology
  • The Gene Ontology (GO) project aims to define common terms that specify molecular function, biological process and cellular location.

Future opportunities
• Comparative genomics: as more genomes are sequenced and become publicly available in the next few years, comparative genomics will become one of the greatest areas of development.
• Cross-species analysis (human-mouse): protein-coding genes are likely to be highly conserved between closely related species (e.g. mouse and human), and other regions, such as RNA genes and regulatory regions, could also be elucidated.
• Needs:
  • the development of bioinformatics tools
  • the integration of such tools with the current automated approaches
  • the design of genome browsers and websites that can intelligently display and annotate comparative results

