BioPerl

BioPerl

Object Oriented Programming Continued – BioPerl Install

System Config file: .ncbirc [NCBI] Data=~/blast/blast-2.2.6/data # this tells blast applications where certain data files are located – such as BLOSUM matrices file: .cshrc (for tcsh/csh) # this line tells blast applications where to look for pre-formatted DB's setenv BLASTDB ~/blast/blast-2.2.6/blastdb # this line is a plain environment variable – adding the blast applications to the "path" variable set path = ( $path ~/blast/blast-2.2.6 ) Example blastall –p blastn –i seq –d yeast.nt

Example – Bio::SearchIO #!/usr/bin/perl # blastPars.pl # taken almost verbatim from perldoc Bio::SearchIO use strict; use Bio::SearchIO; my $in = new Bio::SearchIO(-file => 'bmp4.out'); while( my $result = $in->next_result ) { while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { print "Hit= ", $hit->name, ",Length=", $hsp->length('total'), ",Percent_id=", $hsp->percent_identity, ",hit_string=", $hsp->hit_string, "\n"; } } }

Hit= mm3_dna,Length=14,Percent_id=100,hit_string=ttaattgtaatttt Hit= mm3_dna,Length=13,Percent_id=100,hit_string=cttccctcctccc Hit= mm3_dna,Length=13,Percent_id=100,hit_string=ggcaataacacca Hit= mm3_dna,Length=12,Percent_id=100,hit_string=ccttttaggcca Hit= mm3_dna,Length=12,Percent_id=100,hit_string=tgttttaatcat Hit= mm3_dna,Length=12,Percent_id=100,hit_string=gttattttgttt Hit= mm3_dna,Length=12,Percent_id=100,hit_string=tccttctctttt Hit= mm3_dna,Length=12,Percent_id=100,hit_string=taaactgttaaa Hit= mm3_dna,Length=11,Percent_id=100,hit_string=caaaaggagga Hit= mm3_dna,Length=11,Percent_id=100,hit_string=tcaaagtaaat

HW – Part 2

Installing bioperl Condensed instructions for installing bioperl on CSS (aka ICAEN): (see also INSTALL that comes with bioperl) First – look at www.cpan.org 0) perl –version Note, this will only work with perl version 5.6.0 or higher. (so on CSS, use NX-client download bioperl-1.4.tar.gz (~3 Meg, from www.bioperl.org/DIST, OR www.cpan.org/modules/01modules.index.html, or ftp.cpan.org in /pub/CPAN/modules/by-module/Bio) 1) ftp ftp.cpan.org 1.1) bin 1.2) cd pub/CPAN/modules/by-module/Bio 1.3) get bioperl-1.*.*.tar.gz #bioperl-1.4.tar.gz 1.4) quit 2) gunzip bioperl-1.4.tar.gz 3) tar -xvf bioperl-1.4.tar ( this is approximately 13 meg)

Installing bioperl 3.5) mkdir ~/perl 3.6) mkdir ~/perl/bioperl 3.8) cd bioperl-1.4 4) perl Makefile.PL PREFIX=~/perl/bioperl (if you do it this way -- the "lib" won't work) make make test make install (see installing in private space on next slides) To uninstall, just delete ~/perl/bioperl and ~/perl/bioperl-1.4

5) To use: #!/usr/local/bin/perl use lib "~/perl/bioperl/"; # this is supposed to work ,but did NOT on CSS use Bio::SearchIO; Instead, set environment variable: Bash 5.1) PERL5LIB=~/perl/bioperl/lib/perl5/site_perl/5.10.0; export PERL5LIB Csh 5.1) setenv PERL5LIB ~/perl/bioperl/lib/perl5/site_perl/5.10.0 Mac (bash) 5.1) PERL5LIB=~/perl/bioperl/lib/perl5/site_perl; export PERL5LIB 6) To make docs work (I would just put this in your .cshrc file: Bash: PATH=$PATH:~/perl/bioperl/lib/site_perl/5.10.10; export PATH Csh: set path = ($path ~/perl/bioperl/lib/site_perl/5.10.0) Test with: perldoc Bio::SearchIO 7) Test with sample program FINALLY, please note that the version numbers change over time, and the actual paths may very a little between CPAN and/or bioperl.org It make take some trial and error (it usually does for me).

Try it – Bio::SearchIO #!/usr/bin/perl # blastPars.pl # taken almost verbatim from perldoc Bio::SearchIO use strict; use Bio::SearchIO; my $in = new Bio::SearchIO(-file => 'bmp4.out'); while( my $result = $in->next_result ) { while( my $hit = $result->next_hit ) { while( my $hsp = $hit->next_hsp ) { print "Hit= ", $hit->name, ",Length=", $hsp->length('total'), ",Percent_id=", $hsp->percent_identity, ",hit_string=", $hsp->hit_string, "\n"; } } }

INSTALLING BIOPERL IN A PERSONAL OR PRIVATE MODULE AREA If you lack permission to install perl modules into the standard site_perl/ system area you can configure bioperl to install itself anywhere you choose. Ideally this would be a personal perl directory or standard place where you plan to put all your 'local' or personal perl modules. Note: you _must_ have write permission to this area. Simply pass a parameter to perl as it builds your system specific makefile. Example: perl Makefile.PL LIB=/home/users/dag/My_Local_Perl_Modules make make test make install This tells perl to install bioperl in the desired place, e.g.: /home/users/dag/My_Perl_Modules/Bio/Seq.pm Then in your Bioperl script you would write (NOTE ~/dag/My_Local_Perl_Modules will NOT work): use lib "/home/users/dag/My_Local_Perl_Modules"; use Bio::Seq; To see "perldoc Bio::SearchIO -- you would need to be in directory ~/dag/My_Local_Peral_Modules

SearchIO.pm http://www.bioperl.org/wiki/HOWTO:SeqIO References: http://bioperl.org/ http://bioperl.org/Core/Latest/faq.html http://cpan.org

More notes on bioperl: Windowshttp://bioperl.org/SRC/bioperl-live/INSTALL.WIN 1) Quick instructions for the impatient, lucky, or experienced user. ========================================== Download the ActivePerl MSI from http://www.activestate.com/Products/ActivePerl/ Run the ActivePerl Installer (accepting all defaults is fine). Open a command prompt (Menus Start->Run and type cmd) and run the PPM shell (C:\>ppm). Add two new PPM repositories with the following commands: ppm> rep add Bioperl http://bioperl.org/DIST ppm> rep add Kobes http://theoryx5.uwinnipeg.ca/ppms ppm> rep add Bribes http://www.Bribes.org/perl/ppm Install Bioperl with the following commands: ppm> search Bioperl This returns a numbered list of packages with corresponding version numbers etc. with "Bioperl" in their name. ppm> install <number> Where <number> corresponds to the relevant package and version from the numbered list obtained above. Go to http://www.bioperl.org and start reading documentation.

Another way "cpan" ppm

Windows blast binaries? ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/blast-2.2.14-ia32-win32.exe

Bioperl • large collection of Perl modules (extensions to the Perl language) that aid in the task of writing Perl code • assists with sequence data and associated annotation • access to various types of databases remote (GenBank, EMBL etc) and • local (MySQL, flat files, GFF etc.) for storage and retrieval of sequences. • associated documentation and mailing list (community of bioinformaticists)

Bioperl • "most" bioinformatics and computational biology applications are developed in Unix/Linux environments • more and more programs are being ported to other operating systems like Windows, and many users (often biologists with little background in programming) are looking for ways to automate bioinformatics analyses in the Windows environment. • Perl and Bioperl can be installed natively on Windows NT/2000/XP. • Most of the functionality of Bioperl is available with this type of install

Bioperl • Some programs (BLAST for example) have been ported to Windows. • These can be installed and work quite happily with Bioperl in the native Windows environment. • fairly simple project OR only have access to a computer running Windows, and/or don't mind bumping up against some limitations then Bioperl on Windows may be a good place for you to start. • example, downloading a bunch of sequences from GenBank and sorting out the ones that have a particular annotation or feature works great. • Running a bunch of your sequences against remote or local BLAST, parsing the output and storing it in a MySQL database would be fine also. • Be aware that most Bioperl developers are working in some type of a UNIX environment (Linux, OSX, Cygwin). • If you have problems with Bioperl that are specific to the Windows environment, you may be blazing new ground and your pleas for help on the Bioperl mailing list may get few responses - simply because no one knows the answer to your Windows specific problem. • One solution to this problem that will keep you working on a Windows machine it to install Cygwin, a UNIX emulation environment for Windows.

Bioperl • Perl is a programming language that has been extended a lot by the addition of external modules. • These modules work with the core language to extend the functionality of Perl. • Bioperl is one such extension to Perl. • These modular extensions to Perl sometimes depend on the functionality of other Perl modules and this creates a dependency. • Some Perl modules are so fundamentally useful that the Perl developers have included them in the core distribution of Perl - if you've installed Perl then these modules are already installed

Bioperl • Bioperl is actually a large collection of Perl modules (over 1000 currently) and these modules are split into six packages.

Bioperl Bioperl Group Functions ----------------------------------------------------------------- bioperl (the core) Most of the main functionality of Bioperl. bioperl-run Wrappers to a lot of external programs. bioperl-ext Interaction with some alignment functions and the Staden package. bioperl-db Using bioperl with BioSQL and local relational databases. bioperl-microarray Microarray specific functions. bioperl-gui Some preliminary work on a graphical user interface to some Bioperl functions.

Miscellaneous Various commands and techniques that did not make it into other sections. Useful as a review Valuable (I've used them)

split split /PATTERN/, EXPR, LIMIT split /PATTERN/, EXPR split /PATTERN/ split -- returns an array of strings -- scans the string EXPR -- splits the EXPR string into a list of substrings by delimiters -- delimiters are defined by repeated pattern matching of the regular expression PATTERN -- if it doesn't match, the whole string (EXPR) is returned -- if it matches once, you get 2 strings, etc. -- if PATTERN is omitted, it splits on whitespaces after omitting leading whitespaces (/\s+/ -- if EXPR is omitted, it splits $_ -- If LIMIT is specified, it splits the string into NO MORE than that many fields

Examples @words = split ' ', $text; @tokens = split /[ |,]+/, $text; ($login, $passwd, $remainder) = split /:/, $_, 3; # this splits on ":" – and assigns the first 2 to variables, then the rest is stored in $remainder because of the limit (3)

Split Examples #!/usr/bin/perl $text = "this is a test"; @words = split ' ',$text; foreach (@words) { print "$_\n"; } # this is a test (on separate lines) $text = "this is another, simple test"; @words = split /[ |,]+/,$text; foreach (@words) { print "$_\n"; } # this is another simple test (on separate lines)

splice splice ARRAY, OFFSET, LENGTH, LIST splice ARRAY, OFFSET, LENGTH splice ARRAY, OFFSET -- removes the elements designated by OFFSET and LENGTH, from ARRAY, and replaces them with LIST -- if LENGTH is omitted, then everything after OFFSET is removed -- returns the elements removed

splice examples #!/usr/bin/perl @array = qw/one two 3 4 5 six seven/; @middle = ("three", "four", "five"); print "@array\n"; # one two 3 4 5 six seven splice @array, 2, 3, @middle; print "@array\n"; # one two three four five six seven

glob glob EXPR -- returns the file name expansions of EXPR @files = glob "*"; @files = glob ".* *"; #multiple patterns separated by spaces @files = glob "*.pl";

system system LIST -- executes any program on the system #!/usr/bin/perl system("/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out"); .

back ticks #!/usr/bin/perl $output = `/mnt/r0-blastdb/blast-bin/blastall –p blastn –d /mnt/r0-blastdb/FormattedDBs/nt – i test.txt –o test.out`; print "$output\n"; # command is passed on, and interpreted by the shell # output of command returned

Slides Deprecated

New Version of BPlite • BPlite has actually been "deprecated" • this means that its functionality has been replaced by something else • the code is still available and included, but will not be supported by future versions • Replaced by SearchIO perldoc Bio::Tools::BPlite perldoc Bio::SearchIO

OOP used extensively in BioPerl A subject is a BLAST hit, which should not be confused with an HSP (below). A BLAST hit may have several alignments associated with it. A useful way of thinking about it is that a subject is "analogous" to a gene and HSPs are "analogous" to exons. Subjects have one attribute (name) and one method (nextHSP). An HSP is a high scoring pair, or simply an alignment. Look at: perldoc Bio::Tools::BPlite

BPlite Example – what it looks like to use OOP $report->query; $report->database; while(my $sbjct = $report->nextSbjct) { $sbjct->name; while (my $hsp = $sbjct->nextHSP) { print "querySeq ".$hsp->querySeq."\n"; print "sbjctSeq ".$hsp->sbjctSeq."\n"; print "homologySeq ".$hsp->homologySeq."\n"; } }

use Bio::Tools::BPlite; $blast_file = "Chr16.0.out"; my $report = new Bio::Tools::BPlite('-file' => $blast_file); $rp = $report->query; $db = $report->database; while(my $sbjct = $report->nextSbjct) { $sbjct->name; while (my $hsp = $sbjct->nextHSP) { print "substart ".$hsp->subject->start."\n"; print "subjectend ".$hsp->subject->end."\n"; } }

#!/usr/bin/perl -w # #Input: file_name (blast results file from RPS-BLAST) #Output: list of domain locations relative to database sequence, # and perhaps the genomic sequence with domains emphasized # # Note I broke 5.8.0 in: # /usr/lib/perl5/site_perl # to mimic the fact that students probably do not have bio_perl installed use Bio::Tools::BPlite; use Getopt::Long; if($#ARGV != 3) { die "usage: domainID.pl -bf file -c cutoff_value(0.001)\n"; } &GetOptions("bf=s" => \$blast_file, "c=s" => \$cutoff); my $report = new Bio::Tools::BPlite('-file' => $blast_file); #open(FH,$blast_file); #my $report = new Bio::Tools::BPlite('-fh' => \*FH); $rp = $report->query; print "rp = $rp\n"; $db = $report->database; print "db = $db\n"; while(my $sbjct = $report->nextSbjct) { $sbjct->name; while (my $hsp = $sbjct->nextHSP) { #print "score ".$hsp->score."\n"; #print "bits ".$hsp->bits."\n"; #print "percent ".$hsp->percent."\n"; #print "P ".$hsp->P."\n"; #print "match ".$hsp->match."\n"; #print "positive ".$hsp->positive."\n"; #print "length ".$hsp->length."\n"; #print "querySeq ".$hsp->querySeq."\n"; #print "sbjctSeq ".$hsp->sbjctSeq."\n"; #print "homologySeq ".$hsp->homologySeq."\n"; if($hsp->P <= $cutoff) { print "subjectseqname ".$hsp->subject->seqname."\n"; print "qstart ".$hsp->query->start."\n"; print "qend ".$hsp->query->end."\n"; print "e = ".$hsp->P."\n"; #print "percent = ".$hsp->percent."\n"; #What is this??? print "match = ".$hsp->match."\n"; print "positive = ".$hsp->positive."\n"; print "length = ".$hsp->length."\n"; #print "NT query start ".(3*$hsp->query->start)." (assuming protein input)\n"; #print "NT query qend ".(3*$hsp->query->end)."\n"; #print "querySeq ".$hsp->querySeq."\n"; print "\n"; } #print "substart ".$hsp->subject->start."\n"; #print "subjectend ".$hsp->subject->end."\n"; #print "subjectseqname ".$hsp->subject->seqname."\n"; # $hsp->subject->overlaps($exon); } } # the following line takes you to the next report in the stream/file # it will return 0 if that report is empty, # but that is valid for an empty blast report. # Returns -1 for EOF.

Sample "look" file rp = Random Sequence 500 10 (500 letters) db = blastdb/yeast.nt subjectseqname gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI, complete chromosome sequence qstart 1 qend 16 e = 1.2 match = 16 positive = 16 length = 16 subjectseqname gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI, complete chromosome sequence qstart 321 qend 335 e = 4.7 match = 15 positive = 15 length = 15

./domainID.pl -bf bmp4.out -c 10 -------------------- WARNING --------------------- MSG: SeqFeatureI::seqname() is deprecated. Please use seq_id() instead. --------------------------------------------------- subjectseqname mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none qstart 1 qend 278 e = 1e-110 match = 258 positive = 258 length = 278 -------------------- WARNING --------------------- MSG: SeqFeatureI::seqname() is deprecated. Please use seq_id() instead. --------------------------------------------------- subjectseqname mm3_dna range=chr14:37718906-37723909 5'pad=0 3'pad=0 revComp=FALSE strand=? repeatMasking=none qstart 707 qend 818 e = 3e-32 match = 102 positive = 102 length = 112

End

Notes • Note to self – exploring the percent identity, gapped, and non-gapped would be a great assignment • requires random sequence, alignment (clustalw)

BioPerl

BioPerl

Presentation Transcript

Bioperl modules

The Bioperl toolkit: Perl modules for the life sciences

BioPerl

Beginning BioPerl for Biologists

Bioinformatica BioPerl

BioPerl - documentation

BioPerl – An Overview

BioPerl

Bibliographic Query Service in bioperl

BioPerl

BioPerl

Introducing Bioperl

Installing Bioperl

Introduction to bioperl

BioPerl

Introducción a Bioperl

Bioperl modules

BioPerl