An Introduction to Perl MBG8680 2006 Gerard Tromp
References • Books: • Wall L, Christiansen T, Orwant J. Programming Perl. Sebastopol, CA: O'Reilly, 2000:1-1070 • Cozens S. Advanced Perl Programming. Sebastopol, CA: O'Reilly, 2005:1-281 • Christiansen T, Torkington N. Perl Cookbook. Sebastopol, CA: O'Reilly, 1998:1-757 • Perl manual pages • Web: (not exhaustive – try Google: learning perl) • http://www.oreilly.com/ • http://www.perl.com/ (maintained by O’Reilly) • http://www.cpan.org (Comprehensive Perl Archive Network) • http://learn.perl.org/
What is Perl? • Scripting language • Interpreted at run-time • Developed as improved awk/nawk • Data/Text extraction tool on UNIX • Aho, Weinberger and Kernighan (Bell Laboratories) • A. Aho, B. Kernighan, and P. Weinberger. AWK -- A pattern scanning and processing language. Software Practice and Experience, 9(4):267--280, 1979 • Extremely powerful pattern matching capabilities (regular expression engine)
What is Perl? (2) • Extensible • Modules and Packages • (CPAN: www.cpan.org) • General programming language • Can be used for: • system calls (date, time, sockets, network) • file IO • Complex programming tasks • Genome builds are performed with Perl
What is Perl? The official description. • Perl is a general-purpose programming language originally developed for text manipulation and now used for a wide range of tasks including system administration, web development, network programming, GUI development, and more. • The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). Its major features are that it's easy to use, supports both procedural and object-oriented (OO) programming, has powerful built-in support for text processing, and has one of the world's most impressive collections of third-party modules.
Some important concepts • Perl uses punctuation and some characters to distinguish specific meaning (as do most computer languages). • Train your eyes to note the difference between: • ( ), [ ], { } – important delimiters • $, @, % – important data types – variables
Basics – variables
• Variable syntax:
  • Variables contain data
  • Types
      Type        Sigil  Example  Contents
      scalar      $      $foo     simple value, e.g., string, number
      array       @      @foo     list of values
      hash        %      %foo     paired lists of keys – values
      subroutine  &      &foo     block (chunk) of code that can be called
      typeglob    *      *foo     all things called foo
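The three everyday variable types can be sketched as below. The data and variable names are hypothetical and chosen only for illustration. Note that a single element of an array or hash is itself a scalar, so it is written with $.

```perl
use strict;
use warnings;

# Hypothetical example data; the names are illustrative only.
my $gene  = "BRCA2";                      # scalar: a single value
my @exons = (10, 11, 27);                 # array: an ordered list of values
my %chrom = (BRCA1 => 17, BRCA2 => 13);   # hash: key/value pairs

# One element of an array or hash is a scalar, so it is fetched with $.
print "gene: $gene\n";
print "second exon: $exons[1]\n";         # array indices start at 0
print "chromosome: $chrom{$gene}\n";
print "number of exons: ", scalar(@exons), "\n";
```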
Basics – functions (procedures) • Function syntax: • Perl does not distinguish between functions, procedures and subroutines (other languages do) • Function syntax is defined in the manual pages • perldoc -f FUNCTION (e.g., perldoc -f print) • see perldoc perlfunc • Some functions take no arguments, others take variable or optional arguments, e.g., • print FILEHANDLE LIST (LIST is a list of values) • print LIST • print
Basics – operators
• Operators “do things”:
  • Mathematical
      addition        +    $foo + $bar
      multiplication  *    $foo * $bar
      division        /    $foo / $bar
      subtraction     -    $foo - $bar
      modulus         %    $foo % $bar
      exponentiation  **   $foo ** $bar
Basics – operators (2)
• assignment
  • simple   =   $a = 3; $a = "abc"
  • complex
    • mathematical (starting from $a = 3)
        *=   multiply     $a *= 3     ($a == 9)
        -=   subtract     $a -= 4     ($a == 5)
        +=   add          $a += 5     ($a == 10)
    • string (starting from $a = "abc")
        .=   concatenate  $a .= "d"   ($a eq "abcd")
        x=   repeat       $a x= 3
        ||=  conditional  $a ||= "a"  (assigns only if $a is false)
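The compound assignments on this slide can be traced step by step. This is a minimal sketch. It uses $n, $s and $empty instead of $a (illustrative names, not from the slide) because $a and $b have a special role in sort.

```perl
use strict;
use warnings;

my $n = 3;
$n *= 3;            # multiply: $n is now 9
$n -= 4;            # subtract: $n is now 5
$n += 5;            # add:      $n is now 10

my $s = "abc";
$s .= "d";          # concatenate: $s is now "abcd"
$s x= 2;            # repeat:      $s is now "abcdabcd"

my $empty = "";
$empty ||= "a";     # "" is false, so the default "a" is assigned

print "$n $s $empty\n";
```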
Basics – operators (3) • Logical • and &&, and $a && $b • or ||, or $a || $b • not !, not ! $a • xor xor $a xor $b
Basics – operators (4)
• Test
      Test                      numeric  string
      equality                  ==       eq
      inequality                !=       ne
      less than                 <        lt
      greater than              >        gt
      less than or equal        <=       le
      greater than or equal     >=       ge
      comparison                <=>      cmp
Basics – control
• Flow control (execute while, or until, a condition is met)
  • conditional
      if       if( CONDITION ){ }
               if( CONDITION ){ } elsif( CONDITION ){ } else { }
      unless   unless( CONDITION ){ }
      while    while( CONDITION ){ }
      for      for( $a=1; $a<10; $a++ ){ }
      foreach  foreach( LIST ){ }
Basics – control (2)
• Flow control (inside loops)
  • termination
      next   next; next if ( CONDITION );   skips the rest of the current iteration
      last   last; last if ( CONDITION );   terminates the loop
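Choosing the numeric or the string operator changes the result, which is easiest to see when sorting. A minimal sketch with made-up values:

```perl
use strict;
use warnings;

my @ids = (10, 9, 100);   # hypothetical values

# <=> compares as numbers, cmp compares as strings (character by character).
my @numeric = sort { $a <=> $b } @ids;
my @string  = sort { $a cmp $b } @ids;

print "numeric order: @numeric\n";
print "string order:  @string\n";

# The same two values can be equal as numbers yet different as strings.
print "1.0 == 1 is true\n"        if "1.0" == 1;
print "'1.0' ne '1' is true\n"    if "1.0" ne "1";
```

Here the string sort places 100 before 9 because the character "1" sorts before "9".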
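The difference between next and last can be sketched with a short loop. The variable names are illustrative only.

```perl
use strict;
use warnings;

my @kept;
foreach my $n (1 .. 10) {
    next if $n % 2;    # odd number: skip the rest of this iteration
    last if $n > 6;    # first even number above 6: leave the loop entirely
    push @kept, $n;
}
print "@kept\n";
```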
Text Manipulation in Perl. • Text manipulation was the primary reason for developing Perl originally • The text manipulation “engine” in Perl is an extended Unix Regular Expression (REGEX) • History • Derived from “regular sets” (mathematical language theory) • Part of Unix editors ‘qed’ and ‘ed’ -> grep/egrep • Incorporated into sed, awk (nawk) • Extended in some current versions of Unix (Linux) to reflect the Perl extensions • Incorporated into Java regular expressions
Regular Expressions • Way to specify a set of strings without enumerating each possibility • Way to specify a pattern to match • Distinct syntax • Delimiters • /PATTERN/ traditional • m#PATTERN#, m{PATTERN} with the m prefix, almost any other character can serve as delimiter • ?PATTERN? is a special case that matches only once • Metacharacters • special interpretation of specific characters/character combinations
Regular Expression – Metacharacters (2) • Unix escape characters (metacharacters) • \ – backslash • “escapes” meaning of special (non-alphanumeric) character, e.g., $,%,^ • converts some alphabetical characters into special metacharacters • \n newline • \r carriage return • \t tab • \f form-feed • \a alarm (BEL) • \0 ASCII NULL • \e escape
Regular Expression – Quantifiers • Quantifiers allow specification of how many times the previous character/pattern should be matched • Originally limited in Unix • * match 0 or more times • {Min,Max} match at least Min times and no more than Max times • {Min,} match at least Min times • {,Max} match no more than Max times (accepted by Perl 5.34 and later; write {0,Max} for older versions) • {Count} match exactly Count times
Regular Expression – Capturing • Capturing allows (a portion of) the pattern to be used elsewhere • Originally limited in Unix (awk/sed) • \(PATTERN\) escaped parentheses • Captured pattern(s) stored in buffers: \1, \2 … \9 • For input line: “This is a test” the pattern: /\([Tt]his\).*\(t[es]*t\)/ yields two buffers: \1 == “This”; \2 == “test”
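A minimal sketch of the {Min,Max} quantifier. The helper name two_or_three_a is hypothetical, and the ^ and $ anchors (not covered on this slide) are added so that the whole string has to match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text consists of two or three
# copies of the letter "a" and nothing else, otherwise 0.
sub two_or_three_a {
    my ($text) = @_;
    return $text =~ /^a{2,3}$/ ? 1 : 0;
}

foreach my $text ("a", "aa", "aaa", "aaaa") {
    print "$text\t", (two_or_three_a($text) ? "match" : "no match"), "\n";
}
```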
Regular Expression – Perl Capturing • Capturing allows (a portion of) the pattern to be used elsewhere • In Perl – do NOT escape parentheses • (PATTERN) parentheses • Captured pattern(s) stored in variables: $1, $2 … $n • For input line: “This is a test” the pattern: /([Tt]his).*(t[es]*t)/ yields two variables: $1 == “This”; $2 == “test”
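The capture example on this slide can be run as below. As a sketch, the match is done in list context so that the captured groups are copied into named variables ($first and $second are illustrative names) in addition to $1 and $2.

```perl
use strict;
use warnings;

my $line = "This is a test";

# In list context a successful match returns the captured groups.
my ($first, $second) = $line =~ /([Tt]his).*(t[es]*t)/;

print "first capture:  $first\n";
print "second capture: $second\n";
```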
Perl quotes • Different quote characters have specific meaning and properties. • Interpolation • is the expansion of variables • occurs for some quote types but not others
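A minimal sketch of interpolation, comparing double and single quotes. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $name = "perl";

my $double = "hello $name\n";   # interpolated: $name is expanded, \n is a newline
my $single = 'hello $name\n';   # not interpolated: the characters are kept literally

print $double;
print $single, "\n";
```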
Variable assignment
• $x = "abc";
• @x = ( abc, def, ghi, klm);
• %x = (1, abc, 2, def, 3, ghi, 4, klm);
• (unquoted words such as abc are accepted as strings only when ‘use strict’ is not in effect)
• what does the following produce?
  • print $x, "\n";
  • print $x[3], "\n";
  • print $x{2}, "\n";
• Output:
    abc
    klm
    def
• What happened and why?
A Simple Command-line Script Using an Array
Type each of the following on one line in the PuTTY window (shell window):
    perl -e '@x=(2,5,7,9,11); print "@x\n";'
    perl -e '@x=(2,5,7,9,11); print "$x[4]\n";'
    perl -e '@x=(2,5,7,9,11); foreach $x (@x) {print "$x\n"; }'
NOTE: command-line scripts are tricky since the entire script must be enclosed in single quotes
A Simple Command-line Script Using a Hash
Type each of the following on one line:
    perl -e '%x=(2,5,7,9,11,15); print "%x\n";'
    perl -e '%x=(2,5,7,9,11,15); print "$x{5}\n";'
    perl -we '%x=(2,5,7,9,11,15); print "$x{5}\n";'
    perl -e '%x=(2,5,7,9,11,15); print "$x{7}\n";'
    perl -e '%x=(2,5,7,9,11,15); foreach $x (keys %x) {print "$x\t$x{$x}\n"; }'
A Simple (file) Program
A program to extract specific URL data from HTML generated by NCBI Map Viewer “view as table”
    #! /usr/bin/perl -w

    while(<>){
        if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
            print "$1$2\t$2\t$3\n";
            &mysub($1,$2,$3);
        }
    }

    sub mysub{
        # … do something;
    }
Dissection of a simple program
• Examine program line by line
     1  #! /usr/bin/perl -w
     2
     3  while(<>){
     4      if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5          print "$1$2\t$2\t$3\n";
     6          &mysub($1,$2,$3);
     7      }
     8  }
     9
    10  sub mysub{
    11      # … do something;
    12  }
Dissection of a simple program (2)
• Invocation line
     1  #! /usr/bin/perl -w
• This line ‘starts’ the Perl program
• Syntax is derived from Unix shell script syntax
• #! (pound-bang)
  • tells the Unix shell that the next argument is the name, or path and name, of a program (executable)
• /usr/bin/perl
  • tells the Unix shell which executable (perl) to find in which path (directory location)
• -w
  • “flag(s)” passed to the executable (program)
  • tell the program to “do things” or adopt specific behavior
  • here: turn on perl warnings
Dissection of a simple program (3)
• Control loop and input operator
     3  while(<>){
        # elided lines 4 – 7
     8  }
• while ( CONDITION ) BLOCK
  • execute loop until condition becomes false
  • here CONDITION is <> , an input operator
    • reads from the files named on the command line or, if none are given, from STDIN, the standard input filehandle available to every program
    • reads until the end-of-file, i.e., until no further data
  • BLOCK is a block (chunk) of code
Dissection of a simple program (4)
• if statement – if ( CONDITION ) BLOCK
     4  if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5      print "$1$2\t$2\t$3\n";
     6      &mysub($1,$2,$3);
     7  }
• if ( /PATTERN/ ) BLOCK
  • if PATTERN matches, execute the BLOCK
• href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)<
  • what are literals?
  • what are character classes?
  • what does the pattern match?
Dissection of a simple program (5)
• if statement – if ( CONDITION ) BLOCK
     4  if ( /href=\"(http:.*?list_uids=)(\d+)\">([-\w\*]+?)</ ){
     5      print "$1$2\t$2\t$3\n";
     6      &mysub($1,$2,$3);
     7  }
• print "$1$2\t$2\t$3\n";
  • what does the line do?
  • what are $1, $2 and $3? where are they in the pattern?
• &mysub($1,$2,$3);
  • what is &mysub?
  • what are $1, $2 and $3 with respect to mysub?
Dissection of a simple program (6)
• Subroutine
    10  sub mysub{
    11      # … do something;
    12  }
• sub mysub BLOCK
  • subroutine declaration and code BLOCK
  • BLOCK consists of { code }
  • everything after # is a comment
  • this is a null subroutine – does nothing
• perl does not require declaration of parameters
  • all arguments passed to a subroutine are made available to it as an array – @_
Getting Help
    perldoc            very important program
    perldoc perldoc    how to use perldoc
    perldoc perl       list of available topics*
    perldoc perlintro  useful material like this lecture
    perldoc perlfaq    common questions answered
    perldoc 'topic'    information on specified topic from list above
* perldoc will extract documentation embedded in packages. The list returned by ‘perldoc perl’ is for the base perl installation
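How arguments arrive in @_ can be sketched as below. The subroutine name format_row, its behavior and the example URL are assumptions made for illustration. The slide's mysub is left as an empty placeholder.

```perl
use strict;
use warnings;

# Hypothetical subroutine: copies its three arguments out of @_
# and returns them as one tab-delimited line.
sub format_row {
    my ($url, $id, $name) = @_;
    return join("\t", $url, $id, $name) . "\n";
}

# Hypothetical values standing in for $1, $2 and $3 from the pattern match.
print format_row("http://example.org/?list_uids=672", 672, "BRCA1");
```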
Getting Help (2) • Books – see references • Web – see references • Unix ‘man’ command. • Although perldoc will return help/information for most perl-related items, there are still a few that only have ‘man pages’
Hands-on Problems • Write a Perl script that will do the following. • 1: for chromosome 20, create a tab-delimited list of: • gene names • gene ids (GeneID number) • chromosomal location (beginning, end) • orientation • 2: extend the columns to include (where appropriate): • HUGO HGNC ID • OMIM ID
Improvements on the script(s) • Wouldn’t it be great if you could skip the browsing part and go straight to the web page in Perl? • look at LWP module • http://search.cpan.org/dist/libwww-perl/lib/LWP.pm • What is a ‘module’?
Perl modules
• A module is a collection of scripts (code) that have already been written for you
• Strictly speaking, a module is a collection of one or more packages
• A package is a small collection of code
    package NAME;
    BLOCK
    1;
Perl modules (2) • Why packages? • allows namespace to be uncluttered • keeps related code in one place • allows reusability of code • Modules? • can think of as extended packages • can be procedural (traditional) or object-oriented
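The namespace point can be sketched with two packages that each define a subroutine of the same name without clashing. The package names Gene and Marker and the sub label are hypothetical.

```perl
use strict;
use warnings;

{
    package Gene;      # hypothetical package
    sub label { my ($name) = @_; return "gene:$name"; }
}

{
    package Marker;    # hypothetical package
    sub label { my ($name) = @_; return "marker:$name"; }
}

# Back in the default package (main): the package name selects the version.
print Gene::label("BRCA1"), "\n";
print Marker::label("rs123"), "\n";
```

In a real module each package would normally live in its own .pm file ending in 1; and be loaded with use or require.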
Perl modules (3) • modules that are not shipped with perl must be installed (e.g., from source via CPAN) • module included in script by: • use MODULE; • loads the module at compile time and imports the names it exports • complains immediately if not found • require MODULE; • loads the module at run time • complains only when the require statement is reached
Perl modules (4) • Module allows access to module specific functions (methods) • Some Modules have hundreds of functions • Functions are written as generically as possible to make them extensible
Perl modules (5) • DBI • database interface • abstract database interface that makes database access as generic as possible • DBD::* (e.g., DBD::Oracle, DBD::Sybase, DBD::mysql, DBD::ODBC) • DBI database driver (specific to database or interface) • performs the database-specific calls and allows DBI to ‘hide’ them from the user • interprets DBI generic calls to database in database-specific manner
Modules Insufficient time to delve into these very important bioinformatic modules • DBI • http://search.cpan.org/~timb/DBI-1.51/DBI.pm • BioPerl • http://search.cpan.org/~birney/bioperl-1.4/Bio/Perl.pm • http://www.bioperl.org/wiki/Main_Page • http://www.bioperl.org/wiki/Bptutorial.pl • http://doc.bioperl.org/releases/bioperl-1.4
Homework Problem • You have performed a large-scale SNP genotyping project. • The data are provided to you in a tabular list in the following format: • Some header lines • includes blank lines • column descriptions • columns • Gene ID • Polymorphism ID • Fragment (no data [-]) • Subject ID • Allele 1 • Allele 2
Homework Problem (2) • You have to write a script to transform the data into a wide table that has • Individual ID as rows • Polymorphisms as columns • Genotype data as a string “Allele1/Allele2” • Polymorphisms must be grouped by gene • Genes must be in order (left to right) • Notes • There will be about 5,300 individuals, 200 genes and a total of about 1,300 polymorphisms • the solution is to use hashes and nested hashes
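The nested-hash idea mentioned in the notes can be sketched on a few in-memory rows. This is only an illustration under stated assumptions: the rows, the names (%geno, %snps_by_gene) and the alphabetical gene order are made up. Reading the real file, skipping its header lines, and applying the required gene order are left out.

```perl
use strict;
use warnings;

# Hypothetical rows: gene, polymorphism, fragment, subject, allele 1, allele 2
my @rows = (
    [ "G1", "snp1", "-", "S1", "A", "G" ],
    [ "G1", "snp1", "-", "S2", "A", "A" ],
    [ "G2", "snp2", "-", "S1", "C", "T" ],
);

my (%geno, %snps_by_gene);
for my $row (@rows) {
    my ($gene, $snp, undef, $subject, $a1, $a2) = @$row;
    $geno{$subject}{$snp}      = "$a1/$a2";   # nested hash: subject -> SNP -> genotype
    $snps_by_gene{$gene}{$snp} = 1;           # nested hash: gene -> set of SNPs
}

# Column order: SNPs grouped by gene (alphabetical gene order is an assumption).
my @columns;
for my $gene (sort keys %snps_by_gene) {
    push @columns, sort keys %{ $snps_by_gene{$gene} };
}

print join("\t", "Subject", @columns), "\n";
for my $subject (sort keys %geno) {
    my @cells = map { exists $geno{$subject}{$_} ? $geno{$subject}{$_} : "-" } @columns;
    print join("\t", $subject, @cells), "\n";
}
```

Missing genotypes are printed as "-" here. That placeholder is a choice made for this sketch, not a requirement from the problem statement.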