
An Introduction to Perl, Part 3
CSC8304 – Computing Environments for Bioinformatics - Lecture 9

Objectives
• To introduce the Perl programming language
• Working with files, pattern matching
• Recommended books:
  • SAMS – Teach Yourself Perl in 24 Hours – Clinton Pierce
  • Beginning Perl for Bioinformatics – James Tisdall
• The best way to learn Perl is to read the books and the numerous tutorials, and to practise.
• These notes are not a comprehensive tutorial; reading extra material is essential.

Working with files
• To read or write files in Perl you need to open a filehandle.
• A filehandle is another type of variable; it acts as the link between the program and the operating system.
• STDIN and STDOUT are open by default, and are usually connected to the keyboard and the monitor.
• To access a file on disk you need to create a new filehandle and prepare it by opening it.
• Files are opened using the open function.
• The open function takes a filehandle as its first argument and a pathname as its second. The pathname indicates which file you want to open.
• If open succeeds it returns a nonzero (true) value. If it fails it returns false.
• Syntax:
    open(filehandle, pathname)

Opening files
• e.g.
    if (open(MYFILE, "mydatafile"))
    {
        # Run this if the open succeeds
    }
    else
    {
        print "Cannot open mydatafile!\n";
        exit 1;    # exit leaves Perl; a value of 1 indicates an error
    }

Working with files
• Dealing with errors: use defensive programming and anticipate errors.
• The die function can be used to stop execution and report an error.
    open(MYFILE, "myfile") || die "Cannot open myfile: $!\n";
• The line is read as "open or die".
• If the open does not succeed (it returns false), the logical OR has to evaluate its right-hand argument, so die runs. If the open succeeds (it returns true), the die is never evaluated.
• $! is set to the error condition after a failed system operation.
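
The "open or die" idiom can be tried out safely by trapping the die with eval. This is a minimal sketch, not part of the original slides: the path is a hypothetical one chosen so that the open fails, and the three-argument form of open and the low-precedence "or" are assumptions added here (the slides use the two-argument form with ||).

```perl
use strict;
use warnings;

# Hypothetical path, chosen so that the open fails.
my $path = "no_such_dir/mydatafile";

# eval traps the die, so the script can inspect the message
# instead of stopping.
my $ok = eval {
    open(MYFILE, '<', $path) or die "Cannot open $path: $!\n";
    close(MYFILE);
    1;
};
my $error = $ok ? '' : $@;
print "Caught: $error" unless $ok;
```

Because "or" binds more loosely than ||, it still behaves as intended if the parentheses around the arguments to open are left out; with || the parentheses are required.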

Reading from files
• The most common way to read from an open file is the file input operator, also called the angle operator (<>).
• To read from a filehandle, put the filehandle name inside the angle operator and assign the value to a variable.
    open(MYFILE, "myfile") || die "Cannot open myfile: $!\n";
    $line = <MYFILE>;    # Reads one line from the filehandle
• In a scalar context the angle operator reads one line of input from the file. When the file is exhausted it returns the undefined value (undef).
• To read and print an entire file, use the following loop, where MYFILE is a freshly opened filehandle:
    while (defined($a = <MYFILE>))
    {
        print $a;
    }

Reading from files
• Another shortcut:
• If the angle operator is the only element inside the conditional expression of a while loop, Perl automatically assigns the input line to the special variable $_ and repeats the loop until the input is exhausted.
    while (<MYFILE>)
    {
        print $_;
    }
    close(MYFILE);
• Once finished reading from a file we must close it.

Reading from files
• Another shortcut:
• The list context can be used to assign the whole contents of the file to an array.
• e.g.
    open(MYFILE, "novel.txt") || die "$!";
    @contents = <MYFILE>;    # Fills the array from the file
    close(MYFILE);
• The first line of the file novel.txt is assigned to the first element of @contents, $contents[0], the second line to the second element, $contents[1], and so on.
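
The scalar-context loop and the list-context read can be compared side by side. The sketch below is an illustration under stated assumptions: it creates its own scratch file with the core File::Temp module (the file contents are made up), and it uses the three-argument form of open, which is not in the slides.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file to read back (contents are made up).
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "line one\nline two\nline three\n";
close($tmp) || die "Closing $path: $!";

# Scalar context: one line per iteration, assigned to $_.
my $count = 0;
open(MYFILE, '<', $path) || die "Cannot open $path: $!\n";
while (<MYFILE>)
{
    $count++;
}
close(MYFILE);

# List context: every line at once, one line per array element.
open(MYFILE, '<', $path) || die "Cannot open $path: $!\n";
my @contents = <MYFILE>;
close(MYFILE);

print "Counted $count lines; the first line is $contents[0]";
```

Reading into an array is convenient for small files, but it holds the whole file in memory; the while loop only ever holds one line.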

Writing to files
• To write to a file you must have a filehandle that is open for writing.
• The syntax is almost identical to opening a file for reading:
    open(filehandle, ">pathname")
    open(filehandle, ">>pathname")
• The > tells Perl that the file specified should be overwritten with new data.
• The >> tells Perl to open the file for writing but, if it already exists, to append data to the end of it.
• e.g.
    open(NEWFH, ">output.txt") || die "Opening output.txt: $!";
    open(NEWFH, ">>logfile.txt") || die "Opening logfile.txt: $!";

Writing to files
• The print function is used to put data in the files.
• Closing the file is very important after writing, otherwise you may lose data.
    open(SOURCE, "sourcefile") || die "$!";
    open(DEST, ">destination") || die "$!";
    @contents = <SOURCE>;
    print DEST @contents;
    print DEST "and a little bit more \n";
    close(SOURCE);
    close(DEST);
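
The difference between > and >> is easiest to see by writing to the same file several times. This is a sketch rather than course material: the scratch file comes from the core File::Temp module, the strings written are arbitrary, and the three-argument open is an assumption added here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file; it is removed when the script exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);

# ">" truncates the file each time, so only the second write survives.
for my $text ("first\n", "second\n")
{
    open(DEST, '>', $path) || die "Opening $path: $!";
    print DEST $text;
    close(DEST) || die "Closing $path: $!";
}

# ">>" keeps what is already there and adds to the end.
open(DEST, '>>', $path) || die "Opening $path: $!";
print DEST "appended\n";
close(DEST) || die "Closing $path: $!";

open(SOURCE, '<', $path) || die "Opening $path: $!";
my @lines = <SOURCE>;
close(SOURCE);
print @lines;    # prints "second" then "appended"
```

Checking the return value of close on a file opened for writing is a cheap way to notice data that could not be flushed to disk.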

Regular Expressions (REs) and Pattern Matching
• Perl has powerful facilities for recognising patterns in an input stream and for picking and choosing data based on these patterns.
• Regular expressions are a formal method for describing the patterns to be matched.
• In bioinformatics they are very useful for searching DNA and protein sequences, and for filtering the output of other programs such as BLAST.

Simple Pattern matching using REs
• A simple regular expression (RE) might appear as follows:
    m/bioinformatics/
• m stands for "match".
• This pattern looks for the letters b-i-o-i-n-f-o-r-m-a-t-i-c-s in sequence.
• But where is it looking? The Perl default variable $_ is used unless you specify otherwise.
• If the pattern specified by m// is found anywhere in $_, the match operator returns true.
• Thus pattern matches are most often seen in conditional expressions:
    if (m/bioinformatics/)
    {
        print "Have found the word bioinformatics\n";
    }

Simple Pattern matching using REs
• In practice, if we stick to using / as the delimiter we can leave out the m:
    if (/bioinformatics/)
    {
        print "Have found the word bioinformatics\n";
    }
• Inside a regular expression, characters (sometimes called atoms) match themselves unless they are a "metacharacter".
• The list of metacharacters is ^ $ ( ) \ | @ [ { ? . + *
• If you need to match a metacharacter with its literal value, simply precede the character with the escape character (a backslash):
    /I won \$10000000000 hurray/

Simple Pattern matching using REs
• Variables can also be used in regular expressions:
    $pat = <STDIN>;
    chomp $pat;             # chomp removes any line-ending characters
    $_ = "The magic words";
    if (/$pat/)             # Look for the user's pattern
    {
        print "\"$_\" contains the pattern $pat\n";
    }
• Some points:
  • Normally, matching starts at the left of the target string and moves to the right.
  • A match returns true if and only if the entire pattern can be used to match the target string.
  • The first possible match (the leftmost) is matched first.
  • The largest possible first match is taken. Regular expressions are greedy.

Metacharacters
• The dot . inside an RE matches any single character except the newline character.
• e.g. the pattern /p.t/ matches pot, pat, pit, carpet, python and pup tent.
Quantifiers
• A quantifier is a kind of metacharacter that tells the regular expression how many things to match.
• A quantifier can be placed after any single character or a group of characters.
• The simplest is the + metacharacter.
• The + causes the preceding character to match one or more times, as many times as it can.
• e.g. /do+g/ would match: hounddog, hotdog, doogie, dooooooooogdooog
• But not: badge, doofus, Doogie, pagoda

Quantifiers continued
• The * metacharacter is similar. It causes the preceding character to be matched zero or more times.
• e.g. the RE /t*/ means match as many t's as possible, but if none exist that's okay.
• So /car*t/ would match carted, cat and carrrt, but not carrot, carl or caart.
• The ? metacharacter causes the preceding character to be matched either zero times or once, but no more.
• Using {} allows us to match a specific number of occurrences:
    /x{8}/      x must occur exactly 8 times
    /x{8,}/     x must occur at least 8 times
    /x{0,4}/    x must occur between 0 and 4 times
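
The quantifier examples above can be checked directly. In this sketch the helper subroutine matching() and the word lists for ? and {} are hypothetical additions for illustration; the /do+g/ and /car*t/ word lists come from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words from a list that match a pattern.
sub matching
{
    my ($pattern, @words) = @_;
    return grep { /$pattern/ } @words;
}

my @plus  = matching(qr/do+g/,      qw(hotdog doogie badge doofus Doogie pagoda));
my @star  = matching(qr/car*t/,     qw(carted cat carrrt carrot carl caart));
my @opt   = matching(qr/colou?r/,   qw(color colour colouur));
my @count = matching(qr/ax{2,3}b/,  qw(axb axxb axxxb axxxxb));

print "do+g:      @plus\n";     # hotdog doogie
print "car*t:     @star\n";     # carted cat carrrt
print "colou?r:   @opt\n";      # color colour
print "ax{2,3}b:  @count\n";    # axxb axxxb
```

Note that Doogie fails only because matching is case sensitive: the pattern asks for a lowercase d.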

Character classes
• Perl character classes allow sets and ranges of characters to be matched.
• Character classes are enclosed in square brackets [ ].
• e.g.
    [abcde]         # Match any of a, b, c, d, or e
    [a-e]           # Same as above
    [0-9]           # Match a digit
    [A-Za-z]{5}     # Match any five alphabetic characters
• If the caret (^) occurs as the first character of a character class, the class is negated, i.e. it matches any single character that is NOT in the specified range.
• e.g. /[^A-Z]/ matches any single character that is not an uppercase letter.
• Some common character classes have shorthand forms, e.g.:
    \w    A word character, same as [a-zA-Z0-9_]
    \W    A non-word character (the inverse of \w)
    \d    A digit, same as [0-9]
• etc. See a reference for the complete list.
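
A negated character class is a handy way to screen sequence data. The following is a minimal sketch with made-up sequences and identifiers; it relies only on the default variable $_ and the classes described above.

```perl
use strict;
use warnings;

# Made-up sequences: the second contains the ambiguity code N.
my @seqs = qw(ACGTTGCA ACGNNTCA GATTACA);

my @clean;
foreach (@seqs)
{
    # The negated class matches any character that is not A, C, G or T,
    # so a sequence is kept only when that match fails.
    push @clean, $_ unless /[^ACGT]/;
}

# \d is shorthand for [0-9]: keep the identifiers that contain a digit.
my @with_digits = grep { /\d/ } qw(chr1 chrX contig42);

print "clean:       @clean\n";          # ACGTTGCA GATTACA
print "with digits: @with_digits\n";    # chr1 contig42
```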

Grouping etc.
• In REs, sets of patterns can be grouped.
• e.g. /(fr|b|l|fl|cl)og/ matches frog, bog, log, flog or clog.
• The caret character at the beginning of a regular expression causes the RE to match only at the beginning of the line.
• e.g. /^video/ matches the word video only if it occurs at the beginning of a line.
• Similarly, the dollar sign at the end of the RE causes the pattern to match only at the end of the line.
• e.g. /earth$/ matches earth, but only at the end of the line.
• The substitution operator s/// allows data to be matched and altered.
• Syntax is: s/searchpattern/replacement/
• e.g.
    $_ = "Our house in the middle of our street";
    s/middle/end/;    # $_ is now "Our house in the end of our street"
    s/in/at/;         # $_ is now "Our house at the end of our street"

Multiple Matching
• When you use parentheses ( ) in regular expressions, Perl remembers the portion of the target string matched by each parenthesized expression.
• For example, here is a regular expression which matches well-formed telephone numbers such as 0191-123-7890:
    /(\d{4})-(\d{3})-(\d{4})/
    First group:  match exactly 4 digits and assign the result to $1
    Second group: match exactly 3 digits and assign the result to $2
    Third group:  match exactly 4 digits and assign the result to $3
• Each portion is remembered as $1, $2, $3 respectively.
    $_ = "0191-222-6000";
    if (/(\d{4})-(\d{3})-(\d{4})/)
    {
        print "The area code is $1\n";           # Prints 0191
        print "The local number is $2-$3\n";     # Prints 222-6000
    }
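
The capture variables only hold useful values after a successful match, so it is worth testing the match before reading them. This sketch loops over a few made-up records to show the pattern; the record contents and the @area_codes array are illustrative assumptions.

```perl
use strict;
use warnings;

# Made-up input records; the middle one is not a telephone number.
my @records = ("0191-222-6000", "not a number", "0207-946-0000");

my @area_codes;
foreach (@records)
{
    # $1, $2 and $3 are read only inside the branch where the match succeeded.
    if (/(\d{4})-(\d{3})-(\d{4})/)
    {
        push @area_codes, $1;
        print "area code $1, local number $2-$3\n";
    }
    else
    {
        print "skipping '$_'\n";
    }
}
```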

Working with variables other than $_
• $_ is Perl's special default variable, and REs normally work on it.
• To search any other variable, the RE must be bound to that variable.
• The bind operator is =~
• e.g.
    $weight = "185 lbs";
    $weight =~ s/ lbs//;    # $weight is now "185"
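
Binding works for matches as well as substitutions. The sketch below applies both to a made-up DNA string; the negated bind operator !~ and the g modifier (replace every occurrence, not just the first) are standard Perl but are not covered in the slides, so treat them as additions.

```perl
use strict;
use warnings;

my $dna = "ATGCCGTA";    # made-up sequence

# Bind a match to $dna instead of $_.
my $starts_with_atg = ($dna =~ /^ATG/) ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match.
my $no_gaps = ($dna !~ /N/) ? 1 : 0;

# Bind a substitution to a copy, so that $dna itself is left unchanged.
my $rna = $dna;
$rna =~ s/T/U/g;

print "$dna transcribes to $rna\n";    # ATGCCGTA transcribes to AUGCCGUA
```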

Subroutines & Functions
• A function is a named group of statements that carries out a specific set of instructions and can return a value.
• In Perl, user-defined functions are called subroutines, or subs.
• They can take arguments and return values to the caller.
• Scope can be used to limit the set of variables that are visible.
• Syntax:
    sub subroutine_name
    {
        statement1;
        statement2;
    }

Subroutines & Functions
• Example: a subroutine that prompts the user for an answer
    sub yesno
    {
        print "Are you sure (Y/N)?";
        $answer = <STDIN>;
        chomp $answer;
        return $answer;    # Hand the reply back to the caller
    }
• To invoke or call a subroutine, the syntax can be
    &yesno()
  or
    yesno()
• Both forms can be used anywhere in the program. Only the bare form yesno; (no & and no parentheses) requires the subroutine to have been declared earlier in the code.

Subroutines & Functions
• Arguments or parameters to subroutines are passed by enclosing them in parentheses. Inside the subroutine they arrive in the array @_:
    sub display_score
    {
        ($shots, $goals) = @_;
        print "She has had $shots shots and scored $goals\n";
    }
    display_score(30, 10);
• Scope: the my operator limits the scope of a variable to the enclosing block, here the subroutine:
    sub display_score
    {
        my $min = 90;    # $min is now a private variable
        ($shots, $goals) = @_;
        print "She has had $shots shots and scored $goals in $min minutes\n";
    }
    display_score(30, 10);

Subroutines & Functions
• return can be used to pass a value back from the subroutine explicitly. Without a return statement, the subroutine returns the value of the last expression it evaluated.
    sub shift_to_uppercase
    {
        @words = qw(cia fbi nato unicef);
        foreach (@words)
        {
            $_ = uc($_);
        }
        return (@words);
    }
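
Arguments, private variables and return values can be combined in one small subroutine. This is a sketch under stated assumptions: gc_content() is a hypothetical helper written for illustration, not something taken from the course or the recommended books.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of G and C bases in a sequence.
sub gc_content
{
    my ($seq) = @_;                   # private copy of the first argument
    return 0 if length($seq) == 0;    # avoid dividing by zero

    my $gc = 0;
    foreach my $base (split //, $seq)
    {
        $gc++ if $base =~ /[GCgc]/;
    }
    return $gc / length($seq);
}

my $fraction = gc_content("ATGCGC");           # 4 of the 6 bases are G or C
printf "GC content: %.2f\n", $fraction;        # prints GC content: 0.67
```

Because every variable inside the subroutine is declared with my, calling it cannot overwrite variables of the same name elsewhere in the program.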

Using Strict and Warnings
• There are common tools to help Perl programmers write clean and maintainable code:
  • the strict pragma
  • the warnings pragma
• In Perl 5 onwards they are used as follows:
    use strict;
    use warnings;

Using Strict and Warnings
• The strict pragma checks for unsafe programming constructs.
• strict forces the programmer to declare all variables as package or lexically scoped variables.
• strict also restricts the use of barewords: the programmer needs to put quotes around all strings and to call each subroutine explicitly.
• The warnings pragma reports a warning when Perl detects a possible typographical error or another potential problem.
• There are many possible warnings, but they mainly target the most common mistakes.
• Make sure you use strict in all of your programs.
• For some bioinformatics examples: http://examples.oreilly.com/begperlbio/BeginPerlBioinfo.pm
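
The effect of strict on undeclared variables can be observed without crashing the script by compiling a snippet with a string eval. This is a sketch for demonstration only: the snippet and its misspelled variable $totl are made up, and real programs would simply put use strict at the top and fix whatever it reports.

```perl
use strict;
use warnings;

# A snippet containing a misspelled variable name ($totl instead of $total).
# Single quotes stop Perl from interpolating the variables here.
my $snippet = 'use strict; my $total = 1; $totl = 2; 1;';

# A string eval compiles the snippet at run time; under strict the
# undeclared variable is a compile-time error, reported in $@.
my $ok    = eval $snippet;
my $error = $@;

print "strict refused the snippet: $error" unless $ok;
```

Without strict, the same snippet would run silently and create a new global variable, which is exactly the kind of typo that is hard to track down later.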

Summary
• Files: opening, closing, reading, writing
• Regular expressions
• Pattern matching
• Metacharacters, quantifiers, character classes, multiple matching
• Functions and subroutines
• The strict and warnings pragmas

Q & A – 1
• What is the effect of the following statement?
    open(MYFILE, "myfile") || die "Cannot open myfile: $!\n";
• Is it true that the following two loops are equivalent?
    while (defined($a = <MYFILE>)) { print $a; }
  and
    while (<MYFILE>) { print $_; }

Q & A – 2
• Is it true that the following statement opens a file for complete rewriting?
    open(filehandle, ">>pathname")
• Is it true that, in the context of pattern matching, /do+g/ would match: hounddog, hotdog, doogie, doofus, Doogie, pagoda?
• Is it true that /car*t/ would match carted, cat and carrrt, but not carrot, carl or caart?
• Is it true that [abcde] and [a-e] are not equivalent in the context of pattern matching?

Q & A – 3
• Is it true that &fun() and fun() are equivalent?
• Is it true that the strict pragma forces a programmer to declare all variables as package or lexically scoped variables?
• Is it true that the warnings pragma forces the programmer to call each subroutine explicitly?