Understanding Perl Data Types and Regular Expressions in Bioinformatics

Programming and Perl for Bioinformatics Part III

Basic Data Types
• Perl has three basic data types:
  • scalar
  • array (list)
  • associative array (hash)

Associative Arrays/Hashes
• List of scalar values (like an array)
• Elements are referred to by key, not by index number
• Elements are stored as a list of key-value pairs
    %threeletter = ('A','ALA','V','VAL','L','LEU');
    #               key value  key value  key value
    print $threeletter{'A'};    # "ALA"
    print $threeletter{'L'};    # ?
• exists checks whether a specific hash key exists
    if (exists $threeletter{'E'}) { print $threeletter{'E'}; }    # ?
    print "Exists\n"  if exists  $hash{$key};
    print "Defined\n" if defined $hash{$key};
    print "True\n"    if $hash{$key};
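The three tests on the slide above can give different answers for the same key. The following is a minimal sketch of that difference; the hash %count, its contents, and the helper describe are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical example data: three keys whose values differ in truthiness.
my %count = (
    A => 3,        # exists, defined, true
    C => 0,        # exists and defined, but false
    G => undef,    # exists, but not defined
);

# Return a short description of which tests a key passes.
sub describe {
    my ($key) = @_;
    my @passed;
    push @passed, 'exists'  if exists  $count{$key};
    push @passed, 'defined' if defined $count{$key};
    push @passed, 'true'    if $count{$key};
    return @passed ? join(', ', @passed) : 'none';
}

foreach my $key (qw(A C G T)) {
    print "$key: ", describe($key), "\n";
}
```

A key such as T that was never stored passes none of the tests, which is why exists is the right check when a stored value may legitimately be 0 or undef.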

Getting all keys and values in a hash
    %threeletter = ('A','ALA','V','VAL','L','LEU');
• keys returns a list of all keys
• values returns a list of all values
• each returns one key-value pair each time it's called
    ($key, $val) = each %threeletter;
• Unlike an array, a hash is not an ordered list (the order of key-value pairs is determined by the Perl interpreter)
    foreach $k ( keys %threeletter ) { print "$k "; }
    # Might print, for instance, "A L V",
    # not "A V L" (the keys need not come back sorted)
    foreach $v ( values %threeletter ) { print "$v "; }    # ?
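Because the order returned by keys is not predictable, a common approach is to sort the keys before using them. The sketch below assumes the same %threeletter hash as the slide; the variable names @sorted_keys and $pairs_seen are illustrative only.

```perl
use strict;
use warnings;

my %threeletter = ('A', 'ALA', 'V', 'VAL', 'L', 'LEU');

# sort gives a predictable order regardless of how the hash is stored.
my @sorted_keys = sort keys %threeletter;
foreach my $k (@sorted_keys) {
    print "$k => $threeletter{$k}\n";
}

# each walks the hash one pair at a time, in the hash's internal order.
my $pairs_seen = 0;
while (my ($key, $val) = each %threeletter) {
    $pairs_seen++;
}
print "each returned $pairs_seen pairs\n";
```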

Associative Arrays
• Some common functions:
    keys(%hash)             # returns a list of all the keys
    values(%hash)           # returns a list of all the values
    each(%hash)             # each time this is called, it returns a
                            # 2-element list consisting of the next
                            # key/value pair in the hash
    delete($hash{$key})     # removes the pair associated with $key
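As a brief illustration of delete, the sketch below removes one pair from the %threeletter hash used earlier and checks the result with exists. The variable names $removed and $remaining are illustrative.

```perl
use strict;
use warnings;

my %threeletter = (A => 'ALA', V => 'VAL', L => 'LEU');

# delete removes the pair and returns the value that was stored.
my $removed = delete $threeletter{'V'};

my $remaining = join ' ', sort keys %threeletter;
print "Removed $removed; remaining keys: $remaining\n";
print "V is gone\n" unless exists $threeletter{'V'};
```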

More on Perl
• Subroutines and Functions
  • A way to organize a program
  • Wrap up a block of code
  • Have a name
  • Provide a way to pass values to the block and report back the results
• Regular expressions

Basics about Subroutines
• Define a subroutine
    sub myblock {
        my ($arg1, $arg2, $arg3, …, $argN) = @_;
        # @_ is a special variable containing the arguments
        print "Please enter something: ";
    }
• Function call
    myblock($arg1, $arg2, …, $argN);
• Example
    sub add8A {
        my ($rna) = @_;
        $rna .= "AAAAAAAA";
        return $rna;
    }
    # the original rna
    $rna = "CGAAUCUAGGAU";
    $longer_rna = add8A($rna);
    print "I added 8 As to $rna to get $longer_rna.\n";
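To show how more than one argument arrives through @_, here is a small variation on add8A. The subroutine name add_polyA and its second parameter are hypothetical, introduced only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical generalization of add8A: append $count As to an RNA string.
sub add_polyA {
    my ($rna, $count) = @_;          # copy both arguments out of @_
    return $rna . ('A' x $count);    # x repeats the string $count times
}

my $rna        = "CGAAUCUAGGAU";
my $longer_rna = add_polyA($rna, 8);
print "I added 8 As to $rna to get $longer_rna.\n";
```

Because the subroutine works on its own copy of the argument, the caller's $rna is left unchanged.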

Another example
    sub denaturizing {
        my (@products) = @_;
        my @strands = ();
        foreach my $pairs (@products) {
            my ($A, $B) = split /\s/, $pairs;
            push @strands, $A, $B;
        }
        return @strands;
    }
    # templates are in the form "A B", e.g. "ACGT TGCA"
    @Denatured = denaturizing(@PCRproducts);
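The slide does not show the contents of the input list, so the following sketch runs the same idea on made-up data. The name denaturize, the array @pcr_products, and its two sample entries are assumptions for illustration.

```perl
use strict;
use warnings;

# Split each "forward reverse" pair into two separate strands.
sub denaturize {
    my (@products) = @_;
    my @strands;
    foreach my $pair (@products) {
        my ($forward, $backward) = split /\s+/, $pair;
        push @strands, $forward, $backward;
    }
    return @strands;
}

# Hypothetical input: each element holds two strands separated by a space.
my @pcr_products = ("ACGT TGCA", "GGCC CCGG");
my @denatured    = denaturize(@pcr_products);
print join(", ", @denatured), "\n";
```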

Variable Scope
• A variable $a is used both in the subroutine and in the main part of the program.
    $a = 0;
    print "$a\n";
    sub changeA {
        $a = 1;
    }
    print "$a\n";
    changeA();
    print "$a\n";
• The value of $a is printed three times. Can you guess what values are printed?
• $a is a global variable
• The same program with use strict and my:
    use strict;
    my $a = 0;
    print "$a\n";
    sub changeA {
        my $a = 1;
    }
    print "$a\n";
    changeA();
    print "$a\n";
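The difference between a global variable and a my variable can be seen side by side in the sketch below. All names here ($global_count, $outer_label, change_global, shadow_label) are hypothetical and deliberately different from the slide's $a.

```perl
use strict;
use warnings;

our $global_count = 0;         # package (global) variable
my  $outer_label  = 'outer';   # lexical variable at file scope

# Assigns to the global, so the change is visible to the caller.
sub change_global { $global_count = 1; }

# Declares a new lexical with the same name; the outer one is untouched.
sub shadow_label {
    my $outer_label = 'inner';
    return $outer_label;
}

change_global();
my $returned = shadow_label();
print "global_count = $global_count\n";
print "outer_label = $outer_label, sub returned $returned\n";
```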

Ex: What would be the output?
    #!/usr/bin/perl -w
    $dna = 'AAAAA';
    $result = A_to_T($dna);
    print "I changed all the A's in $dna to T's and got $result\n\n";
    #############################################
    # Subroutines
    sub A_to_T {
        my($input) = @_;
        $dna = $input;
        $dna =~ s/A/T/g;
        return $dna;
    }
Output?

Regular Expressions
• Regular expressions: a language for specifying text strings
• A regular expression is a mechanism for specifying character patterns
• Useful for
  • Finding files by name
  • Finding text in a file
  • Finding (or not finding) interesting text in a string
  • Text-based search and replace
  • Finding and extracting text

Pattern Finding
Problem: find an ORF in a nucleotide sequence
• Look for start (ATG) and stop codons (TAA, TAG, TGA)
• Pattern search operator: m// or //
• $string =~ /<pattern>/ returns true if the pattern matches somewhere in $string, false otherwise
• Example:
    $dna = "GATGCCATGACACTGTTCA";
    if ($dna =~ /ATG/) {
        print "starting codon is there\n";
    }
    else {
        print "no starting codon!\n";
    }
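A match only reports that ATG occurs somewhere. To learn where, one option is a global match (/g) in a while loop together with pos. The sketch below uses the same $dna string as the slide; the array name @starts is illustrative.

```perl
use strict;
use warnings;

my $dna = "GATGCCATGACACTGTTCA";

my @starts;
while ($dna =~ /ATG/g) {
    # pos() is the offset just past the match; subtract the codon length.
    push @starts, pos($dna) - 3;
}
print "ATG found at 0-based offsets: @starts\n";
```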

Regular Expressions
• The * and + operators are named after Stephen Cole Kleene (Kleene star, Kleene plus)
• Optional characters: ?, * and +
  • ? (0 or 1): /colou?r/ matches color or colour
  • * (0 or more): /oo*h!/ matches oh! or ooh! or ooooh!
  • + (1 or more): /o+h!/ matches oh! or ooh! or ooooh!
• Wild card: .
  • /beg.n/ matches begin or began or begun
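The quantifiers can be tried out directly. In the sketch below the patterns are anchored with ^ and $ so that the whole word must match; the list @checks and the test words are assumptions chosen to show where * and + differ.

```perl
use strict;
use warnings;

# Each entry: a word and the pattern to test it against.
my @checks = (
    [ 'color',  qr/^colou?r$/ ],
    [ 'colour', qr/^colou?r$/ ],
    [ 'oh!',    qr/^oo*h!$/   ],
    [ 'h!',     qr/^o+h!$/    ],   # no 'o' at all, so + fails
    [ 'h!',     qr/^o*h!$/    ],   # * allows zero 'o's
    [ 'begun',  qr/^beg.n$/   ],
);

my @results;
foreach my $check (@checks) {
    my ($word, $pattern) = @$check;
    my $outcome = ($word =~ $pattern) ? 'match' : 'no match';
    push @results, $outcome;
    print $word, ' against ', $pattern, ' gives ', $outcome, "\n";
}
```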

Common Regular Expressions
    \t, \n, \r : white-space characters: tab, newline, return
    \s         : match a whitespace character
    x          : character 'x'
    .          : any character except newline
    ^r         : match r at the beginning of a line
    r$         : match r at the end of a line
    r|s        : match either r or s
    (r)        : group characters (to be saved in $1, $2, etc.)
    [xyz]      : character class; in this case, matches either an 'x', a 'y', or a 'z'
    [abj-oZ]   : character class with a range in it; matches 'a', 'b', any letter from 'j' through 'o', or 'Z'
    r*         : zero or more r's, where r is any regular expression
    r+         : one or more r's
    r?         : zero or one r's (i.e., an optional r)
    {name}     : expansion of the "name" definition (lex/flex notation; not a Perl construct)
    rs         : RE r followed by RE s (i.e., concatenation)
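Several of these constructs can be combined in one pattern. The sketch below uses anchors, character classes with ranges, \s, alternation, and capture groups on a made-up record; the input line, its tab-separated layout, and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical tab-separated record: a sequence name, then the sequence.
my $line = "seq42\tATGGCCTAA";

my ($name, $start, $stop);
if ($line =~ /^([a-z]+[0-9]+)\s+(ATG)[ACGT]*(TAA|TAG|TGA)$/) {
    # Each pair of parentheses saves its text in $1, $2, $3 in order.
    ($name, $start, $stop) = ($1, $2, $3);
    print "name=$name start=$start stop=$stop\n";
}
else {
    print "record did not match\n";
}
```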

Exercise
Ex1:
    $dna = "AGGCTCGTACGACG";
    if ( $dna =~ /CT[CGT]ACG/ ) {
        print "I found the motif!!\n";    # ?
    }
Ex2: Find an ORF in a nucleotide sequence (look for start (ATG) and stop codons (TAA, TAG, TGA))
    $dna = "tatggagcctcctgaggctacagccacacctgagccactctaaga";
    ?
