Perl for Bioinformatics: Basic Data Types and Array Manipulations

Programming and Perl for Bioinformatics Part II

Basic Data Types
• Perl has three basic data types:
  • scalar
  • array (list)
  • associative array (hash)

Arrays
• An array (list) is an ordered list of scalar values.
• '@' refers to the entire array; '$' with an index refers to a single element. Indices start at 0.
• Example:
    (1,2,3)                  # array of three values 1, 2, and 3
    ("one","two","three")    # array of three values "one", "two", "three"
    @names = ("mary", "tom", "mark", "john", "jane");
    $names[1];               # extract the 2nd item from @names: "tom"
    @names[1..4];            # extract a sublist from @names: ("tom", "mark", "john", "jane")
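A minimal sketch of element access and slicing on the @names array from this slide. The variable names $second, @sublist, and $last are illustrative, and the negative index is an extra not shown on the slide.

```perl
use strict;
use warnings;

my @names = ("mary", "tom", "mark", "john", "jane");

my $second  = $names[1];       # one element is a scalar, so it is written with $
my @sublist = @names[1..4];    # a slice is a list, so it is written with @
my $last    = $names[-1];      # negative indices count from the end of the array

print "second:  $second\n";    # second:  tom
print "sublist: @sublist\n";   # sublist: tom mark john jane
print "last:    $last\n";      # last:    jane
```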

More on Arrays
    @a = ( );                              # empty list
    @b = (1,2,3);                          # three numbers
    @c = ("Jan","Joe","Marie");            # three strings
    @d = ("Dirk",1.92,46,"20-03-1977");    # a mixed list
• Variables and sublists are interpolated (flattened) in a list:
    @b = ($a, $a+1, $a+2);                        # variable interpolation
    @c = ("Jan", ("Joe","Marie") );               # list interpolation
    @d = ("Dirk", 1.92,46,( ), "20-03-1977");     # empty list interpolation
    @e = ( @b, @c );                              # same as (1,2,3,"Jan","Joe","Marie") when $a is 1
• The range operator ($x..$y) is a practical way to construct lists:
    @x = (1..6);                 # same as (1, 2, 3, 4, 5, 6)
    @y = (2..5, 8, 11..13);      # same as (2,3,4,5,8,11,12,13)
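The flattening and range rules can be checked with a short script. This is a sketch under one assumption: the slide's $a is replaced by a variable named $start, because $a is reserved for use in sort blocks and is best avoided in strict code.

```perl
use strict;
use warnings;

my $start = 1;                              # stands in for $a on the slide
my @b = ($start, $start + 1, $start + 2);   # (1, 2, 3)
my @c = ("Jan", ("Joe", "Marie"));          # the nested list is flattened
my @e = (@b, @c);                           # arrays are flattened as well
my @y = (2..5, 8, 11..13);                  # ranges mixed with single values

print scalar(@c), "\n";   # 3
print "@e\n";             # 1 2 3 Jan Joe Marie
print "@y\n";             # 2 3 4 5 8 11 12 13
```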

Array Example
    # Here's one way to declare an array, initialized with a list of four
    # scalar values.
    @bases = ('A', 'C', 'G', 'T');

    # Now we'll print each element of the array
    print "Here are the array elements:";
    print "\nFirst element: ";
    print $bases[0];
    print "\nSecond element: ";
    print $bases[1];
    print "\nThird element: ";
    print $bases[2];
    print "\nFourth element: ";
    print $bases[3];
• This code snippet prints out:
    Here are the array elements:
    First element: A
    Second element: C
    Third element: G
    Fourth element: T

Print Array
• You can print the elements one after another like this:
    @bases = ('A', 'C', 'G', 'T');
    print "\n\nHere are the array elements: ";
    print @bases;
• It produces the output:
    Here are the array elements: ACGT

Converting a string to an array
• split breaks a string into parts and returns them as a list, which can be stored in an array.
    $dnastring = "ACGTGCTA";
    @dnaarray = split(//, $dnastring);     # @dnaarray is now (A, C, G, T, G, C, T, A)
    @dnaarray = split(/T/, $dnastring);    # @dnaarray is now (ACG, GC, A)

Converting an array to a string
• join combines the elements of an array into a single scalar variable (a string).
    $dnastring = join('', @dnaarray);
• The first argument is the spacer placed between elements (empty here); the second is the array to join.
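As a quick check that split and join undo each other, here is a small round-trip sketch using the DNA string from the split slide. The names @bases, @pieces, $rejoined, and $dashed are illustrative.

```perl
use strict;
use warnings;

my $dnastring = "ACGTGCTA";

my @bases  = split(//, $dnastring);    # one element per character
my @pieces = split(/T/, $dnastring);   # cut at every T; the T itself is dropped

my $rejoined = join('', @bases);       # empty spacer restores the original string
my $dashed   = join('-', @pieces);     # a visible spacer between the pieces

print scalar(@bases), "\n";   # 8
print "@pieces\n";            # ACG GC A
print "$rejoined\n";          # ACGTGCTA
print "$dashed\n";            # ACG-GC-A
```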

Array Manipulations
• reverse: reverses the order of array elements
    @a = (1, 2, 3);
    @b = reverse @a;          # @b = (3, 2, 1)
• split: splits a string into a list/array
    $line = "John Smith 28";
    ($first, $last, $age) = split (/\s/, $line);    # \s matches one whitespace character: space, \t, \n, \f, \r
    $DNA = "ACGTTTGA";
    @DNA = split ("", $DNA);
• join: joins a list/array into a string
    $gene = join ( "", ($exon1, $exon3) );
    $name = join ( "-", ("Zhong", "Hui") );
• scalar: returns the number of elements in @array
    scalar @array;
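These four functions combine naturally. The sketch below reverses a DNA string by splitting it, reversing the list, and joining it again. The /\s+/ pattern is an addition not on the slide: it treats a run of whitespace as one separator, which is often more forgiving than /\s/ when fields are separated by several spaces.

```perl
use strict;
use warnings;

my $dna   = "ACGTTTGA";
my @bases = split(//, $dna);

my $length   = scalar @bases;             # number of elements in the array
my $reversed = join('', reverse @bases);  # split, reverse, join

# Three spaces between the first two fields; /\s+/ still yields three fields.
my ($first, $last, $age) = split(/\s+/, "John   Smith 28");

print "$length\n";              # 8
print "$reversed\n";            # AGTTTGCA
print "$first|$last|$age\n";    # John|Smith|28
```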

Array Manipulations - pop
• You can take an element off the end of an array with pop:
    @bases = ('A', 'C', 'G', 'T');
    $base1 = pop @bases;
    print "Here's the element removed from the end: ";
    print $base1, "\n\n";
    print "Here's the remaining array of bases: ";
    print "@bases";
• which produces the output:
    Here's the element removed from the end: T

    Here's the remaining array of bases: A C G

Array Manipulations - shift
• You can take a base off the beginning of the array with shift:
    @bases = ('A', 'C', 'G', 'T');
    $base2 = shift @bases;    # shift left
    print "Here's an element removed from the beginning: ";
    print $base2, "\n\n";
    print "Here's the remaining array of bases: ";
    print "@bases";
• which produces the output:
    Here's an element removed from the beginning: A

    Here's the remaining array of bases: C G T

Array Manipulations - push
• You can put an element on the end of the array with push:
    @bases = ('A', 'C', 'G', 'T');
    $base2 = shift @bases;
    push (@bases, $base2);    # push returns the number of elements in the array after the push
    print "Here's the element from the beginning put on the end: ";
    print "@bases\n\n";
• It produces the output:
    Here's the element from the beginning put on the end: C G T A

Array Manipulations - unshift
• You can put an element at the beginning of the array with unshift:
    @bases = ('A', 'C', 'G', 'T');
    $base1 = pop @bases;
    unshift (@bases, $base1);
    print "Here's the element from the end put on the beginning: ";
    print "@bases\n\n";
• It produces the output:
    Here's the element from the end put on the beginning: T A C G
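The four operations come in pairs: pop and push work on the end of the array, shift and unshift on the beginning. A small sketch that rotates the array one way and then back again, with the array contents noted after each step (variable names are illustrative):

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');

my $last  = pop @bases;             # $last is T,  @bases is (A, C, G)
unshift @bases, $last;              #              @bases is (T, A, C, G)
my $first = shift @bases;           # $first is T, @bases is (A, C, G)
my $count = push @bases, $first;    # $count is 4, @bases is (A, C, G, T)

print "@bases\n";    # A C G T
print "$count\n";    # 4
```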

Exercise
• Determine the frequency of each nucleotide in:
    $dna = "gaTtACataCACTgttca";
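One possible approach to the exercise, offered as a sketch rather than the intended solution: convert the string to upper case with uc so that 'a' and 'A' are counted together, split it into single bases, and tally them in a hash (the third basic data type). The hash name %count is an assumption.

```perl
use strict;
use warnings;

my $dna = "gaTtACataCACTgttca";

# One counter per nucleotide, all starting at zero.
my %count = (A => 0, C => 0, G => 0, T => 0);

foreach my $base (split(//, uc $dna)) {
    if (exists $count{$base}) {
        $count{$base}++;
    } else {
        print "unexpected character: $base\n";
    }
}

foreach my $base (sort keys %count) {
    print "count of $base = $count{$base}\n";
}
# count of A = 6
# count of C = 4
# count of G = 2
# count of T = 6
```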

Filehandles
File I/O (input/output): reading from and writing to files
• Files are represented in Perl by a filehandle (for clarity, written as a bareword in UPPERCASE)
• Open a file on a filehandle using the open function
  • for reading (input):
      open INFILE, "<datafile.txt";
    or
      open (INFILE, "<datafile.txt");
  • for writing (output), overwriting the file:
      open OUTFILE, ">output";
  • for appending to the end of the file:
      open OUTFILE, ">>output";
• Close a file on a filehandle with the close function:
    close (OUTFILE);
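open returns false when it fails (for example, when the file does not exist), so it is worth checking the result. The sketch below uses the three-argument form of open with a lexical filehandle ($in), a common alternative to the bareword style on this slide. The file name is hypothetical and is assumed not to exist, so that the failure branch runs.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist in the current directory.
my $filename = "no_such_datafile_for_demo.txt";

my $opened = open(my $in, '<', $filename);
my $message;
if ($opened) {
    $message = "Opened $filename for reading";
    close $in;
} else {
    # $! holds the operating system's reason for the failure
    $message = "Cannot open $filename: $!";
}
print "$message\n";
```

In a real script the usual shorthand is open(...) or die "Cannot open $filename: $!"; which stops the program with the same message.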

Special Filehandles
Special "files" that are always "open"
• STDIN (standard input)
  • input from the command window
  • read only
• STDOUT (standard output)
  • output to the command window
  • write only
      print STDOUT "Have fun with Perl!\n";
    or just
      print "Have fun with Perl!\n";

Input from Filehandles
"Angle bracket" input operator
• reads one line of input (up to and including the newline)
• from STDIN:
    print "Enter name of protein: ";
    $line = <STDIN>;
    chomp $line;    # removes \n from end of $line
    print "\nYou entered $line.\n";
• from a file:
    open (INPUTFILE, "prot1.seq");
    $line1 = <INPUTFILE>;    # first line
    chomp $line1;
    $line2 = <INPUTFILE>;    # second line
    # Perl reads files one line at a time
    # … etc

sequences.fasta
>gi|145536|gb|L04574.1|Escherichia coli DNA polymerase III chi subunit gene, complete cds
TAACGGCGAAGAGTAATTGCGTCAGGCAAGGCTGTTATTGCCGGATGCGGCGTGAACGCCTTATCCGACC
TACACAGCACTGAACTCGTAGGCCTGATAAGACACAACAGCGTCGCATCAGGCGCTGCGGTGTATACCTG
ATGCGTATTTAAATCCACCACAAGAAGCCCCATTTATGAAAAACGCGACGTTCTACCTTCTGGACAATGA
CACCACCGTCGATGGCTTAAGCGCCGTTGAGCAACTGGTGTGTGAAATTGCCGCAGAACGTTGGCGCAGC
GGTAAGCGCGTGCTCATCGCCTGTGAAGATGAAAAGCAGGCTTACCGGCTGGATGAAGCCCTGTGGGCGC
GTCCGGCAGAAAGCTTTGTTCCGCATAATTTAGCGGGAGAAGGACCGCGCGGCGGTGCACCGGTGGAGAT
CGCCTGGCCGCAAAAGCGTAGCAGCAGCCGGCGCGATATATTGATTAGTCTGCGAACAAGCTTTGCAGAT
TTTGCCACCGCTTTCACAGAAGTGGTAGACTTCGTTCCTTATGAAGATTCTCTGAAACAACTGGCGCGCG
AACGCTATAAAGCCTACCGCGTGGCTGGTTTCAACCTGAATACGGCAACCTGGAAATAATGGAAAAGACA
TATAACCCACAAGATATCGAACAGCCGCTTTACGAGCACTGGGAAAAGCAGGGCTACTTTAAGCCTAATG
GCGATGAAAGCCAGGAAAGTTTCTGCATCATGATCCCGCCGCCGAA

Determine frequency of nucleotides
• Input file: sequences.fasta
    open (INPUTFILE, "sequences.fasta");    # open the sequence file
    $line1 = <INPUTFILE>;                   # header line (starts with '>'), not counted
    $line2 = <INPUTFILE>;                   # first line of sequence
    $line3 = <INPUTFILE>;                   # second line of sequence
    close (INPUTFILE);
    chomp ($line2, $line3);
    $dna = $line2.$line3;                   # only the first two sequence lines are counted

    $count_A = 0;
    $count_C = 0;
    $count_G = 0;
    $count_T = 0;

    @dna = split '', $dna;
    foreach $base (@dna) {
        if    ($base eq 'A') {$count_A++;}
        elsif ($base eq 'C') {$count_C++;}
        elsif ($base eq 'G') {$count_G++;}
        elsif ($base eq 'T') {$count_T++;}
        else  {print "error!\n";}
    }

    print "count of A = $count_A \n";
    print "count of C = $count_C \n";
    print "count of G = $count_G \n";
    print "count of T = $count_T \n";

Read a File: line by line
    my $my_sequence = '';
    open FILE1, "/u/doej01/prot1.seq";
    while ($line = <FILE1>) {
        chomp($line);
        $my_sequence = $my_sequence.$line;
    }
    close (FILE1);
• Reads the whole file, with the newlines removed, into the variable $my_sequence

Using loops to read in a file
• The while loop keeps executing its block as long as its condition is true, so it keeps reading lines from the file until it runs out.
• The special variable $_ holds the line that was just read.
    my $longsequence = '';
    open FILE, 'exampleprotein.txt';
    while (<FILE>) {
        chomp;                                  # removes the newline from $_
        $longsequence = $longsequence . $_;
    }
    close FILE;
• This reads the whole file and appends each line to the variable $longsequence, one line at a time.
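The same loop can read a complete FASTA record: keep the header line (the one starting with '>') separately and append every other line to the sequence. This is a sketch under stated assumptions: the record is a short made-up example, not real data, and it is read from a string through an in-memory filehandle so the script runs without a file on disk. With a real file, the open line would name the file instead of \$fasta_text.

```perl
use strict;
use warnings;

# Made-up FASTA record standing in for a file on disk.
my $fasta_text = ">demo_sequence hypothetical example\nACGTAC\nGGTTAA\n";
open(my $in, '<', \$fasta_text) or die "Cannot open in-memory file: $!";

my $header   = '';
my $sequence = '';
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) {      # header line: keep it, but do not count it
        $header = $line;
        next;
    }
    $sequence .= $line;       # sequence line: append to the growing string
}
close $in;

# Tally the bases over the whole sequence, not just the first two lines.
my %count = (A => 0, C => 0, G => 0, T => 0);
foreach my $base (split(//, $sequence)) {
    $count{$base}++ if exists $count{$base};
}

print "$sequence\n";                                  # ACGTACGGTTAA
print "A=$count{A} C=$count{C} G=$count{G} T=$count{T}\n";   # A=4 C=2 G=3 T=3
```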

Read a File into an Array
• Rather than read a file one line at a time into a scalar variable, it is often helpful to read the entire file into an array:
    open FILE1, "prot1.seq";
    @DNA = <FILE1>;    # array of strings, one element per line (each still ends in \n)
    close FILE1;
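When a file is read into an array, each element still carries its trailing newline; chomp applied to the whole array removes them all at once. A minimal sketch, assuming a made-up three-line file held in a string and read through an in-memory filehandle so that it runs as-is:

```perl
use strict;
use warnings;

# Made-up file contents standing in for a sequence file on disk.
my $text = "MKV\nLAT\nGGS\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @lines = <$in>;    # list context: every line becomes one array element
close $in;

chomp @lines;         # remove the newline from every element

my $line_count = scalar @lines;
my $sequence   = join('', @lines);

print "$line_count\n";   # 3
print "$sequence\n";     # MKVLATGGS
```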

Writing to a File
• Writing to a file is similar to reading from it
• Put > in front of the file name to open the file for writing:
    open OUTPUT, '>/home/achou/output.txt';
• This creates a new file with that name, or overwrites an existing file
• Use >> instead to append text to an existing file
• print to the file using the filehandle:
    print OUTPUT $myoutputdata;
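A short sketch of the full cycle: write with >, append with >>, then read the file back to confirm both lines are there. To keep it self-contained it writes to a temporary file created with the core File::Temp module instead of a fixed path; that choice, the lexical filehandles, and the three-argument open are assumptions for illustration, not the slide's style.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create an empty temporary file; it is deleted when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# '>' creates the file or overwrites what was there.
open(my $out, '>', $path) or die "Cannot write to $path: $!";
print $out "ACGT\n";
close $out;

# '>>' adds to the end and keeps the existing contents.
open($out, '>>', $path) or die "Cannot append to $path: $!";
print $out "TTGA\n";
close $out;

# Read the file back to check the result.
open(my $in, '<', $path) or die "Cannot read $path: $!";
my @lines = <$in>;
close $in;
chomp @lines;

print "@lines\n";    # ACGT TTGA
```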
