Introduction to Perl

Introduction to Perl Part III:Biological Data Manipulation

Column data • Column delimited data • Often CSV, comma delimited • Tab, space, or other character delimited • Process data in scripts instead of using Excel

GFF formats • 8 columns, tab delimited • seq_id, source, feature, start, end, score, strand, frame, group • http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml • Actually 3 or 4 different versions, GFF3 will hopefully be new emerging standard

BLAST -m9 output • The columns are • Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score • Lines starting with ‘#’ are comments

# TBLASTN 2.2.9 [May-01-2004] # Query: GLEAN_08256_1 pchr_1:join(complement(573214..573255),complement(573124..573187),complement(572985..573058),complement(572786..572934),complement(572639..572731),complement(572221..572582),complement(571948..572165),complement(571753..571899),comp # Database: /data/blast/cryptococcus_neoformans_JEC21.20050114.fa # Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score GLEAN_08256_1 cn-jec21_chr7 59.09 264 75 2 141 371 719953 719162 1.6e-90 298.5 GLEAN_08256_1 cn-jec21_chr7 38.30 188 116 5 399 586 719019 718588 3.7e-20 96.67 GLEAN_08256_1 cn-jec21_chr7 34.81 158 49 3 6 109 720617 720144 2.9e-17 87.04 GLEAN_08256_1 cn-jec21_chr7 34.78 161 76 6 623 754 718322 717846 4.4e-29 86.27 GLEAN_08256_1 cn-jec21_chr7 71.79 39 11 0 592 630 718465 718349 4.4e-29 60.85 GLEAN_08256_1 cn-jec21_chr7 42.47 73 42 2 69 141 720171 719989 1.6e-90 54.30 GLEAN_08256_1 cn-jec21_chr11 30.48 105 73 2 442 546 570850 571152 0.015 38.12 GLEAN_08256_1 cn-jec21_chr12 26.55 113 80 2 463 572 702790 703092 0.650 32.73

Process data, filter by a percent id, print GFF BLAST cols: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score open(IN, $filename) || die $!; while (<IN>) { chomp; my @cols = split(/\t/,$_); next if $cols[2] < 40; my ($start,$end,$strand) = ( $cols[6], $cols[7],‘+’); if( $start > $end ) { ($start,$end,$strand) = ( $end,$start,’-’); } print join(“\t”, $cols[0], ‘BLAST’, ‘HSP’,$start,$end,$cols[10], “Target=$cols[1]+$cols[8]+$cols[9]”); # Target=subject+start+end } GFF cols: seq_id, source, feature, start, end, score, strand, frame, group

Microarray data • Lots of columns, R/G channels • Want to add a log/transform • sort data • get subset of data

Filter rows ORF Name G1 G1.Bkg R1 R1.Bkg F1 G2 G2.Bkg R2 R2.Bkg F2 G3 G3.Bkg R3 R3.Bkg F3 G4 G4.Bkg R4 R4.Bkg F4 G5 G5.Bkg R5 R5.Bkg F5 G6 G6.Bkg R6 R6.Bkg F6 G7 G7.Bkg R7 R7.Bkg F7 G1.Ratio G1.Ratio G2.Ratio G3.Ratio G4.Ratio G5.Ratio G6.Ratio G7.Ratio R1.Ratio R2.Ratio R3.Ratio R4.Ratio R5.Ratio R6.Ratio R7.Ratio G1-Bkg G2-Bkg G3-Bkg G4-Bkg G5-Bkg G6-Bkg G7-Bkg R1-Bkg R2-Bkg R3-Bkg R4-Bkg R5-Bkg R6-Bkg R7-Bkg YHR007C ERG11 8432 2404 7896 1155 0 7173 2561 7484 1984 0 117361598 14679 1323 0 16798 1506 14617 1171 0 12315 1696 9853 914 016111 2667 7599 2445 0 13931 1244 6490 981 0 .89 .84 .76 1.14 1.19 2.61 2.3 1.12 1.19 1.32 .88 .84 .38 .43 6028 4612 10138 15292 10619 13444 12687 6741 5500 13356 13446 8939 5154 5509

Filter rows my $header= <DATA>; my @header_cols = split(/\s+/,$header); my $i = 0; my %header_col_num = map { $_ => $i++ } @header_cols; push @rows; my $index = $header_col_num{‘G2.Ratio’}; while(<DATA>) { my @col = split; if( $col[$index] > 2 ) { push @rows, \@col; } } for my $row ( sort { $a->[$index] <=> $b->[$index] } @rows ) { print $row->[$header_col_num{‘ORF’}], “ “, $row->[$index], “\n”; }

Add a column sub log_2 { return log(shift @_) / log(2); } my $header= <DATA>; my @header_cols = split(/\s+/,$header); my $i = 0; my %header_col_num = map { $_ => $i++ } @header_cols; push @rows; my $index = $header_col_num{‘G2.Ratio’}; while(<DATA>) { my @col = split; my $extra_col = log_2($col[$index]); push @rows, [$col[0], $col[$index], $extra_col]; } for my $row ( sort { $a->[$index] <=> $b->[$index] } @rows ) { print join(“\t”, @$row), “\n”; }

Motif finding with regexps • Want to find a binding site motif in DNA sequence • Find motif in protein sequence

Let’s find SBF binding site • SBF binding site in yeast: • CACGAAA and CGCGAAA • Combine these into • C[AG]CGAAA • Search DNA sequence for these sites

Find one motif my $dna; while(<DATA>) { if(/^>/ ) { last if ( $seen ); $seen = 1; } chomp; $dna .= $_; } if( $dna =~ /(C[AG]CGAAA)/ ) { # found the site but how to # say where it is in the sequence? }

More special variables • ` - back quote (same key as ~) • ‘- single quote (same key as “) • $` - the stuff before the match • $’ - the stuff after the match

Find one motif if( $dna =~ /(C[AG]CGAAA)/ ) { my $location = length($`); printf “$1 found at %d..%d\n”, $location, $location+length($1); }

Find multiple instances while( $dna =~ /(C[AG]CGAAA)/ig ) { my $location = length($`); print “$1 found at $location\n”; }

What about reverse strand? $dna = reverse($dna); $dna =~ tr/CAGT/GTCA/; if( $dna =~ /(C[AG]CGAAA)/ ) { my $location = length($`); printf “$1 found at %d..%d\n”, $location+length($1), $location;}

Making reports • Text reports are great for summarizing output • HTML is an easy and excellent way to summarize output and make it pretty • Allows for linking to other resources

HTML with CGI.pm use CGI qw/:standard/; # equivalent to using namespace std; open(OUT, “>report.html”) || die $!; print OUT header, start_html('Motifs found'), h1('Motifs found'), table(Tr(th([“Motif”,“Chrom”, “Location”])), Tr(td([“CACGAAA”, “I”, “19021..19029”])), ), hr, end_html;

Introduction to Perl