
Intermediate Perl Programming

Todd Scheetz, July 18, 2001

Presentation Transcript


  1. Intermediate Perl Programming Todd Scheetz July 18, 2001

  2. Review of Perl Concepts • Data Types • scalar • array • hash • Input/Output • open(FILEHANDLE, "filename"); • $line = <FILEHANDLE>; • print "$line"; • Arithmetic Operations • +, -, *, /, % • Logical Operations • &&, ||, !
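As a refresher on the input/output items above, here is a minimal sketch that reads lines from a filehandle into an array and prints them. The helper name `read_lines` and the in-memory sample text are illustrative assumptions, not part of the original slides. A real script would open a named file instead.

```perl
use strict;
use warnings;

# Read every line from an already opened filehandle and return them as a list
# (hypothetical helper, named for this example only).
sub read_lines {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        chomp $line;            # drop the trailing newline
        push @lines, $line;
    }
    return @lines;
}

# Demo on an in-memory "file" so the example needs no external data.
my $text = "first line\nsecond line\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @lines = read_lines($fh);
close($fh);

print "Read ", scalar(@lines), " lines\n";
print "$_\n" for @lines;
```

The `or die` check after `open` is worth keeping in real scripts, since a failed open otherwise goes unnoticed until the first read.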

  3. Review of Perl Concepts • Control Structures • if • if/else • if/elsif/else • foreach • for • while

  4. Regular Expressions • A general approach to the problem of pattern matching • REs are a compact method for representing a set of possible strings without explicitly specifying each alternative. • For this portion of the discussion, I will be using {} to represent the scope of a set. • {A} • {A,AA} • {Ø} = the set containing only the empty string (Ø stands for the empty string)

  5. Regular Expressions • In addition, [] will be used to denote possible alternatives. • [AB] = {A,B} • With just these two notations available, we can begin building simple Regular Expressions. • [AB][AB] = {AA, AB, BA, BB} • AA[AB]BB = {AAABB,AABBB}

  6. Regular Expressions • Additional Regular Expression components • * = 0 or more of the specified symbol • + = 1 or more of the specified symbol • A+ = {A, AA, AAA, … } • A* = {Ø, A, AA, AAA, … } • AB* = {A, AB, ABB, ABBB, … } • [AB]* = {Ø, A, B, AA, AB, BA, BB, AAA, … }

  7. Regular Expressions • What if we want a specific number of iterations? • A{2,4} = {AA, AAA, AAAA} • [AB]{1,2} = {A, B, AA, AB, BA, BB} • What if we want any character except one? • [^A] = {B} (when the only symbols are A and B) • What if we want to allow any symbol? • . = {A, B} (again for the two-symbol alphabet) • .* = {Ø, A, B, AA, AB, BA, BB, … }
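The set notation used on these slides can be checked mechanically. The sketch below enumerates every string over the two-symbol alphabet {A, B} up to a chosen length and keeps those that a pattern matches in full. The helper names `all_strings` and `language_of`, and the length limit of 5, are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Enumerate every string over the given alphabet with length 0..$max_len,
# shortest first.
sub all_strings {
    my ($max_len, @alphabet) = @_;
    my @result  = ('');
    my @current = ('');
    for my $len (1 .. $max_len) {
        my @next;
        for my $prefix (@current) {
            push @next, map { $prefix . $_ } @alphabet;
        }
        push @result, @next;
        @current = @next;
    }
    return @result;
}

# Return the strings (up to $max_len) that the pattern matches completely.
# \A and \z anchor the match to the whole string.
sub language_of {
    my ($pattern, $max_len) = @_;
    return grep { /\A(?:$pattern)\z/ } all_strings($max_len, 'A', 'B');
}

for my $pattern ('[AB][AB]', 'AA[AB]BB', 'AB*', 'A{2,4}', '[AB]{1,2}', '[^A]') {
    my @set = map { $_ eq '' ? '(empty)' : $_ } language_of($pattern, 5);
    print "$pattern = {", join(', ', @set), "}\n";
}
```

Because the enumeration is cut off at a fixed length, patterns with `*` or `+` only show the beginning of their (infinite) sets.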

  8. Regular Expressions • All of these operations are available in Perl • Several “shortcuts” • \d = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} • \w = a word character (letter, digit, or underscore) • \s = a whitespace character • \w+\s\w+ = {…, Hello World, … }
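A short sketch of the shortcut classes in use. The helper names `is_two_words` and `digit_runs` and the sample strings are hypothetical, chosen only to illustrate `\w`, `\s`, and `\d`.

```perl
use strict;
use warnings;

# True if the string is exactly two "words" separated by one whitespace
# character, i.e. it matches \w+\s\w+ from start to end.
sub is_two_words {
    my ($s) = @_;
    return $s =~ /^\w+\s\w+$/ ? 1 : 0;
}

# Return every run of consecutive digits found in the string.
# In list context, a /g match returns all captured groups.
sub digit_runs {
    my ($s) = @_;
    return $s =~ /(\d+)/g;
}

for my $candidate ('Hello World', 'Hello', 'Hello big World') {
    print "'$candidate': ", (is_two_words($candidate) ? 'match' : 'no match'), "\n";
}
print join(', ', digit_runs('Rn.123 and Bt.45')), "\n";
```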

  9. Pattern Matching • Perl supports built-in operations for pattern matching, substitution, and character replacement • Pattern Matching • if($line =~ m/Rn\.\d+/) { • ... • } • In Perl, a pattern matches if it occurs anywhere in the string; it does not have to cover the whole string. Anchors restrict where it may match: • ^ - beginning of string • $ - end of string

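  10. Pattern Matching • Back references: parentheses capture the text they match, and the first captured group is available afterwards as $1.
      if($line =~ m/(Rn\.\d+)/) {
        $UniGene_label = $1;
      }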
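The following sketch combines capturing with anchoring. The dot is written as `\.` so that it matches a literal period rather than any character. The helper names and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Return the first UniGene-style rat label (e.g. "Rn.1234") found anywhere
# in the line, or undef when the line contains none.
sub unigene_label {
    my ($line) = @_;
    if ($line =~ m/(Rn\.\d+)/) {
        return $1;              # text captured by the parentheses
    }
    return undef;
}

# True only when the label sits at the very start of the line (^ anchor).
sub starts_with_label {
    my ($line) = @_;
    return $line =~ m/^Rn\.\d+/ ? 1 : 0;
}

for my $line ('ID Rn.42 some title', 'Rn.7 leads this line', 'no label here') {
    my $label = unigene_label($line);
    print "'$line' -> ", (defined $label ? $label : 'none'),
          ", anchored: ", starts_with_label($line), "\n";
}
```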

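  11. Regular Expressions
      $file = "my_fasta_file";
      open(IN, $file) or die "Cannot open $file: $!";
      $line_count = 0;
      # every FASTA record begins with a header line that starts with ">"
      while($line = <IN>) {
        if($line =~ m/^>/) {
          $line_count++;
        }
      }
      close(IN);
      print "There are $line_count FASTA sequences in $file.\n";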

  12. Pattern Matching • UniGene data file
      ID          Bt.1
      TITLE       Cow casein kinase II alpha …
      EXPRESS     ;placenta
      PROTSIM     ORG=Caenorhabditis elegans; …
      PROTSIM     ORG=Mus musculus; PROTGI=…
      SCOUNT      2
      SEQUENCE    ACC=M93665; NID=g162776; …
      SEQUENCE    ACC=BF043619; NID=…
      //
      ID          Bt.2
      TITLE       Bos taurus cyclin-dependent …
      ...

  13. Pattern Matching Let’s write a small Perl program to determine how many clusters there are in the Bos taurus UniGene file.
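One possible version of such a program is sketched below. It assumes, based on the excerpt shown on the previous slide, that each cluster record begins with a line starting with `ID` followed by a `Bt.` label. The helper name `count_clusters` and the abbreviated in-memory sample are illustrative. A real run would open the downloaded UniGene data file instead.

```perl
use strict;
use warnings;

# Count UniGene clusters by counting the "ID" lines that open each record.
sub count_clusters {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ m/^ID\s+Bt\.\d+/;
    }
    return $count;
}

# Demo on a two-record sample held in memory (field values abbreviated).
my $sample = join "\n",
    'ID          Bt.1',
    'TITLE       Cow casein kinase II alpha',
    'SCOUNT      2',
    '//',
    'ID          Bt.2',
    'TITLE       Bos taurus cyclin-dependent',
    '//',
    '';
open(my $sample_fh, '<', \$sample) or die "Cannot open sample: $!";
print "There are ", count_clusters($sample_fh), " clusters.\n";
close($sample_fh);
```

Anchoring the pattern with `^` matters here: a title that happens to contain the text "ID Bt.9" would otherwise be counted as a cluster.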

  14. Pattern Matching Now we’ll build a Perl program that can write an HTML file containing some basic links based on the Bos taurus UniGene clustering. Important: http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank
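A minimal sketch of the HTML writer follows. It rests on two assumptions that the slides do not state: that the number after `NID=g` on a SEQUENCE line is the value to substitute for `GID_HERE`, and that one heading per cluster with one link per sequence is an acceptable page layout. The helper names `entrez_url` and `clusters_to_html` are hypothetical.

```perl
use strict;
use warnings;

# URL template from the slide; GID_HERE is the placeholder to fill in.
my $URL_TEMPLATE = 'http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi'
                 . '?cmd=Retrieve&db=Nucleotide&list_uids=GID_HERE&dopt=GenBank';

# Build the Entrez URL for one identifier.
sub entrez_url {
    my ($gid) = @_;
    (my $url = $URL_TEMPLATE) =~ s/GID_HERE/$gid/;
    return $url;
}

# Read UniGene records from a filehandle and return a small HTML page:
# one heading per cluster ID, one link per SEQUENCE line.
sub clusters_to_html {
    my ($fh) = @_;
    my $html = "<html><body>\n";
    while (my $line = <$fh>) {
        if ($line =~ m/^ID\s+(\S+)/) {
            $html .= "<h3>$1</h3>\n";
        }
        elsif ($line =~ m/^SEQUENCE\s+ACC=([^;\s]+);\s*NID=g(\d+)/) {
            my ($acc, $gid) = ($1, $2);
            my $url = entrez_url($gid);
            $url =~ s/&/&amp;/g;    # escape ampersands inside the HTML attribute
            $html .= qq{<a href="$url">$acc</a><br>\n};
        }
    }
    $html .= "</body></html>\n";
    return $html;
}

# Demo on an in-memory sample based on the excerpt from the slides.
my $sample = "ID          Bt.1\nSEQUENCE    ACC=M93665; NID=g162776;\n//\n";
open(my $sample_fh, '<', \$sample) or die "Cannot open sample: $!";
print clusters_to_html($sample_fh);
close($sample_fh);
```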

  15. Substitution • Pattern matching is useful for counting or indexing items, but to modify the data, substitution is required. • Substitution searches a string for a PATTERN and, if found, replaces it with REPLACEMENT. • $line =~ s/PATTERN/REPLACEMENT/; • The operation returns the number of times the pattern was found and replaced. • $result = $line =~ s/PATTERN/REPLACEMENT/;

  16. Substitution • Substitution can take several different options. • specified after the final slash • The most useful are • g - global (can substitute at more than one location) • i - case insensitive matching • $string = "One fish, Two fish, Red fish, Blue fish."; • $string =~ s/fish/dog/g; • print "$string\n"; • Output: One dog, Two dog, Red dog, Blue dog.
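The return value and the two options can be seen together in the sketch below. The wrapper `replace_all` is a hypothetical helper written for this example. It treats the search text as a literal string by quoting it with `\Q...\E`.

```perl
use strict;
use warnings;

# Replace every occurrence of the literal string $from with $to.
# Returns the modified copy and the number of replacements made.
sub replace_all {
    my ($text, $from, $to, $ignore_case) = @_;
    my $count = $ignore_case
        ? ($text =~ s/\Q$from\E/$to/gi)
        : ($text =~ s/\Q$from\E/$to/g);
    # s/// yields a false value when nothing matched; report that as 0.
    return ($text, $count || 0);
}

my ($new, $n) = replace_all("One fish, Two fish, Red fish, Blue fish.", "fish", "dog");
print "$new ($n replacements)\n";

my ($mixed, $m) = replace_all("Fish fish FISH", "fish", "dog", 1);
print "$mixed ($m replacements)\n";
```

Without the `g` option the count can never exceed 1, because the substitution stops after the first match.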

  17. Substitution • Example: Removing leading and trailing white-space • $line =~ s/^\s*(.*?)\s*$/$1/; • A *? performs a minimal match: it stops at the first point at which the remainder of the expression can be matched. • $line =~ s/^\s*(.*)\s*$/$1/; • This statement will not remove trailing white-space; the greedy .* captures the white space, so it is retained in $1.
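The difference between the minimal and the greedy form is easy to verify side by side. The sketch below wraps both statements from the slide in small functions and adds a two-step alternative that is not from the slides but is a common way to write the same trim. All three function names are assumptions.

```perl
use strict;
use warnings;

# Minimal match: (.*?) gives up characters so that \s* can take the
# trailing white-space.
sub trim_minimal {
    my ($line) = @_;
    $line =~ s/^\s*(.*?)\s*$/$1/;
    return $line;
}

# Greedy match: (.*) swallows the trailing white-space, so it survives in $1.
sub trim_greedy {
    my ($line) = @_;
    $line =~ s/^\s*(.*)\s*$/$1/;
    return $line;
}

# Alternative (not from the slides): strip each end in its own substitution.
sub trim_two_step {
    my ($line) = @_;
    $line =~ s/^\s+//;
    $line =~ s/\s+$//;
    return $line;
}

my $input = "  abc def  ";
print "minimal:  [", trim_minimal($input),  "]\n";
print "greedy:   [", trim_greedy($input),   "]\n";
print "two-step: [", trim_two_step($input), "]\n";
```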

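  18. Character Replacement • A similar operation to substitution is character replacement. • $line =~ tr/a-z/A-Z/; (convert to upper case) • $count_CG = $line =~ tr/CG/CG/; (count the C and G characters; the string is unchanged) • $line =~ tr/ACGT/TGCA/; (complement a DNA sequence in a single pass) • The same complement cannot be produced by successive substitutions, because each later substitution also rewrites characters produced by an earlier one: • $line =~ s/A/T/g; • $line =~ s/C/G/g; • $line =~ s/G/C/g; • $line =~ s/T/A/g;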
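A brief sketch contrasting `tr` with the chain of substitutions. The function names are hypothetical, and `reverse_complement` is an extra illustration that goes slightly beyond what the slide shows.

```perl
use strict;
use warnings;

# Complement a DNA sequence in one pass with tr.
sub complement {
    my ($seq) = @_;
    (my $comp = $seq) =~ tr/ACGT/TGCA/;
    return $comp;
}

# The four substitutions from the slide, applied in order. Later steps
# rewrite characters produced by earlier steps, so the result is wrong.
sub complement_by_substitution {
    my ($seq) = @_;
    $seq =~ s/A/T/g;
    $seq =~ s/C/G/g;
    $seq =~ s/G/C/g;
    $seq =~ s/T/A/g;
    return $seq;
}

# Count C and G characters; tr returns the number of characters it matched.
sub count_cg {
    my ($seq) = @_;
    return ($seq =~ tr/CG/CG/);
}

# Reverse complement: complement, then reverse the string.
sub reverse_complement {
    my ($seq) = @_;
    return scalar reverse complement($seq);
}

my $dna = "AACGT";
print "complement:         ", complement($dna), "\n";
print "by substitution:    ", complement_by_substitution($dna), "\n";
print "C/G count:          ", count_cg($dna), "\n";
print "reverse complement: ", reverse_complement($dna), "\n";
```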

  19. Character Replacement
      $count_CG = 0;
      $count_AT = 0;
      while($line = <IN>) {
        # tr returns the number of matching characters; add them up over all lines
        $count_CG += ($line =~ tr/CG/CG/);
        $count_AT += ($line =~ tr/AT/AT/);
      }
      $total = $count_CG + $count_AT;
      $percent_CG = 100 * ($count_CG/$total);
      print "The sequence is $percent_CG% CG.\n";

  20. Subroutines • One of the most important aspects of programming is dealing with complexity. • A program that is written in one large section is generally more difficult to debug. • Thus a major strategy in program development is modularization: break the program up into smaller portions that can each be developed and tested independently. • This makes the program more readable and easier to maintain and modify.

  21. Subroutines • EXAMPLE: • Reading in sequences from UniGene.all.seq file • Multiple FASTA sequences in a single file, each annotated with the UniGene cluster they belong to. • GOAL: • Make an output file consisting only of the longest sequence from each cluster.

  22. Subroutines • ISSUES: • 1. Want to design and implement a usable program • 2. Use subroutines where useful to reduce complexity. • 3. Minimize the memory requirements. • (human UniGene seqs > 2 GB)
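One way to meet these three goals is sketched below. It depends on assumptions that the slides do not confirm: that each FASTA header carries its cluster in a `/ug=` tag (for example `/ug=Bt.1`), and that all sequences of a cluster are adjacent in the file. Under those assumptions only the current record and the best record of the current cluster need to be held in memory. All subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Extract the cluster ID from a FASTA header line (assumed "/ug=" tag).
sub cluster_of {
    my ($header) = @_;
    return $header =~ m{/ug=(\S+)} ? $1 : undef;
}

# Length of a record's sequence, ignoring line breaks and other white-space.
sub sequence_length {
    my ($body) = @_;
    (my $seq = $body) =~ s/\s+//g;
    return length $seq;
}

# Stream FASTA records from $in and print the longest record of each
# cluster to $out. Records without a cluster tag are skipped.
sub write_longest_per_cluster {
    my ($in, $out) = @_;
    my ($best_cluster, $best_record, $best_len);
    my ($header, $body);

    # Compare the record just completed with the best one of its cluster.
    my $finish = sub {
        return unless defined $header;
        my $cluster = cluster_of($header);
        return unless defined $cluster;
        my $len = sequence_length($body);
        if (defined $best_cluster && $cluster ne $best_cluster) {
            print {$out} $best_record;      # cluster changed: emit the old best
            undef $best_cluster;
        }
        if (!defined $best_cluster || $len > $best_len) {
            ($best_cluster, $best_record, $best_len) = ($cluster, $header . $body, $len);
        }
    };

    while (my $line = <$in>) {
        if ($line =~ m/^>/) {
            $finish->();                    # the previous record is complete
            ($header, $body) = ($line, '');
        }
        elsif (defined $header) {
            $body .= $line;
        }
    }
    $finish->();
    print {$out} $best_record if defined $best_cluster;
}

# Demo on a small in-memory file with invented sequences.
my $demo = ">seq1 /ug=Bt.1\nACGT\n>seq2 /ug=Bt.1\nACGTACGT\n>seq3 /ug=Bt.2\nAC\n";
open(my $demo_in, '<', \$demo) or die "Cannot open demo data: $!";
write_longest_per_cluster($demo_in, \*STDOUT);
close($demo_in);
```

If the clusters were not contiguous in the file, this approach would need a hash keyed by cluster ID, which trades the small memory footprint for generality.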
