Text Operations

Text Operations, by Liqi Gao. Topics: utilities (C/C++ library, Perl (ActivePerl), regular expressions, EditPlus / UltraEdit, Excel); the C/C++ standard library (read a line, remove a CR or LF, split a line); the C++ Boost library (case conversion, trimming, replace algorithm, finding algorithm).


  1. Text Operations Liqi Gao

  2. Utilities • C/C++ Library • Perl (ActivePerl) • Regular Expressions • EditPlus / UltraEdit • Excel

  3. C/C++ Language • Standard library: read a line; remove a CR or LF; split a line • C++ Boost Library: case conversion; trimming; replace algorithm; finding algorithm; split

  4. C/C++: Read a Line • Though it’s simple, it’s useful! • Three methods:
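The three methods shown on the slide did not survive in the transcript. As a rough sketch, here are three common ways to read a file or stream line by line in C and C++. The function names and the 1024-byte buffer size are assumptions made for illustration and are not taken from the slides.

```cpp
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Method 1: std::getline into a std::string (no fixed buffer size).
std::vector<std::string> read_lines_getline(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Method 2: istream::getline into a fixed char buffer.
// A line longer than the buffer sets failbit, which ends this loop early.
std::vector<std::string> read_lines_buffer(std::istream& in) {
    std::vector<std::string> lines;
    char buf[1024];  // assumed maximum line length
    while (in.getline(buf, sizeof buf)) {
        lines.push_back(buf);
    }
    return lines;
}

// Method 3: C-style fgets. Unlike the two above, it keeps the trailing '\n'.
std::vector<std::string> read_lines_fgets(std::FILE* fp) {
    std::vector<std::string> lines;
    char buf[1024];  // assumed maximum line length
    while (std::fgets(buf, sizeof buf, fp) != nullptr) {
        lines.push_back(buf);
    }
    return lines;
}
```

The first two functions accept any `std::istream`, so they work equally with `std::ifstream`, `std::cin`, or a `std::istringstream`.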

  5. C/C++: Remove CR/LF • Reading a line on Windows and Linux platforms: Windows text files end each line with CR LF, Linux text files with LF only

  6. C/C++: Remove CR/LF (cont.) • The stray CR: a carriage return (\r) left at the end of a line
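One way to make line handling independent of the platform is to strip any trailing CR or LF after reading. The helper below is a minimal sketch; the name `chomp` is borrowed from Perl for illustration and is not from the slides.

```cpp
#include <string>

// Remove every trailing '\r' or '\n' from the end of a line.
// Characters in the middle of the string are left untouched.
std::string chomp(std::string line) {
    while (!line.empty() && (line.back() == '\n' || line.back() == '\r')) {
        line.pop_back();
    }
    return line;
}
```

This matters most when a file written on Windows is read on Linux: `std::getline` removes the LF but leaves the CR at the end of each string.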

  7. C/C++: Split a Line • Split a line by a specific character

  8. C/C++: Split a Line (cont.) • Split a line
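The slide's own splitting code is not preserved in the transcript. The following is a sketch of one standard-library approach that splits on a single delimiter character and keeps empty fields; `split_line` is a hypothetical name.

```cpp
#include <string>
#include <vector>

// Split `line` at every occurrence of `delim`.
// Empty fields are kept, so "a,,b" yields {"a", "", "b"}.
std::vector<std::string> split_line(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = line.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start));  // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;  // continue after the delimiter
    }
    return fields;
}
```

Keeping empty fields is a design choice that suits tab- or comma-separated data, where column positions must stay aligned.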

  9. C++ Boost: Case Conversion • to_upper: Convert a string to upper case • to_lower: Convert a string to lower case
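For readers without Boost installed, the same effect can be sketched with the standard library alone. This is a stand-in and not the Boost implementation; the names `upper_copy` and `lower_copy` are hypothetical, and the conversion follows the default C locale, so it handles ASCII letters only.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Return an upper-case copy of `s`.
std::string upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Return a lower-case copy of `s`.
std::string lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}
```

The cast to `unsigned char` in the lambda parameter avoids undefined behavior when `std::toupper` or `std::tolower` receives a negative `char` value.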

  10. C++ Boost: Trimming & Replace

  11. C++ Boost: Split • split(): splits the input into parts

  12. Regular Expression • Regular expressions are a powerful tool for string operations.

  13. An Example • Find: *\([0-9/ ]+\) *[0-9\.\?]+% Replace with: (empty) • Find: ^( *)([0-9]+)( *) Replace with: \2\t
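The second pattern can be tried outside an editor with C++ `std::regex`. This sketch assumes the slide shows an editor-style find and replace, where `\2` refers to the second capture group; in `std::regex_replace` the same group is written `$2`. The function name `number_then_tab` is hypothetical.

```cpp
#include <regex>
#include <string>

// Replace optional leading spaces, a number, and optional trailing spaces
// at the start of a line with just the number followed by a tab.
std::string number_then_tab(const std::string& line) {
    static const std::regex pattern("^( *)([0-9]+)( *)");
    return std::regex_replace(line, pattern, "$2\t");
}
```

A line that does not begin with a number is returned unchanged, because the pattern is anchored with `^`.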

  14. An Introduction to Perl • Excels at pattern search and text manipulation (Practical Extraction and Report Language) • Open source / free software • Cheap! Free and available for all systems • can use and install without restriction • open source promotes portability • vastly expandable through freely available modules (add-on libraries at the CPAN repository) • fewer restrictions/lower cost for commercial use • can buy fancy development tools if desired • centralized source, linear development path avoids vendor vicissitudes and incompatibilities!

  15. Perl (not compiled) vs. C (compiled) • Perl source, run directly by the Perl interpreter: #!/usr/bin/perl $x = 6e9; print "Hello world!\n"; printf "All %d of you!\n", $x; • C source, translated by the C compiler into a binary executable: #include <stdio.h> int main() { double x; x = 6e9; printf("Hello world!\n"); printf("All %.0f of you!\n", x); return 0; } • Output of both: Hello world! All 6000000000 of you! • Source code: plain text (ASCII), human readable, human editable, platform independent • Binary executable: NOT human readable, NOT human editable, NOT platform independent!

  16. A Taste of Perl: print a message • perltaste.pl: Greet the entire world. • #!/usr/bin/perl -w (command interpretation header) • $x = 6e9; (variable assignment statement) • print "Hello world!\n"; printf "All %d of you!\n", $x; (function calls: output statements)

  17. Scalar Values: Numerical Values • integer: 5, "3", 0, -307 • floating point: 6.2e9, -4022.33 • hexadecimal/octal: 0x0d4f, 0477 • NOTE: numbers largely behave like double-precision floating-point values, although Perl also keeps native integers internally where it can

  18. String Values • Double-quoted: interpolates (replaces a variable name or control character with its value) • Single-quoted: no interpolation done (as-is) • Quoting operators: qq//, qw//, etc. • Examples, given $day = "Monday"; • "Happy Monday!\n" → Happy Monday!<NL> • "Happy $day!\n" → Happy Monday!<NL> • 'Happy Monday!\n' → Happy Monday!\n • 'Happy $day!\n' → Happy $day!\n

  19. String Manipulation: Concatenation • $dna1 = "ACTGCGTAGC"; $dna2 = "CTTGCTAT"; • Juxtapose in a string assignment or print statement: $new_dna = "$dna1$dna2"; • Use the concatenation operator '.': $new_dna = $dna1 . $dna2; • Add segments serially using incremental concatenation: $new_dna = $dna1; $new_dna .= $dna2; (shorthand for: $new_dna = $new_dna . $dna2;)

  20. Substitution • DNA transcription: T → U • Substitution operator s///: $dna = "GATTACATACACTGTTCA"; $rna = $dna; $rna =~ s/T/U/g; # "GAUUACAUACACUGUUCA" (the /g modifier replaces every T, not only the first) • Exercise: Start with $dna = "gattACataCACTgttca"; and do the same as above. Print out $rna to the screen.

  21. transcribe.pl: $dna = "gattACataCACTgttca"; $rna = $dna; $rna =~ s/T/U/g; print "DNA: $dna\n"; print "RNA: $rna\n"; • Does it do what you expect? If not, why not? • Patterns in substitution are case-sensitive! • What can we do? • Convert all letters to upper (or lower) case (preferred when possible) • If we want to retain mixed case, use the transliteration operator tr///: $rna =~ tr/tT/uU/;

  22. Case conversion • $string = "acCGtGcaTGc"; • Upper case: $dna = uc($string); # "ACCGTGCATGC", or $dna = uc $string; or $dna = "\U$string"; • Lower case: $dna = lc($string); # "accgtgcatgc", or $dna = "\L$string"; • Sentence case: $dna = ucfirst(lc($string)); # "Accgtgcatgc", or $dna = "\u\L$string";

  23. Perl in NLP • Dictionary lookup • Word frequency • Chinese word segmentation • POS tagging • … • Whatever else you need

  24. Case study

  25. Thanks for your attention
