Chapter 9: Text Processing

Chapter 9: Text Processing Pattern MatchingData Compression

Outline and Reading • Strings (§9.1.1) • Pattern matching algorithms • Brute-force algorithm (§9.1.2) • Knuth-Morris-Pratt algorithm (§9.1.4) • Regular Expressions and Finite Automata • Data Compression • Huffman Coding • Lempel-Ziv Compression

Motivation: Bioinformatics • The application of computer science techniques to genetic data • See Gene-Finding notes • Many interesting algorithm problems • Many interesting ethical issues!

A string is a sequence of characters Examples of strings: Java program HTML document DNA sequence Digitized image An alphabet S is the set of possible characters for a family of strings Example of alphabets: ASCII Unicode {0, 1} {A, C, G, T} Let P be a string of size m A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j A prefix of P is a substring of the type P[0 .. i] A suffix of P is a substring of the type P[i ..m -1] Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P Applications: Regular expressions Programming languages Search engines Biological research Strings

Pattern matching • Suppose you want to find repeated ATs followed by a G in GAGATATATATCATATG. • How do you express that pattern to find? • How can you find it efficiently? • How if the strings were billions of characters long?

Finite Automata and Regular Expressions • How do I match perl-like regular expressions to text? • Important topic: regular expressions and finite automata. • theoretician: regular expressions are grammars that define regular languages • programmer: compact patterns for matching and replacing

Regular Expressions • Regular expressions are one of • a literal character • a (regular expression) – in parentheses • a concatenation of two REs • the alternation (“or”) of two REs, denoted + in formal notation • the closure of an RE, denoted * (ie 0 or more occurrences) • Possibly additional syntactic sugar • Examples abracadabra abra(cadabra)* = {abra, abracadabra, abracadabracadabra, … } (a*b + ac)d (a(a+b)b*)* t(w+o)?o [? means 0 or 1 occurrence in Perl] aa+rdvark [+ means 1 or more occurrences in Perl]

Finite Automata • Regular language: any language defined by a RE • Finite automata: machines that recognize regular languages. • Deterministic Finite Automaton (DFA): • a set of states including a start state and one or more accepting states • a transition function: given current state and input letter, what’s the new state? • Non-deterministic Finite Automaton (NFA): • like a DFA, but there may be • more than one transition out of a state on the same letter (Pick the right one non-deterministically, i.e. via lucky guess!) • epsilon-transitions, i.e. optional transitions on no input letter

REs in common use • Syntactic sugar: • [a-c,x-z]: match one of a, b, c, x, y, z • [^abc]: match a character that is not an a, b, or c • .: match any character • ?: match 0 or 1 instances of what preceeded • \s: match a whitespace character • ^, $: match the beginning or end of string • ([pattern]): make [pattern] available in substitutions as $1, $2, etc

Examples • Perl examples (and other languages): $input =~ s/t[wo]?o/2/; $input =~ s|<link[^>]*>\s*||gs; $input =~ s|\s*\@font-face\s*{.*?}||gs; $input =~ s|\s*mso-[^>"]*"|"|gis; $input =~ s/([^ ]+) +([^ ]+)/$2 $1/; $input =~ m/^[0-9]+\.?[0-9]*|\.[0-9]+$/; ($word1,$word2,$rest) = ($foo =~ m/^ *([^ ]+) +([^ ]+) +(.*)$/); $input=~s|<span[^>]*>\s*<br\s+clear="?all[^>]*>\s*</span>|<br clear="all"/>|gis;

Multiples of 3? • /^([0369]|[258][0369]*[147]|[147]([0369]|[147][0369]*[258])*[258]|[258][0369]*[258]([0369]|[147][0369]*[258])*[258]|[147]([0369]|[147][0369]*[258])*[147][0369]*[147]|[258][0369]*[258]([0369]|[147][0369]*[258])*[147][0369]*[147])*$/

DFA for (AT)+C • Note that DFA can be represented as a 2D array, DFA[state][inputLetter]  newstate • DFA: state letter newstate 0 A 1 0 TCG 0 1 T 2 1 ACG 0 2 C 4 [accept] 2 GT 0 2 A 3 3 T 2 3 AGC 0 4 AGCT 0

RE  NFA • Given a Regular Expression, how can I build a DFA? • Work bottom up. • Letter: • Concatenation: • Or: Closure:

RE  NFA Example • Construct an NFA for the RE(A*B + AC)D A A* A*B A*B + AC (A*B + AC)D

NFA -> DFA • Keep track of the set of states you are in. • On each new input letter, compute the new set of states you could be in. • The set of states for the DFA is the power set of the NFA states. • I.e. up to 2n states, where there were n in the DFA.

Recognizing Regular Languages • Suppose your language is given by a DFA. How to recognize? • Build a table. One row for every (state, input letter) pair. Give resulting state. • For each letter of input string, compute new state • When done, check whether the last state is an accepting state. • Runtime? O(n), where n is the number of input letters • Another approach: use a C program to simulate NFA with backtracking. Less space, more time.

Data Compression: Intro • Suppose you have a text, abracadabra. Want to compress it. • How many bits required? at 3 bits per letter, 33 bits. • Can we do better? • How about variable length codes? • In order to be able to decode the file again, we would need a prefix code: no code is the prefix of another. • How do we make a prefix code that compresses the text?

Huffman Coding • Note: Put the letters at the leaves of a binary tree. Left=0, Right=1. Voila! A prefix code. • Huffman coding: an optimal prefix code • Algorithm: use a priority queue. insert all letters according to frequency if there is only one tree left, done. else, a=deleteMin(); b=deleteMin(); make tree t out of a and b with weight a.weight() + b.weight(); insert(t)

Huffman coding example • abracadabra frequencies: • a: 5, b: 2, c: 1, d: 1, r: 2 • Huffman code: • a: 0, b: 100, c: 1010, d: 1011, r: 11 • bits: 5 * 1 + 2 * 3 + 1 * 4 + 1 * 4 + 2 * 2 = 23 • Follow the tree to decode – Q(n) • Time to encode? • Compute frequencies – O(n) • Build heap – O(1) assuming alphabet has constant size • Encode – O(n)

Huffman coding summary • Huffman coding is very frequently used • (You use it every time you watch HTDV or listen to mp3, for example) • Text files often compress to 60% of original size (depending on entropy) • In real life, Huffman coding is usually used in conjunction with a modeling algorithm…

Data compression overview • Two stages: modeling and entropy coding • Modeling: break up input into tokens or chunks (the bigger, the better) • Entropy Coding: use shorter bit strings to represent more frequent tokens • If P is the probability of a code element, the optimal number of bits is –lg(P)

Lempel-Ziv Modeling • Consider compressing text • Certain byte strings are more frequent than others: the, and, tion, es, etc. Model these with single tokens • Build a dictionary of the byte strings you see; the second time you see a byte string, use the dictionary entry

Lempel-Ziv Compression • Start with a dictionary of 256 entries for the first 256 characters • At each step, • Output the code of the longest dictionary match and delete those characters from input • Add previous token plus last letter as new dictionary entry with code 256, 257, 258, … • Note that code lengths grow by one bit as dictionary reaches size 512, 1024, 2048, etc.

Lempel-Ziv Example ABRACADABRA Output Add to Dictionary 1 (A) 2 (B) AB 5 (R) BR 1 (A) RA 3 (C) AC 1 (A) CA 4 (D) AD 6 (AB) DA 8 (RA) ABR Dictionary: • A • B • C • D • R • AB • BR • RA • AC • CA • AD • DA • ABR

Lempel-Ziv Variations • All compression algorithms like zip, gzip use variations on Lempel-Ziv • Possible variations: • Fixed-length vs. variable length codes or adaptive Huffman or arithmetic coding • Don’t add duplicate entries to the dictionary • Limit the number of codes or switch to larger ones as needed • Delete less frequent dictionary entries or give frequent entries shorter codes

How about this approach: • Repeat • for each letter pair occurring in the text, try: • replace the pair with a single new token • measure the total entropy (Huffman-compressed size) of the file • if that letter pair resulted in the greatest reduction in entropy so far, remember it • permanently substitute new token for the pair that caused the greatest reduction in entropy • until no more reductions in entropy are possible • Results: compression to about 25% for big books: better than gzip, zip. [But not as good as bzip!]

Compression other data • Modeling for audio? • Modeling for images?

Modeling for Images? Wikipedia

JPEG, etc. • Modeling: convert to the frequency domain with DCT • Throw away some high-frequency components • Throw away imperceptible components • Quantize coefficients • Encode the remaining coefficients with Huffman coding • Results: up to 20-1 compressionwith good results, 100-1 with recognizable results • How the DCT changed the world…

Data compression results Best algorithms compress text to 25% of original size, but humans can compress to 10% • Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize • Intelligence ≈ pattern recognition ≈ data compression? • Going further: Data-Compression.com

Ethical issues on algorithms • Back to an issue from the start of class: Can algorithms be unethical?

Chapter 9: Text Processing