LING/C SC/PSYC 438/538

LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong

Administrivia • Homework • out today • Due next Monday (September 20th) by midnight

Shortest vs. Greedy Matching • default behavior • in Perl RE match: longest possible matching string • aka “greedy matching” • This behavior can be changed, see following slide • RE search is supposed to be fast • but search time is not necessarily proportional to the length of the input being searched • in fact, Perl RE matching can take exponential time (in the length of the input) • non-deterministic • may need to backtrack (revisit earlier choices) if it matches incorrectly part of the way through • [Figure: time vs. length, contrasting linear and exponential growth]

Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) { print "got <$1>\n"; }
• Notes: • $_ is the default variable for matching • $1 refers to the parenthesized part of the match (.*) • Output: • got <d is under the bar in the >

Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html • Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*?)bar/ ) { print "got <$1>\n"; }
• Notes: • ? immediately following a repetition operator like * makes the operator work in non-greedy mode • Output: • got <d is under the >
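
The two slides above can be compared side by side. The following is a minimal sketch; the helper name capture_between is a hypothetical name introduced for illustration and is not part of the perlre documentation.

```perl
use strict;
use warnings;

# Return the text captured between "foo" and "bar", or undef if no match.
# capture_between is an illustrative helper; $greedy selects (.*) or (.*?).
sub capture_between {
    my ($text, $greedy) = @_;
    my $re = $greedy ? qr/foo(.*)bar/ : qr/foo(.*?)bar/;
    return $text =~ $re ? $1 : undef;
}

my $text = "The food is under the bar in the barn.";
printf "greedy:     got <%s>\n", capture_between($text, 1);
printf "non-greedy: got <%s>\n", capture_between($text, 0);
```

The greedy pattern stretches to the last "bar" (inside "barn"), while the non-greedy pattern stops at the first one.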

Split • @array = split /re/, string • splits string into a list of substrings split by re. Each substring is stored as an element of @array. • Examples (from perlrequick tutorial):
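
A minimal sketch of split in the spirit of the perlrequick tutorial. The helper names words_of and fields_of and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Break a string into words on runs of whitespace.
sub words_of {
    my ($string) = @_;
    return split /\s+/, $string;
}

# Break a comma-separated record into fields, ignoring spaces after commas.
sub fields_of {
    my ($string) = @_;
    return split /,\s*/, $string;
}

my @words  = words_of("Calvin and Hobbes");
my @fields = fields_of("1.618,2.718,   3.142");
print scalar(@words),  " words: @words\n";
print scalar(@fields), " fields: @fields\n";
```

Each call returns a list, so the result can be assigned directly to an array as in @array = split /re/, string.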

Split • More examples: m!re! (using !, or some other character, as the RE delimiter) is equivalent to /re/
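
An alternative delimiter is handy when the pattern itself contains a slash. A minimal sketch, assuming a Unix-style path as input; path_parts is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split a path on "/" using "!" as the RE delimiter, which avoids
# writing the escaped form /\//. path_parts is an illustrative name.
sub path_parts {
    my ($path) = @_;
    return split m!/!, $path;
}

my @parts = path_parts("/usr/bin/perl");
print join(",", map { "<$_>" } @parts), "\n";
```

Because the path starts with the separator, the first element returned is an empty string.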

Words and Lines • Range abbreviations: • period (.) stands for any character (except newline) • \d (digit) = [0-9] • \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF) • \w (word character) = [0-9a-zA-Z_] • uppercase versions, e.g. \D and \W, denote negation • Line-oriented metacharacters: • caret (^) at the beginning of a regexp string matches the “beginning of a line” • dollar sign ($) at the end of a regexp string matches the “end of the line” • Word-oriented metacharacters: • a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z] • \b matches a word boundary, i.e. the position between a word character and a non-word character, such as the beginning of a line or the position next to a whitespace character
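
A minimal sketch of the word- and line-oriented metacharacters in use. The helper names count_word and is_all_digits are assumptions introduced for illustration.

```perl
use strict;
use warnings;

# Count whole-word occurrences of $word in $text. \b keeps "bar" from
# matching inside "barn"; \Q...\E quotes any metacharacters in $word.
sub count_word {
    my ($word, $text) = @_;
    my $count = () = $text =~ /\b\Q$word\E\b/g;
    return $count;
}

# Return 1 if the line consists only of digits, else 0.
# ^ and $ anchor the pattern at both ends of the line.
sub is_all_digits {
    my ($line) = @_;
    return $line =~ /^\d+$/ ? 1 : 0;
}

my $text = "The food is under the bar in the barn.";
print "whole-word bar: ", count_word("bar", $text), "\n";
print "438 is all digits: ", is_all_digits("438"), "\n";
print "438/538 is all digits: ", is_all_digits("438/538"), "\n";
```

Note that $ also matches just before a trailing newline, so a line read from a file does not need to be chomped for is_all_digits to succeed.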

Homework • Theme: dealing with raw text • File: data/written_1/journal/slate/3/Article247_499.txt • (ANC – American National Corpus: 100 million words) • Genre: journal, (Slate Magazine article from 1998) • Sample: • Really Juvenile Reynolds • USA • Today and the Washington Post lead with revelations from newly disclosed • R.J. Reynolds internal documents that seem to show that the company has • persistently attempted to market cigarettes to teens. This is also the top • national story at the Los Angeles Times . The New York Times • leads with the U.N. Security Council's vote telling Iraq to honor previous • promises to allow U.N. inspectors complete access to suspected weapons • sites. • The new tobacco documents (many of them marked "Secret"), released as part • of a lawsuit settlement, show a company strategy of attracting teenagers • through advertising and various youth-oriented promotions such as, according to • USAT , "NASCAR sponsorship," "inner city activities," and "T-shirts and • other paraphernalia." And says USAT , the documents show that RJR's • introduction of "Joe Camel" fits in to this strategy.

Homework • One of the first steps in processing raw text is to clean and mark it up (xml) • Task 1 438/538 (15pts) • write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt (download from class webpage) • See next slide for output format • Discuss what the technical problems are with sentence boundary markup and describe your solution. • e.g. what regular expressions you are going to use • Submit your program and its output on Article247_499.txt
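
To see why sentence boundary markup is a technical problem, consider a deliberately naive rule. The sketch below is an assumption made for illustration, not a recommended solution: it treats every ., ? or ! followed by whitespace or the end of the text as a sentence end. The name naive_sentence_count and the sample text (adapted from phrases in the article) are hypothetical.

```perl
use strict;
use warnings;

# Naive heuristic: count ., ? or ! when followed by whitespace or end of text.
# This over-counts on abbreviations such as "R.J." and "U.N.".
sub naive_sentence_count {
    my ($text) = @_;
    my $count = () = $text =~ /[.?!](?=\s|\z)/g;
    return $count;
}

my $text = "R.J. Reynolds internal documents show a strategy. "
         . "The U.N. Security Council voted.";
print naive_sentence_count($text), "\n";
```

The sample contains two sentences, but the naive rule reports four, because the periods ending "R.J." and "U.N." are also followed by a space.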

Homework Help • Useful code fragment • use previously described template:
    open($txtfile, $ARGV[0]) or die "$ARGV[0] not found!\n";
    while ($line = <$txtfile>) {
        # do RE stuff with $line
    }
• Example: perl processfile.pl Article247_499.txt

Homework Help • <$txtfile> reads in a line of text including the newline (\n) character • so lines are one character longer than you might think • The real world is messy • Article247_499.txt is not quite uniform: sentences are split across lines, and it may contain extra whitespace and invisible characters that a regular text editor does not show • The file Article247_499.txt you are given is actually not quite raw text • I’ve pre-converted it to ASCII (UTF-8) for you to make life a bit easier • Original was in UTF-16 (big-endian) with nasty non-printable BOM (U+FEFF) and null characters
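
A minimal sketch of dealing with the trailing newline and stray whitespace on a line. The helper name clean_line and the sample string are assumptions introduced for illustration.

```perl
use strict;
use warnings;

# Remove the line ending and any leading/trailing whitespace.
# clean_line is an illustrative helper name.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;   # drop an LF or CRLF line ending
    $line =~ s/^\s+//;      # drop leading spaces and tabs
    $line =~ s/\s+$//;      # drop trailing spaces and tabs
    return $line;
}

my $raw = "  sites. \t\n";
print length($raw), " characters before, ",
      length(clean_line($raw)), " after\n";
```

Here the raw line has 11 characters, counting the spaces, the tab and the newline, while the cleaned text "sites." has 6.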

Homework Help • You will need to determine how you’re going to pattern match paragraph separators and end of sentences. • See: Input Delimiter, http://www.bayview.com/blog/2002/07/29/input-delimiter/
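
One possible way to handle paragraph separators is Perl's input record separator $/. Setting it to the empty string switches to "paragraph mode", where each read returns a block of text ending at one or more empty lines. The sketch below is one option under the assumption that paragraphs are separated by truly empty lines; read_paragraphs and the sample text are hypothetical, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Read paragraphs from an open filehandle, assuming empty lines separate them.
sub read_paragraphs {
    my ($fh) = @_;
    local $/ = "";              # paragraph mode: empty line(s) end a record
    my @paragraphs;
    while (my $para = <$fh>) {
        $para =~ s/\s+\z//;     # strip the trailing newlines of the record
        push @paragraphs, $para if length $para;
    }
    return @paragraphs;
}

# Demonstration on an in-memory file (a stand-in for Article247_499.txt).
my $sample = "Really Juvenile Reynolds\n\n"
           . "USA\nToday and the Washington Post lead.\n\n\n"
           . "The new tobacco documents.\n";
open(my $fh, '<', \$sample) or die "cannot open in-memory file: $!\n";
my @paras = read_paragraphs($fh);
close $fh;
print scalar(@paras), " paragraphs\n";
```

Paragraph mode only recognizes lines that are completely empty. A "blank" line that contains spaces or other invisible characters is not treated as a separator, so messy input may need cleaning first.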

Homework • Sample: • Really Juvenile Reynolds • USA • Today and the Washington Post lead with revelations from newly disclosed • R.J. Reynolds internal documents that seem to show that the company has • persistently attempted to market cigarettes to teens. This is also the top • national story at the Los Angeles Times . The New York Times • leads with the U.N. Security Council's vote telling Iraq to honor previous • promises to allow U.N. inspectors complete access to suspected weapons • sites. • The new tobacco documents (many of them marked "Secret"), released as part • of a lawsuit settlement, show a company strategy of attracting teenagers • through advertising and various youth-oriented promotions such as, according to • USAT , "NASCAR sponsorship," "inner city activities," and "T-shirts and • other paraphernalia." And says USAT , the documents show that RJR's • introduction of "Joe Camel" fits in to this strategy. • Note: assume blank lines separate paragraphs • Output format: • Paragraph 1: No. of sentences: 1 • Paragraph 2: No. of sentences: 3 • Paragraph 3: No. of sentences: 3 • etc.

Homework • Task 2 438/538 (15pts) • Modify your Perl program to produce xml paragraph and sentence boundary markup for Article247_499.txt • i.e. produces reformatted raw text as • <p> • <s>sentence 1</s> • <s>sentence 2</s> • </p> … • Each <s>..</s> should occupy exactly one line of your output. • Leading and trailing spaces of a sentence should be deleted, e.g. • <s> The new tobacco … • vs. <s>The new tobacco … • Submit your program and its output on Article247_499.txt (Cut and paste everything from both tasks into one file for submission)
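
The formatting requirements for Task 2 can be sketched separately from the sentence segmentation itself. The following assumes the sentences of one paragraph have already been found by some other means; markup_paragraph is a hypothetical helper name, and the sketch does not escape characters such as & or < that are special in xml.

```perl
use strict;
use warnings;

# Wrap already-segmented sentences of one paragraph in <p>/<s> markup.
# Each <s>..</s> ends up on exactly one line, with no leading or
# trailing spaces inside the tags.
sub markup_paragraph {
    my (@sentences) = @_;
    my $out = "<p>\n";
    for my $s (@sentences) {
        $s =~ s/\s+/ /g;     # join split lines: collapse whitespace runs
        $s =~ s/^\s+//;      # delete leading space
        $s =~ s/\s+$//;      # delete trailing space
        $out .= "<s>$s</s>\n";
    }
    return $out . "</p>\n";
}

print markup_paragraph(
    " The new tobacco\ndocuments show a strategy. ",
    "And says USAT , the documents show this.",
);
```

Collapsing whitespace before trimming means a sentence that was split across input lines is rejoined with single spaces.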
