
Strings and regular expressions Day 10

Presentation Transcript


  1. Strings and regular expressions, Day 10 • LING 681.02 Computational Linguistics • Harry Howard • Tulane University

  2. Course organization
     • http://www.tulane.edu/~ling/NLP/
     • NLTK is installed on the computers in this room!
     • How would you like to use the Provost's $150?
     • Please become a fan of Tulane Linguistics on Facebook.
     LING 681.02, Prof. Howard, Tulane University

  3. NLPP §3 Processing raw text §3.2 Strings: Text processing at the lowest level

  4. Syntax of single-line strings
     • Strings are specified with single quotes, or with double quotes if a single quote is one of the characters. Alternatively, a single quote inside a single-quoted string can be escaped with a backslash:
       'Monty Python'
       "Monty Python's Flying Circus"
       'Monty Python\'s Flying Circus'
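The quoting styles on this slide can be checked against one another. The following is a minimal sketch in Python 3 syntax (the slides themselves use Python 2); the variable names are illustrative and do not come from the slides.

```python
# Three ways to write string literals; the names are illustrative only.
plain = 'Monty Python'
double_quoted = "Monty Python's Flying Circus"
escaped = 'Monty Python\'s Flying Circus'

# The double-quoted and backslash-escaped forms denote the same string.
same = (double_quoted == escaped)
print(same)
```

Either form works; double quotes are usually easier to read when the text contains an apostrophe.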

  5. Syntax of multi-line strings
     • A sequence of strings can be joined into a single one with …
     • a backslash at the end of each line:
       'first half'\
       'second half'
       = 'first halfsecond half'
     • parentheses to open and close the sequence:
       ('first half'
       'second half')
       = 'first halfsecond half'
     • triple double quotes to open and close the sequence and maintain line breaks:
       """first half
       second half"""
       = 'first half\nsecond half'
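The three joining techniques can be compared side by side. This is a small sketch in Python 3 syntax; the variable names are made up for illustration.

```python
# Adjacent string literals are joined into one string.
with_backslash = 'first half'\
    'second half'

with_parens = ('first half'
               'second half')

# Triple quotes keep the line break as a newline character.
with_triple = """first half
second half"""

print(repr(with_backslash))
print(repr(with_parens))
print(repr(with_triple))
```

Note that neither of the first two forms inserts a space between the halves; only the triple-quoted form records the line break.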

  6. Basic operations
     • Concatenation (+)
       >>> 'really' + 'really'
       'reallyreally'
     • Repetition (*)
       >>> 'really' * 4
       'reallyreallyreallyreally'

  7. Your Turn p. 88 !!!

  8. Printing strings
     • Make a couple of string assignments:
       harry = 'Harry Potter'
       prince = 'Half-Blood Prince'
     • Inspecting a variable produces Python's representation of its value:
       >>> harry
       'Harry Potter'
     • Printing a variable produces its value:
       >>> print harry
       Harry Potter
     • What do you expect?
       >>> print harry + prince
       >>> print harry, prince
       >>> print harry, 'and the', prince

  9. Using indices
     • Every character of a string is indexed from 0 (counting from the start) and from -1 (counting from the end):
       >>> harry[0]
       'H'
       >>> harry[-1]
       'r'
       >>> harry[:3]
       'Har'
       >>> harry[-12:-9]
       'Har'
       >>> for char in prince:
       ...     print char,
       H a l f - B l o o d P r i n c e
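Positive and negative indices address the same positions from opposite ends, and a slice stops just before its end index. The sketch below, in Python 3 syntax, illustrates this with the string from the slide; the variable names other than harry are illustrative.

```python
harry = 'Harry Potter'

first = harry[0]             # index 0 is the first character
last = harry[-1]             # index -1 is the last character
prefix = harry[:3]           # up to, but not including, index 3
same_prefix = harry[-12:-9]  # the same slice, counted from the end

# A negative index -k names the same position as len(harry) - k.
equivalent = harry[-12] == harry[len(harry) - 12]
print(first, last, prefix, same_prefix, equivalent)
```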

  10. More string operations
      • See Table 3-2

  11. Strings vs. lists
      • Both are sequences, so both support joining by concatenation and separation by slicing.
      • But they are different types, so a string cannot be concatenated with a list.
      • Granularity
      • Strings have a single level of resolution, the individual character → good for writing to screen or file.
      • Lists can have any level of resolution we want: character, morpheme, word, phrase, sentence, paragraph → good for NLP.
      • So the second step in the NLP pipeline is to tokenize a string into a list.
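The contrast between the two sequence types can be sketched as follows, in Python 3 syntax. Splitting on whitespace with str.split() is only a stand-in for a real tokenizer here, and the example sentence and variable names are invented for illustration.

```python
sentence = 'the cat sat'
words = sentence.split()   # crude whitespace tokenization: a list of words

# Concatenation works within one type.
longer_string = sentence + ' down'
longer_list = words + ['down']

# Mixing a string with a list raises a TypeError.
try:
    sentence + words
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(words, longer_list, mixed_ok)
```

The string indexes individual characters (sentence[0] is 't'), while the list indexes whole words (words[0] is 'the'), which is the difference in granularity described above.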

  12. NLPP §3 Processing raw text §3.3 Text processing with Unicode

  13. Unicode
      • The format for representing special characters that go beyond ASCII
      • Let's skip this until we really need it.

  14. NLPP §3 Processing raw text §3.4 Regular expressions for detecting word formats

  15. Getting started
      • To use regular expressions in Python, we need to import the re library.
      • We also need a list of words to search.
      • We'll use the Words Corpus again (Section 2.4).
      • We will preprocess it to remove any proper names.
        >>> import re
        >>> import nltk
        >>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

  16. Different terminologies
      • In the textbook, regex = «ed$»
      • In re, regex = 'ed$' (i.e. a string)

  17. Searching
      • re.search(p, s)
      • p is a pattern, i.e. what we are looking for, and
      • s is a candidate string for matching the pattern.
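It may help to see what re.search hands back. The sketch below, in Python 3 syntax, uses two made-up candidate strings: the call returns a match object when the pattern occurs somewhere in the string and None otherwise, which is why its result can serve as the condition in a list comprehension.

```python
import re

# A match object on success, None on failure.
hit = re.search('ed$', 'walked')
miss = re.search('ed$', 'walking')

print(hit is not None)
print(miss is None)
print(hit.group())   # the text that matched the pattern
```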

  18. Some examples
      • Find words ending in -ed:
        >>> [w for w in wordlist if re.search('ed$', w)]
      • Find a word that fits a crossword-puzzle slot that is 8 letters long, with j as the 3rd letter and t as the 6th letter:
        >>> [w for w in wordlist if re.search('^..j..t..$', w)]
      • Find the strings email or e-mail:
        >>> [w for w in wordlist if re.search('^e-?mail$', w)]
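The same three searches can be tried without the corpus. In this sketch (Python 3 syntax) the short wordlist is a hypothetical stand-in for the filtered Words Corpus, chosen so that each pattern has both matches and non-matches.

```python
import re

# Hypothetical stand-in for the filtered Words Corpus.
wordlist = ['abjectly', 'adjuster', 'walked', 'walking',
            'email', 'e-mail', 'emails', 'red']

# '$' anchors the pattern at the end of the word.
ends_in_ed = [w for w in wordlist if re.search('ed$', w)]

# '^' and '$' pin the length to 8; each '.' stands for any one character.
crossword = [w for w in wordlist if re.search('^..j..t..$', w)]

# '-?' makes the hyphen optional.
email_forms = [w for w in wordlist if re.search('^e-?mail$', w)]

print(ends_in_ed)
print(crossword)
print(email_forms)
```

Without the '^' and '$' anchors, the last pattern would also accept 'emails', since re.search looks for the pattern anywhere in the string.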

  19. Next time More on RegEx
