Regular expressions { week 04}

The College of Saint Rose, CSC 460 / CIS 560 – Search and Information Retrieval, David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.

Presentation Transcript


  1. The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Regular expressions {week 04} from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

  2. Storage and retrieval • Computers store and retrieve information • Retrieval first requires finding information • Once we find the data, we often must extract what we need...

  3. Identifying traffic patterns • Weblogs record each and every access to the Web server • Use the data to answer questions • Which pages are the most popular? • How much spam is the site experiencing? • Are certain days/times busier than others? • Are there any missing pages (bad links)? • Where is the traffic coming from?

  4. Weblogs (not blogs!) • Apache records an access_log file: • 75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "GET /cis460/wordfreq.php HTTP/1.1" 200 566 • The fields are, in order: requesting IP (or host); username/password; access timestamp; HTTP request; server response code; size in bytes of data returned • (for server response codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
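The field layout on this slide can be captured with a single regular expression. The following is a minimal sketch, not the course's own code: the class name `AccessLogLine`, the method `parse`, and the exact pattern are assumptions made for illustration, and the pattern only covers entries shaped like the sample line above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccessLogLine {
    // Hypothetical pattern: one capture group per field, in the order listed on the slide.
    // host, two identity fields, [timestamp], "request", status code, size (or "-").
    static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    /** Returns the seven captured fields, or null if the line does not fit the pattern. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match, so fields start at 1
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] "
            + "\"GET /cis460/wordfreq.php HTTP/1.1\" 200 566";
        String[] f = parse(line);
        System.out.println("host:      " + f[0]);
        System.out.println("timestamp: " + f[3]);
        System.out.println("request:   " + f[4]);
        System.out.println("status:    " + f[5]);
        System.out.println("bytes:     " + f[6]);
    }
}
```

Returning null for a non-matching line keeps the sketch short; a real log processor would more likely count or report the lines it could not parse.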

  5. What do we do with the data? • We have many options for using the data • From the access_log, we can produce a summarized file (e.g. csv or tsv), a spreadsheet, a database, or a web site

  6. How do we process the data? • Regardless of what we do with the data, we must first parse or extract the data • We could write specific code to process the data and programmatically extract the desired information • Use regular expressions to simplify processing
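As one illustration of how a regular expression simplifies this processing, the sketch below counts requests per page, which addresses the earlier question of which pages are most popular. The class `PageHits`, the method `countHits`, and the pattern are hypothetical; only the first sample line comes from the slides, and the other two are made-up test data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PageHits {
    // Hypothetical pattern: captures the path inside the quoted HTTP request,
    // e.g. the "/cis460/wordfreq.php" in "GET /cis460/wordfreq.php HTTP/1.1".
    static final Pattern REQUEST = Pattern.compile("\"[A-Z]+ (\\S+) HTTP/[\\d.]+\"");

    /** Counts how many log lines request each path; lines without a request are skipped. */
    static Map<String, Integer> countHits(List<String> lines) {
        Map<String, Integer> hits = new TreeMap<>(); // TreeMap keeps paths sorted
        for (String line : lines) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()) { // find() looks for the request anywhere in the line
                hits.merge(m.group(1), 1, Integer::sum);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            // First line is the sample from the slides; the next two are invented.
            "75.194.143.61 - - [26/Sep/2011:22:38:12 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566",
            "192.0.2.10 - - [26/Sep/2011:22:39:01 -0400] \"GET /cis460/index.html HTTP/1.1\" 200 1024",
            "192.0.2.11 - - [26/Sep/2011:22:40:47 -0400] \"GET /cis460/wordfreq.php HTTP/1.1\" 200 566");
        System.out.println(countHits(lines));
    }
}
```

The resulting map could then be written out as a csv or tsv summary, or loaded into a spreadsheet or database.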

  7. Regular expressions (i) • A regular expression is an expression in a “mini language” designed specifically for textual pattern matching • Support for regular expressions is available in many languages, including Java, JavaScript, C, C++, PHP, etc.

  8. Regular expressions (ii) • A pattern contains numerous character groupings and is specified as a string • Patterns to match a phone number include: • [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] • [0-9]{3}-[0-9]{3}-[0-9]{4} • \d\d\d-\d\d\d-\d\d\d\d • \d{3}-\d{3}-\d{4} • (\d\d\d) \d\d\d-\d\d\d\d
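The first four patterns on this slide are different spellings of the same match, which a short Java check can confirm. This is a sketch under two assumptions: the dashes on the slide stand for ordinary ASCII hyphens, and the phone numbers used are made-up examples. The class `PhonePatterns` and its members are hypothetical names. Note that unescaped parentheses form a group rather than matching literal parentheses, so the last slide pattern matches `518 555-0123`; matching `(518) 555-0123` needs the escaped form shown in `WITH_PARENS`.

```java
class PhonePatterns {
    // The first four slide patterns, written as Java string literals
    // (each regex backslash is doubled inside the literal).
    static final String[] EQUIVALENT = {
        "[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
        "[0-9]{3}-[0-9]{3}-[0-9]{4}",
        "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d",
        "\\d{3}-\\d{3}-\\d{4}"
    };

    // The fifth slide pattern: the parentheses here only group the area code.
    static final String GROUPED = "(\\d\\d\\d) \\d\\d\\d-\\d\\d\\d\\d";

    // Escaped parentheses match the literal characters ( and ).
    static final String WITH_PARENS = "\\(\\d{3}\\) \\d{3}-\\d{4}";

    /** True only if every one of the four equivalent patterns matches the whole string. */
    static boolean allMatch(String s) {
        for (String p : EQUIVALENT) {
            if (!s.matches(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allMatch("518-555-0123"));                // true
        System.out.println(allMatch("518-555-012"));                 // false: too few digits
        System.out.println("518 555-0123".matches(GROUPED));         // true
        System.out.println("(518) 555-0123".matches(GROUPED));       // false
        System.out.println("(518) 555-0123".matches(WITH_PARENS));   // true
    }
}
```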

  9. Regular expressions (iii)

  10. Regular expressions (iv)

  11. Regular expressions in Java (i) • The String class in Java provides a pattern matching method called matches() • Unlike pattern matching in many other languages, matches() requires the pattern to match the entire string: String s = "Pattern matching in Java!"; String p = "\\w+\\s\\w+\\s\\w{2}\\s\\w+!"; if ( s.matches( p ) ) { System.out.println( "MATCH!" ); }
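The whole-string requirement is easy to see by comparing `String.matches()` with `Matcher.find()` from `java.util.regex`, which searches for the pattern anywhere in the input. The sketch below reuses the slide's string; the class `WholeVersusPartial` and its two helper methods are hypothetical names introduced for the comparison.

```java
import java.util.regex.Pattern;

class WholeVersusPartial {
    /** True only if the regex matches the entire string, as String.matches() requires. */
    static boolean wholeMatch(String s, String regex) {
        return s.matches(regex);
    }

    /** True if the regex matches any substring, using Matcher.find(). */
    static boolean partialMatch(String s, String regex) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String s = "Pattern matching in Java!";

        // "\\w+" matches the word "Pattern" but not the whole sentence.
        System.out.println(wholeMatch(s, "\\w+"));     // false
        System.out.println(partialMatch(s, "\\w+"));   // true

        // The slide's full pattern accounts for every character, so matches() succeeds.
        System.out.println(wholeMatch(s, "\\w+\\s\\w+\\s\\w{2}\\s\\w+!")); // true
    }
}
```

When only part of the input needs to match, `find()` avoids padding the pattern with `.*` on both sides.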

  12. Regular expressions in Java (ii) • Additional pattern-matching methods: • Use the replaceFirst() and replaceAll() methods to replace a pattern with a string: String s = "<title>Cool Web Site</title>"; String p = "</?\\w+>"; String result = s.replaceAll( p, "" );
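To make the difference between the two methods concrete, this sketch runs both on the slide's string. The pattern is written with a doubled backslash, since a single `\w` inside a Java string literal does not compile. The class name `TagStripper` and its methods are hypothetical.

```java
class TagStripper {
    // Matches an opening or closing tag such as <title> or </title>.
    static final String TAG = "</?\\w+>";

    /** Removes every tag in the string. */
    static String stripAll(String s) {
        return s.replaceAll(TAG, "");
    }

    /** Removes only the first tag in the string. */
    static String stripFirst(String s) {
        return s.replaceFirst(TAG, "");
    }

    public static void main(String[] args) {
        String s = "<title>Cool Web Site</title>";
        System.out.println(stripAll(s));   // Cool Web Site
        System.out.println(stripFirst(s)); // Cool Web Site</title>
    }
}
```

This simple pattern only handles tags without attributes; it is meant to illustrate the two methods, not to parse HTML in general.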

  13. Regular expressions in Java (iii) • Additional pattern-matching methods: • Use the split() method to split a string into an array of substrings: String s = "The Legend of Sleepy Hollow"; String[] words = s.split( "\\s+" );
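A short runnable version of the split() example follows, with one detail worth knowing: if the input starts with whitespace, `split("\\s+")` returns an empty string as its first element. The helper `words` below trims first to avoid that; the class `WordSplitter` and the helper are hypothetical additions, not part of the slides.

```java
import java.util.Arrays;

class WordSplitter {
    /** Splits on runs of whitespace, ignoring leading and trailing whitespace. */
    static String[] words(String s) {
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // splitting "" would yield one empty string
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String s = "The Legend of Sleepy Hollow";
        String[] words = s.split("\\s+");
        System.out.println(words.length);           // 5
        System.out.println(Arrays.toString(words)); // [The, Legend, of, Sleepy, Hollow]

        // Leading whitespace produces an empty first element unless the input is trimmed.
        System.out.println("  Sleepy Hollow".split("\\s+").length); // 3
        System.out.println(words("  Sleepy Hollow").length);        // 2
    }
}
```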
