
Leveraging Regular Expressions for Text Manipulation and Email Filtering

This presentation introduces regular expressions as a way to check untrustworthy input and to describe recurring patterns in data. It starts from command-line wildcards, lists applications such as web searches, email filtering, text manipulation, and Perl, and then shows how to use regular expressions in Java through java.util.regex. Character classes, quantifiers, anchors, and alternation are illustrated with email subject lines and simple string matching.


Presentation Transcript


  1. Regular expressions CS201 Fall 2004 Week 11

  2. Problem • input is very untrustworthy • stack smashing, for example • lots of data display patterns • can we combine these two insights? • yes: regular expressions

  3. Example • command line: dir *.java • Boo.java Fred.java PainfulClass.java • Displays all the Java programs in the directory • * - Kleene closure
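The `*` in `dir *.java` is a shell wildcard that stands for any run of characters. In a regular expression the star repeats the element before it, so the closest regex equivalent is `.*\.java`. The following is a minimal sketch in Java, the language the later slides use; the class name `GlobVersusRegex`, the helper `isJavaFile`, and the file name `notes.txt` are hypothetical and not part of the slides.

```java
import java.util.regex.Pattern;

class GlobVersusRegex {
    // Regex counterpart of the wildcard *.java:
    // any characters (.*) followed by a literal ".java" (\\. escapes the dot).
    private static final Pattern JAVA_FILE = Pattern.compile(".*\\.java");

    // Hypothetical helper: true when the whole file name fits the pattern.
    static boolean isJavaFile(String fileName) {
        return JAVA_FILE.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"Boo.java", "Fred.java", "PainfulClass.java", "notes.txt"};
        for (String name : names) {
            if (isJavaFile(name)) {
                System.out.println(name); // prints the three .java names only
            }
        }
    }
}
```

`matches()` requires the entire name to fit the pattern, which mirrors how the wildcard is applied to whole file names.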

  4. RE and pattern matching • Web searches • email filtering • text-manipulation (Word) • Perl

  5. How do we use it? • import java.util.regex.*; • specify a pattern • compile it • match • iterate

  6. Specifying Patterns • strings: "To: cwm2n@spamgourmet.com" • can match case exactly • or match case-insensitively • Range • [01234567] – any symbol inside the [] • [0-9] • [^j] – caret means "anything BUT j" • one symbol: • . – period matches any character • \\d – a digit, i.e. [0-9] • \\D – a non-digit, i.e. [^0-9] • \\w – a word character, i.e. [a-zA-Z_0-9]
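The pattern elements on this slide can be tried one at a time with `Pattern.matches`, which tests whether a whole input fits a pattern. This is a small sketch, not code from the course; the class name `CharacterClassDemo` and the helper `fits` are hypothetical. The doubled backslash in `\\d` is there because a Java string literal needs `\\` to produce the single backslash the regex engine sees.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: does the whole input fit the regex?
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Case-insensitive match of the address line from the slide.
    static boolean isToLine(String line) {
        Pattern p = Pattern.compile("to: cwm2n@spamgourmet\\.com",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(fits("[01234567]", "5")); // true: 5 is listed in the brackets
        System.out.println(fits("[01234567]", "8")); // false: 8 is not listed
        System.out.println(fits("[0-9]", "8"));      // true: 8 is in the range
        System.out.println(fits("[^j]", "j"));       // false: anything BUT j
        System.out.println(fits(".", "x"));          // true: any single character
        System.out.println(fits("\\d", "7"));        // true: a digit
        System.out.println(fits("\\D", "7"));        // false: 7 is a digit
        System.out.println(fits("\\w", "_"));        // true: underscore is a word character
        System.out.println(isToLine("To: cwm2n@spamgourmet.com")); // true
    }
}
```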

  7. Patterns • quantifier- how many times • * - any number of times (including zero) • .* • ? – zero or one time • A? - A zero or one time • + one or more times • A+ - must find at least one A • others (p. 476)
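The three quantifiers differ only in how many repetitions they accept, which is easiest to see on very short inputs. The sketch below is illustrative only; the class name `QuantifierDemo` and the helper `fits` are hypothetical.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: does the whole input fit the regex?
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // * : any number of times, including zero
        System.out.println(fits("A*", ""));    // true
        System.out.println(fits("A*", "AAA")); // true

        // ? : zero or one time
        System.out.println(fits("A?", ""));    // true
        System.out.println(fits("A?", "A"));   // true
        System.out.println(fits("A?", "AA"));  // false: two As is one too many

        // + : one or more times
        System.out.println(fits("A+", ""));    // false: must find at least one A
        System.out.println(fits("A+", "AAA")); // true

        // .* : any character, any number of times
        System.out.println(fits(".*", "anything at all")); // true
    }
}
```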

  8. examples • find subject line of email • "Subject: .*" • finds: Subject: weather • finds: Subject: [POSSIBLE SPAM] get a degree! • Problem • also finds • How to be a British Subject: marry into the Royal
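The problem on this slide can be reproduced directly: an unanchored `"Subject: .*"` is found anywhere in a line, not only at its start. The following sketch assumes a two-line sample text built from the slide's examples; the class name `UnanchoredSubjectDemo` and the helper `findAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnanchoredSubjectDemo {
    // Hypothetical helper: collect every substring of text that the regex finds.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Subject: weather\n"
                    + "How to be a British Subject: marry into the Royal\n";
        // By default '.' does not match a line break, so each hit stops at end of line.
        for (String hit : findAll("Subject: .*", text)) {
            System.out.println(hit);
        }
        // prints:
        // Subject: weather
        // Subject: marry into the Royal
    }
}
```

The second hit comes from the middle of a body line, which is the false positive the slide describes.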

  9. Anchors • tell us where to find what we are looking for • ^ - beginning of line • ^Subject: .* • $ - end of line • com$ • others on page 478
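Applied to one line at a time, the two anchors behave as follows: `^Subject: .*` accepts only lines that begin with the header, and `com$` accepts only lines that end in "com". This is a minimal sketch; the class name `AnchorDemo`, both helper methods, and the sample word "comfort" are hypothetical.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    private static final Pattern STARTS_WITH_SUBJECT = Pattern.compile("^Subject: .*");
    private static final Pattern ENDS_WITH_COM = Pattern.compile("com$");

    // True only when "Subject: " sits at the very beginning of the line.
    static boolean startsWithSubject(String line) {
        return STARTS_WITH_SUBJECT.matcher(line).find();
    }

    // True only when "com" sits at the very end of the line.
    static boolean endsWithCom(String line) {
        return ENDS_WITH_COM.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithSubject("Subject: weather")); // true
        System.out.println(startsWithSubject(
            "How to be a British Subject: marry into the Royal")); // false
        System.out.println(endsWithCom("To: cwm2n@spamgourmet.com")); // true
        System.out.println(endsWithCom("comfort"));                   // false
    }
}
```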

  10. Alternation • subject line contains either SPAM or Rolex • ^Subject:.*(SPAM.*|Rolex.*)
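A sketch of the alternation in Java follows. The pattern is written without spaces around `|`, because by default a space inside a Java pattern is a literal character that must appear in the input. The class name `AlternationDemo`, the helper `isSuspect`, and the Rolex sample line are hypothetical.

```java
import java.util.regex.Pattern;

class AlternationDemo {
    // Subject line that contains either "SPAM" or "Rolex" somewhere after the header.
    private static final Pattern SUSPECT =
        Pattern.compile("^Subject:.*(SPAM.*|Rolex.*)");

    // Hypothetical helper: true when the line looks like unwanted mail.
    static boolean isSuspect(String line) {
        return SUSPECT.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isSuspect("Subject: [POSSIBLE SPAM] get a degree!")); // true
        System.out.println(isSuspect("Subject: cheap Rolex watches"));           // true
        System.out.println(isSuspect("Subject: weather"));                       // false
    }
}
```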

  11. How to use it, really • Form a pattern • Pattern p = Pattern.compile("^Subject: .*"); • Create a Matcher • Matcher m = p.matcher(someBuffer); • iterate • while(m.find()) System.out.println("Found text: "+m.group()); • find() – boolean, true if a next occurrence was found • group() – the String that matched
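The three steps can be put together as the sketch below. It assumes `someBuffer` holds a whole message with several lines, which the slide does not state. Under that assumption the pattern needs the `Pattern.MULTILINE` flag, because in Java `^` otherwise matches only at the start of the entire input. The class name `SubjectScanner` and the helper `findSubjects` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectScanner {
    // Hypothetical helper: return every line of the buffer that starts with "Subject: ".
    static List<String> findSubjects(String someBuffer) {
        // Form a pattern; MULTILINE lets ^ match at the start of each line.
        Pattern p = Pattern.compile("^Subject: .*", Pattern.MULTILINE);
        // Create a Matcher over the buffer.
        Matcher m = p.matcher(someBuffer);
        // Iterate: find() advances to the next match, group() returns its text.
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String someBuffer = "To: cwm2n@spamgourmet.com\n"
                          + "Subject: weather\n"
                          + "How to be a British Subject: marry into the Royal\n";
        for (String subject : findSubjects(someBuffer)) {
            System.out.println("Found text: " + subject);
        }
        // prints: Found text: Subject: weather
    }
}
```

Without the flag, this buffer would produce no matches at all, since it begins with the "To:" line.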

  12. example

      package edu.virginia.cs.cs201.fall04;

      import java.util.regex.*;

      public class Tryout {
          String text = "A horse is a horse, of course of course..";
          String pattern = "horse|course";   // alternation: either word

          public static void main(String args[]) {
              Tryout t = new Tryout();
              t.go();
          }

          public void go() {
              Pattern p = Pattern.compile(pattern);
              Matcher m = p.matcher(text);
              // print each match followed by the index where it starts
              while (m.find()) {
                  System.out.println(m.group() + m.start());
              }
          }
      }

      Output:
      horse2
      horse13
      course23
      course33
