Introduction to Classification

Introduction to Classification. Shallow Processing Techniques for NLP Ling570 November 9, 2011. Roadmap. Classification problems: Definition Solutions Case studies. Based on slides by F. Xia. Example: Text Classification. Task: Given an article, predict its category Categories:.

Presentation Transcript


  1. Introduction to Classification • Shallow Processing Techniques for NLP • Ling570 • November 9, 2011

  2. Roadmap • Classification problems: • Definition • Solutions • Case studies • (Based on slides by F. Xia)

  5. Example: Text Classification • Task: • Given an article, predict its category • Categories: • Sports, entertainment, news, weather, ... • Spam/not spam • What kind of information is useful for this task?

  9. Classification Task • Task: • C is a finite set of labels (aka categories, classes) • Given x, determine its category y in C • Instance: (x,y) • x: thing to be labeled/classified • y: label/class • Data: set of instances • labeled data: y is known • unlabeled data: y is unknown • Training data, test data
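
The terminology on this slide can be sketched as a small data structure. This is only an illustration: the names `Instance`, `LABELS`, and `split`, and the toy messages, are assumptions that do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional

# C: a finite set of labels (hypothetical example set).
LABELS = {"spam", "not_spam"}

@dataclass
class Instance:
    x: str                   # thing to be labeled/classified
    y: Optional[str] = None  # label in C, or None if unknown

    def is_labeled(self) -> bool:
        return self.y is not None

# Data: a set of instances (toy examples, invented for illustration).
data = [
    Instance("Apply for loan at 3% interest", "spam"),
    Instance("Reminder: please vote by Monday", "not_spam"),
    Instance("Claim your compensation fund", "spam"),
    Instance("Meeting moved to 3pm"),  # unlabeled: y is unknown
]

labeled = [inst for inst in data if inst.is_labeled()]
unlabeled = [inst for inst in data if not inst.is_labeled()]

def split(instances, n_test):
    """Hold out the last n_test labeled instances as test data."""
    return instances[:-n_test], instances[-n_test:]

train, test = split(labeled, 1)
print(len(train), "training instances,", len(test), "test instance")
```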

  10. Text Classification Examples • Spam filtering • Call routing • Sentiment classification • Positive/Negative • Score: 1 to 5

  14. POS Tagging • Task: Given a sentence, predict tag of each word • Is this a classification problem? • Categories: N, V, Adj,… • What information is useful? • How do POS tagging, text classification differ? • Sequence labeling problem

  18. Word Segmentation • Task: Given a string, break into words • Categories: • B(reak), NB (no break) • B(eginning), I(nside), E(nd) • e.g. c1 c2 || c3 c4 c5 • c1/NB c2/B c3/NB c4/NB c5/B • c1/B c2/E c3/B c4/I c5/E • What type of task? • Also sequence labeling
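
The two labeling schemes on this slide can be reproduced with a short sketch. The function names are hypothetical, and the treatment of one-character words in the B/I/E scheme is an assumption, since the slide only shows words of two or more characters.

```python
def break_labels(words):
    """Label each character B (a break follows it) or NB (no break follows)."""
    labels = []
    for word in words:
        labels += ["NB"] * (len(word) - 1) + ["B"]
    return labels

def bie_labels(words):
    """Label each character B(eginning), I(nside), or E(nd) of its word."""
    labels = []
    for word in words:
        if len(word) == 1:
            # Assumption: the slides do not cover one-character words.
            labels.append("B")
        else:
            labels += ["B"] + ["I"] * (len(word) - 2) + ["E"]
    return labels

# The slide's example: c1 c2 || c3 c4 c5
words = [["c1", "c2"], ["c3", "c4", "c5"]]
chars = [c for word in words for c in word]

print(" ".join(f"{c}/{t}" for c, t in zip(chars, break_labels(words))))
print(" ".join(f"{c}/{t}" for c, t in zip(chars, bie_labels(words))))
```

Either way, segmentation becomes a per-character labeling decision, which is why it is a sequence labeling task.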

  19. Solving a Classification Problem

  22. Two Stages • Training: • Learner: training data → classifier • Testing: • Decoder: test data + classifier → classification output • Also • Preprocessing • Postprocessing • Evaluation
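
A minimal sketch of the two stages follows. The learner here is a stand-in that is not from the slides: it simply predicts the most frequent training label. The names `learn`, `decode`, and `accuracy` and the toy data are likewise assumptions.

```python
from collections import Counter

def learn(training_data):
    """Learner: training data -> classifier.

    Stand-in learner (not from the slides): the returned classifier
    always predicts the most frequent label seen in training.
    """
    majority = Counter(y for _, y in training_data).most_common(1)[0][0]
    return lambda x: majority

def decode(test_data, classifier):
    """Decoder: test data + classifier -> classification output."""
    return [classifier(x) for x in test_data]

def accuracy(predicted, gold):
    """Evaluation: fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy data, invented for illustration.
training_data = [("win money now", "spam"), ("urgent payment", "spam"),
                 ("vote by Monday", "not_spam")]
test_x = ["claim your fund", "meeting at noon"]
test_y = ["spam", "not_spam"]

classifier = learn(training_data)      # training stage
output = decode(test_x, classifier)    # testing stage
print(output, accuracy(output, test_y))
```

Any real learner could replace `learn` without changing the decoder or the evaluation, which is the point of separating the stages.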

  25. Representing Input • Potentially infinite values to represent • Represent input as feature vector • x=<v1,v2,v3,…,vn> • x=<f1=v1,f2=v2,…,fn=vn> • What are good features?

  26. Example I • Spam Tagging • Classes: Spam/Not Spam • Input: • Email messages

  27. Doc1 • Western Union Money Transfer office29@yahoo.com.ph One Bishops Square Akpakpa E1 6AO, Cotonou Benin Republic Website: http://www.westernunion.com/ info/selectCountry.asP Phone: +229 99388639 Attention Beneficiary, This to inform you that the federal ministry of finance Benin Republic has started releasing scam victim compensation fund mandated by United Nation Organization through our office. I am contacting you because our agent have sent you the first payment of $5,000 for your compensation funds total amount of $500 000 USD (Five hundred thousand united state dollar) We need your urgent response so that we shall release your payment information to you. You can call our office hot line for urgent attention(+22999388639)

  28. Doc2 • Hello! my dear. How are you today and your family? I hope all is good,kindly pay Attention and understand my aim of communicating you today through this Letter, My names is Saif al-Islam al-Gaddafi the Son of former Libyan President. i was born on 1972 in Tripoli Libya,By Gaddafi’s second wive. I want you to help me clear this fund in your name which i deposited in Europe please i would like this money to be transferred into your account before they find it.the amount is 20.300,000 million GBP British Pounds sterling through a

  29. Doc3 • from: web.25.5.office@att.net • Apply for loan at 3% interest Rate..Contact us for details.

  30. Doc4 • from: acl@aclweb.org • REMINDER: If you have not received a PIN number to vote in the elections and have not already contacted us, please contact either Drago Radev (radev@umich.edu) or Priscilla Rasmussen (acl@aclweb.org) right away. Everyone who has not received a pin but who has contacted us already will get a new pin over the weekend. Anyone who still wants to join for 2011 needs to do this by Monday (November 7th) in order to be eligible to vote. And, if you do have your PIN number and have not voted yet, remember every vote counts!

  31. What are good features?

  37. Possible Features • Words! • Feature for each word • Binary: presence/absence • Integer: occurrence count • Particular word types: money/sex/: [Vv].*gr.* • Errors: • Spelling, grammar • Images • Header info
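
The word-based features listed above could be extracted roughly as follows. This is a sketch under stated assumptions: whitespace tokenization, lowercasing, and the feature names `has(...)`, `count(...)`, and `type(v-gr)` are all illustrative choices, and only the pattern `[Vv].*gr.*` comes from the slide.

```python
import re
from collections import Counter

# Pattern taken from the slide; it matches tokens such as "viagra" or "v1agra".
WORD_TYPE = re.compile(r"[Vv].*gr.*")

def extract_features(text):
    """Map a message body to a feature dict x = <f1=v1, ..., fn=vn>."""
    # Assumption: lowercase the text and split on whitespace.
    tokens = re.findall(r"\S+", text.lower())
    counts = Counter(tokens)
    features = {}
    for word, n in counts.items():
        features[f"has({word})"] = 1    # binary: presence (absent words are implicitly 0)
        features[f"count({word})"] = n  # integer: occurrence count
    # Word-type feature: 1 if any token matches the pattern, else 0.
    features["type(v-gr)"] = int(any(WORD_TYPE.fullmatch(t) for t in tokens))
    return features

x = extract_features("Cheap v1agra cheap loans")
print(x)
```

Features for spelling errors, images, or header information would need other inputs (a dictionary, the raw MIME message) and are not covered by this sketch.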

  38. Representing Input: Attribute-Value Matrix
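
The matrices shown on the original slides did not survive in this transcript. As a rough illustration of the idea, the sketch below builds an attribute-value matrix with one row per instance, one column per feature (here, word counts), and a final class column. The documents and the function name `build_avm` are invented for illustration.

```python
def build_avm(docs):
    """Return (feature_names, rows); each row is [value per feature..., label]."""
    tokenized = [(text.lower().split(), label) for text, label in docs]
    # Columns: every word seen in any document, in sorted order.
    feature_names = sorted({tok for toks, _ in tokenized for tok in toks})
    rows = []
    for toks, label in tokenized:
        rows.append([toks.count(f) for f in feature_names] + [label])
    return feature_names, rows

docs = [  # toy documents, invented for illustration
    ("money transfer money", "spam"),
    ("loan interest", "spam"),
    ("vote reminder", "not_spam"),
]

features, rows = build_avm(docs)
print("\t".join(features + ["class"]))
for row in rows:
    print("\t".join(str(v) for v in row))
```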

  45. Classifier • Result of training on input data • With or without class labels • Formal perspective: • f(x) = y: x is input; y in C • More generally: • f(x) = {(c_i, score_i)}, where • x is input, • c_i in C, • score_i is the score for category assignment
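
Both views of a classifier can be sketched with a toy scoring function. The weights below are hypothetical and hand-written; a real classifier would obtain them from training. The names `WEIGHTS`, `score`, and `classify` are assumptions.

```python
# Hypothetical per-class word weights; a real classifier would learn these.
WEIGHTS = {
    "spam": {"money": 2.0, "loan": 1.5, "urgent": 1.0},
    "not_spam": {"vote": 2.0, "reminder": 1.5},
}

def score(x):
    """General view: f(x) = {(c_i, score_i)}, one score per category c_i in C."""
    tokens = x.lower().split()
    return {c: sum(w.get(t, 0.0) for t in tokens) for c, w in WEIGHTS.items()}

def classify(x):
    """Formal view: f(x) = y, here the single highest-scoring category."""
    scores = score(x)
    return max(scores, key=scores.get)

print(score("urgent money transfer"))
print(classify("urgent money transfer"))
```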

  47. Testing • Input: • Test data: • e.g. AVM • Classifier • Output: • Decision matrix • Can assign highest scoring class to each input
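
The decision matrix from the original slide is not preserved here. The sketch below assumes one plausible layout, a score for every (input, class) pair, and shows the final step of assigning the highest-scoring class to each input. All numbers and names are invented for illustration.

```python
# Hypothetical decision matrix: one entry per test input,
# holding one score per class. The numbers are invented.
decision_matrix = {
    "x1": {"c1": 0.1, "c2": 0.7, "c3": 0.2},
    "x2": {"c1": 0.5, "c2": 0.3, "c3": 0.2},
    "x3": {"c1": 0.2, "c2": 0.2, "c3": 0.6},
}

def assign_labels(matrix):
    """Assign the highest-scoring class to each input."""
    return {x: max(scores, key=scores.get) for x, scores in matrix.items()}

print(assign_labels(decision_matrix))
```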

  50. Evaluation • Confusion matrix: • Precision: TP/(TP+FP)
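
The precision formula can be checked on a small example. The gold and predicted labels below are invented, the function names are hypothetical, and returning 0.0 when nothing is predicted positive is a convention assumed here, not something the slide specifies.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))

def precision(matrix, positive):
    """Precision = TP / (TP + FP) for the class treated as positive."""
    tp = matrix[(positive, positive)]
    fp = sum(n for (g, p), n in matrix.items() if p == positive and g != positive)
    # Assumed convention: precision is 0.0 if nothing was predicted positive.
    return tp / (tp + fp) if tp + fp else 0.0

# Toy labels, invented for illustration.
gold      = ["spam", "spam", "not_spam", "not_spam", "spam", "not_spam"]
predicted = ["spam", "not_spam", "spam", "not_spam", "spam", "not_spam"]

cm = confusion_matrix(gold, predicted)
print(precision(cm, "spam"))
```

In this example two messages are correctly predicted as spam (TP = 2) and one non-spam message is wrongly predicted as spam (FP = 1), so precision is 2/3.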
