
CL and Social Media


Presentation Transcript


  1. CL and Social Media LING 575 Fei Xia Week 2: 01/11/2011

  2. Outline • A few announcements • Personal vs. Business email • Email zone classification • Deception detection • Hw2 • Hw1: quick update from the students

  3. A few announcements

  4. Databases on Patas • Three MySQL databases on patas/capuchin • enron: the ISI database • Seems to have many more senders • The tables are slightly different from the paper • berkeley_enron: the database from Berkeley • zonerelease: email zone annotation • To query the databases: • userid: enronmail • password: askwhy

  5. Databases on Patas (cont) • mysql -u enronmail -p -h capuchin • enter your password (“askwhy”) • use database_name; • show tables; • select * from table_name limit 5; • mysql API for Perl and other languages
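A minimal sketch of the same queries issued from Python rather than the mysql client, assuming the pymysql package is installed; the slide only mentions a Perl API, so the library choice and the placeholder table name are illustrative:

    import pymysql

    # Connect to one of the three databases on capuchin with the course account.
    conn = pymysql.connect(host="capuchin", user="enronmail",
                           password="askwhy", database="enron")
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW TABLES")
            print(cur.fetchall())
            # Peek at a few rows; replace some_table with an actual table name.
            cur.execute("SELECT * FROM some_table LIMIT 5")
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()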

  6. Recent workshops on social media • NAACL 2010 workshop: http://www.aclweb.org/anthology-new/W/W10/W10-05.pdf • ACL 2011 workshop: (due date is 4/1) http://research.microsoft.com/en-us/events/lsm2011/default.aspx • International conference on Weblogs and Social Media: in conjunction with IJCAI-2011 (due date is 1/31) http://www.icwsm.org/2011/cfp.php

  7. Personal vs. business emails

  8. Task • Determine whether an email is personal or business • (Jabbari et al., 2006) • Manual annotation • Inter-annotator agreement • Automatic classification

  9. Annotated data • Available at http://staffwww.dcs.shef.ac.uk/people/L.Guthrie/nlp/research.htm • Stored on patas under $data_dir/personal_vs_business/ • Size: • 12,500 emails • 83% business, 17% personal • Mismatch between the paper and the data

  10. Class labels • Business: • core business, routine admin, inter-employee relations, soliciting, image, keeping_current • Personal: • close personal, personal maintenance, personal circulation

  11. Inter-annotator agreement • 2,200 emails are double annotated: • 6% disagreement • 82% are labeled as “business” by both • 12% are labeled as “personal” by both • disagreements: about 130 emails • 25% for subscription • 18% for travel arrangement • 13% for colleague meetings • 8% for services provided to Enron employees • Questions: • What do annotators see: the email only, or the whole thread? Do they look only at the email body, or also at the “To” field?
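The slide reports only raw agreement; as a rough illustration (not a number from the paper), a chance-corrected Cohen's kappa can be computed from these figures, under the simplifying assumption that the 6% disagreements split evenly between the two label directions:

    # Back-of-the-envelope Cohen's kappa from the slide's 82% / 12% / 6% figures,
    # assuming the disagreements split evenly between the two annotators' labels.
    p_both_business, p_both_personal, p_disagree = 0.82, 0.12, 0.06
    p_observed = p_both_business + p_both_personal        # 0.94 raw agreement
    p_business = p_both_business + p_disagree / 2         # each annotator's marginal
    p_personal = p_both_personal + p_disagree / 2
    p_expected = p_business ** 2 + p_personal ** 2        # agreement expected by chance
    kappa = (p_observed - p_expected) / (1 - p_expected)
    print(round(kappa, 2))                                # ~0.76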

  12. Automatic classification • Classification algorithm: (Guthrie and Walker, 1994) • Data: • 4,000 messages on “core business” • 1,000 messages on “close personal” • Results: 0.93 (system accuracy) vs. 0.94 (inter-annotator agreement)

  13. (Guthrie and Walker, 1994): Algorithm for text classification • Let T1, T2, …, Tk be the class labels. • Assumption: a test document with class label Ti has a “word” distribution similar to that of the union of the training documents labeled Ti. • Training: • partition the set of words into W1, W2, …, Wm • for each Ti: • “merge” the training documents whose class label is Ti • calculate pij (the probability that a word in the merged Ti documents belongs to Wj) for each Wj • Ex: |T|=2, |W|=3, pij is (0.1, 0.05, 0.85) for T1 and (0.01, 0.2, 0.79) for T2 • Testing: • let nj be the number of words in the test document that belong to Wj • Ex: the frequencies are (10, 200, 8900) • choose the Ti that maximizes the multinomial likelihood ∏j pij^nj, i.e., Σj nj log pij
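A small sketch of the decision rule using the example numbers above; the original slide's formula did not survive the transcript, so reading the objective as the multinomial (log-)likelihood Σj nj log pij is a reconstruction:

    import math

    # Class-conditional probabilities over the word sets W1..W3 (slide's example).
    p = {
        "T1": [0.10, 0.05, 0.85],
        "T2": [0.01, 0.20, 0.79],
    }
    # Frequencies of test-document words falling into W1..W3 (slide's example).
    n = [10, 200, 8900]

    def score(p_i, counts):
        # Multinomial log-likelihood up to a constant: sum_j n_j * log(p_ij)
        return sum(n_j * math.log(p_ij) for n_j, p_ij in zip(counts, p_i))

    best = max(p, key=lambda t: score(p[t], n))
    print({t: round(score(p[t], n), 1) for t in p}, "->", best)   # T1 wins here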

  14. (Guthrie and Walker, 1994): Experiments • Two class labels: T1 and T2 • Three word sets: W1, W2, and W3 • W1 includes the top 300 most frequent words in Docs(T1) that are not among the top 500 most frequent words in Docs(T2). • W2 includes the top 300 most frequent words in Docs(T2) that are not among the top 500 most frequent words in Docs(T1). • W3 includes the rest of the words • Accuracy: 100%
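A sketch of how the three word sets could be built from tokenized training documents, following the literal reading of the definitions above (whether the 300-word cut is applied before or after the exclusion is not specified; here it is applied before):

    from collections import Counter

    def top_words(docs, k):
        """Return the k most frequent word types over a list of tokenized documents."""
        counts = Counter(w for doc in docs for w in doc)
        return [w for w, _ in counts.most_common(k)]

    def build_word_sets(docs_t1, docs_t2):
        top500_t1 = set(top_words(docs_t1, 500))
        top500_t2 = set(top_words(docs_t2, 500))
        # W1: top 300 of Docs(T1) minus the top 500 of Docs(T2); W2 is symmetric.
        w1 = {w for w in top_words(docs_t1, 300) if w not in top500_t2}
        w2 = {w for w in top_words(docs_t2, 300) if w not in top500_t1}
        return w1, w2   # W3 is implicitly all remaining words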

  15. Issues • Using word features: the words in a business email could vary a lot depending on what the business is. • Other important cues: • the relation between the sender and the recipient • Do they work in the same company? • What is the path between them in the company’s reporting chain? • Are they friends? • other emails in the same thread • the nature of the sender’s/recipient’s/company’s work and the words in the emails (e.g., “stock”, “parent meeting”) • … • Other ideas?

  16. Email zoning

  17. Email zone classification • Task: given a message, break it into zones (e.g., header, greeting, body, disclaimer, etc.) • Today’s paper: Andrew Lampert, Robert Dale, and Cecile Paris, 2009. Segmenting Email Message Text into Zones. In Proc. of EMNLP-2009 • Data: • Available at http://zebra.thoughtlets.org/ • Stored on patas under $data_dir/email_zoning_dataset/EmailZoneData/ • Stored on capuchin as a MySQL database called “zonerelease”

  18. Email zones in (Estival et al., 2007) • Five categories: • Author text • Signature • Advertisement (automatically appended ones) • Quoted text • Reply lines

  19. Email zones in (Lampert et al., 2009) • Sender zones • Author: new content from the current email sender, excluding any text that has been included from previous messages. • Greetings: e.g., “Hi, Mike” • Signoff: e.g., “thanks. AJ” • Quoted conversation zones • Reply: content quoted from a previous message • Forward: Content from an email message outside the current conversation thread that has been forwarded by the current email sender

  20. Email zones (cont) • Boilerplate zones: content that is reused without modification across multiple email messages • Signature • Advertising • Disclaimer • Attachment: automatically generated text

  21. Manual annotation • Annotated data: • almost 400 email messages • 11,881 lines (7,922 non-blank lines) • drawn from the Berkeley database (“berkeley_enron”) • one annotator • Evaluation uses 10-fold cross-validation

  22. Automatic classification • Classifier: SVM • Two approaches: • two-stage (zone fragment classification): • segment a message into zone fragments • classify those fragments • one-stage: • classify each line
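A minimal sketch of the one-stage approach (classify each line independently), assuming scikit-learn; the paper's actual features are the graphic/orthographic/lexical ones on the following slides, so the character n-gram features and toy data here are only stand-ins:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical toy training data: one zone label per line.
    lines  = ["Hi Mike,", "> quoted text from the previous message", "Thanks. AJ"]
    labels = ["greeting", "reply", "signoff"]

    clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
                        LinearSVC())
    clf.fit(lines, labels)
    print(clf.predict(["Cheers,", ">> an earlier reply"]))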

  23. Detecting zone boundaries • Different kinds of boundaries: • Blank boundaries: line 12 • Separate boundaries: lines 17-20 • Adjoining boundaries: lines 10 and 11 • Uses a heuristic approach: • consider every blank line or line beginning with 4+ repeated punctuation marks • cannot handle adjoining boundaries • high recall, low precision
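A sketch of the heuristic in code, under the reading that a candidate boundary is any blank line or any line starting with four or more repetitions of the same punctuation character (adjoining boundaries, with no separating line at all, are missed, as noted above):

    import re

    # Blank line, or a line starting with 4+ repeats of the same punctuation mark
    # (e.g., "-----Original Message-----").
    BOUNDARY_RE = re.compile(r"^\s*$|^([^\w\s])\1{3,}")

    def candidate_boundaries(lines):
        return [i for i, line in enumerate(lines) if BOUNDARY_RE.match(line)]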

  24. Classifying zone fragments • Features: • Graphic features: layout of text in the email • Orthographic features: the use of distinctive chars and char sequences including punctuation, capital letters and numbers • Lexical features: information about the words used in the email text

  25. Graphic features • the number of words in the text fragment • the number of characters in the text fragment • the start position of the text fragment • the end position of the text fragment • the average line length (in chars) within the text fragment • the length of the text fragment relative to the previous fragment • the number of blank lines preceding the text fragment • …
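A sketch of a few of these graphic features for one text fragment; the function signature and fragment representation are my own, not from the paper:

    def graphic_features(fragment, prev_fragment, start_pos, end_pos, blanks_before):
        lines = [l for l in fragment.splitlines() if l.strip()]
        return {
            "num_words": len(fragment.split()),
            "num_chars": len(fragment),
            "start_position": start_pos,
            "end_position": end_pos,
            "avg_line_length": sum(len(l) for l in lines) / max(len(lines), 1),
            "length_relative_to_prev": len(fragment) / max(len(prev_fragment), 1),
            "blank_lines_before": blanks_before,
        }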

  26. Orthographic features • whether all lines start with the same character (e.g., ‘>’); • whether a prior text fragment in the message contains a quoted header; • whether a prior text fragment in the message contains repeated punctuation characters; • whether the text fragment contains a URL; • whether the text fragment contains an email address; • whether the text fragment contains a sequence of four or more digits; • the number of capitalised words in the text fragment; • the percentage of capitalised words in the text fragment; • …
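A similar sketch for some of the orthographic features; the patterns used (e.g., what counts as a quoted header) are assumptions for illustration:

    import re

    def orthographic_features(fragment, prior_fragments):
        words = fragment.split()
        caps = [w for w in words if w[:1].isupper()]
        lines = [l for l in fragment.splitlines() if l.strip()]
        prior = "\n".join(prior_fragments)
        return {
            "all_lines_same_start": len({l[0] for l in lines}) == 1 if lines else False,
            "prior_has_quoted_header": "-----Original Message-----" in prior,
            "prior_has_repeated_punct": bool(re.search(r"([^\w\s])\1{3,}", prior)),
            "has_url": bool(re.search(r"https?://\S+", fragment)),
            "has_email_address": bool(re.search(r"\S+@\S+\.\S+", fragment)),
            "has_4plus_digits": bool(re.search(r"\d{4,}", fragment)),
            "num_capitalised_words": len(caps),
            "pct_capitalised_words": len(caps) / max(len(words), 1),
        }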

  27. Lexical features • word unigram • word bigram • whether the text fragment contains the sender’s name; • whether a prior text fragment in the message contains the sender’s name; • whether the text fragment contains the sender’s initials; and • whether the text fragment contains a recipient’s name.
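And the name-based lexical features (unigram/bigram counts are omitted, since they are standard bag-of-words features); how names and initials are matched here is an assumption:

    def lexical_features(fragment, prior_fragments, sender_name, recipient_names):
        text = fragment.lower()
        prior = "\n".join(prior_fragments).lower()
        initials = "".join(p[0] for p in sender_name.split()).lower()
        return {
            "has_sender_name": sender_name.lower() in text,
            "prior_has_sender_name": sender_name.lower() in prior,
            "has_sender_initials": initials in text.split(),
            "has_recipient_name": any(n.lower() in text for n in recipient_names),
        }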

  28. Results

  29. Confusion matrix for nine-zone line classification

  30. Precision and recall

  31. Issues • Sequence labeling problem: • add features that look at the labels of preceding segments • Is the 9-zone label set sufficient? • How to take advantage of emails in the bigger context? • emails in the same discussion thread • emails by the same sender • general email structure: e.g., greeting, body, signoff, etc.

  32. Deception detection

  33. Papers for today • [11] M.L. Newman, J.W. Pennebaker, D.S. Berry, and J.M. Richards. “Lying words: Predicting deception from linguistic style”. Personality and Social Psychology Bulletin, 29:665–675, 2003. • [13] L. Zhou, J.K. Burgoon, J.F. Nunamaker Jr., and D. Twitchell. “Automating linguistics-based cues for detecting deception in text-based asynchronous computer-mediated communication”. Group Decision and Negotiation, 13:81–106, 2004.

  34. (Newman et al., 2003) • Assumptions: Deceptive communications should be characterized by • fewer first-person singular pronouns (e.g., “I”, “me”, and “my”): to dissociate oneself from one’s statements • more words reflecting negative emotion: liars may feel guilty about lying or about the topic they are discussing • fewer "exclusive" words (e.g., “except”, “but”, “without”) and more action words (e.g., “walk”): due to reduced cognitive resources
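A toy sketch of the word-category rates behind these assumptions; Newman et al. used LIWC lexicons, so the mini word lists below are placeholders, not the real categories:

    # Hypothetical mini-lexicons standing in for the LIWC categories.
    CATEGORIES = {
        "first_person_singular": {"i", "me", "my", "mine", "myself"},
        "negative_emotion":      {"hate", "guilty", "sad", "angry", "worthless"},
        "exclusive":             {"but", "except", "without"},
        "action":                {"walk", "go", "move", "carry"},
    }

    def category_rates(text):
        tokens = text.lower().split()
        total = max(len(tokens), 1)
        return {name: sum(t in words for t in tokens) / total
                for name, words in CATEGORIES.items()}

    print(category_rates("I went out without a word, but I felt guilty"))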

  35. Experiments: Five studies • videotaped abortion attitudes • typed abortion attitudes • handwritten abortion attitudes • feelings about friends • mock crime

  36. Experiments • Trained on four studies and used the "classifier" on the remaining study • Accuracy: about 61% • They found that these four types of words had weights consistent with their assumptions.

  37. (Zhou et al., 2004) • Experiments: • students were asked to exchange emails about a desert survival task • students were asked either to tell the truth or to lie • features: 27 linguistic cues

  38. Hypothesis • Deceptive senders display • higher (a) quantity, (b) expressivity, (c) positive affect, (d) informality, (e) uncertainty, and (f) nonimmediacy, and • less (g) complexity, (h) diversity, and (i) specificity of language in their messages than truthful senders and than their respective receivers

  39. Linguistic cues • quantity: • # of words • # of verbs • # of NPs • # of sentences • expressivity: • # of adj/adv divided by # of nouns and verbs
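A sketch of the quantity and expressivity cues, assuming NLTK's default tokenizers and POS tagger are available (counting NPs would additionally need a chunker and is omitted here):

    import nltk

    def quantity_and_expressivity(text):
        sentences = nltk.sent_tokenize(text)
        tokens = nltk.word_tokenize(text)
        tags = [tag for _, tag in nltk.pos_tag(tokens)]
        verbs = sum(tag.startswith("VB") for tag in tags)
        nouns = sum(tag.startswith("NN") for tag in tags)
        modifiers = sum(tag.startswith(("JJ", "RB")) for tag in tags)
        return {
            "num_words": len(tokens),
            "num_verbs": verbs,
            "num_sentences": len(sentences),
            "expressivity": modifiers / max(nouns + verbs, 1),
        }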

  40. Linguistic cues (cont) • positive affect: expression of positive emotion • informality: # of misspelled words / # of words • uncertainty: • # of modifiers (adj/adv) • # of modal verbs • # of uncertainty words • # of third person pronouns

  41. Linguistic cues (cont) • nonimmediacy: • passive voice • generalizing terms • (fewer) self references • group references: first person plural pronouns

  42. Linguistic cues (cont) • Complexity: • Avg. # of clauses per sentence • Avg. sentence length • Avg. word length • … • Diversity: • lexical diversity • content word diversity • redundancy • …
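Simple illustrations of the diversity cues; taking lexical diversity as the type/token ratio is standard, but reading "redundancy" as the function-word rate is my assumption:

    # Tiny placeholder function-word list.
    FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "but", "is"}

    def diversity_cues(tokens):
        tokens = [t.lower() for t in tokens]
        content = [t for t in tokens if t not in FUNCTION_WORDS]
        total = max(len(tokens), 1)
        return {
            "lexical_diversity": len(set(tokens)) / total,
            "content_word_diversity": len(set(content)) / max(len(content), 1),
            "redundancy": (len(tokens) - len(content)) / total,
        }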

  43. Issues • Different deception settings could affect the cues (e.g., the length of the messages): • interviews • emails • blogs • lying spontaneously vs. being asked to lie

  44. Hw2 • Your presentation • Reading assignments • Suggestions for others’ projects
