Data Mining of E-Mails to Support Periodic & Continuous Assurance

5th Symposium on Information Systems Assurance. Data Mining of E-Mails to Support Periodic & Continuous Assurance. Glen L. Gray, California State University at Northridge; Roger Debreceny, University of Hawai`i at Mānoa. Toronto: October 2007.

Presentation Transcript


  1. 5th Symposium on Information Systems Assurance • Data Mining of E-Mails to Support Periodic & Continuous Assurance • Glen L. Gray, California State University at Northridge • Roger Debreceny, University of Hawai`i at Mānoa • Toronto: October 2007

  2. In this Presentation • Continuous monitoring of emails – why? • Technologies • Social Network Analysis • Text analysis • Challenges • Opportunities

  3. Continuous Monitoring of Emails – Why? • Increased focus on forensic approaches to auditing • Increased interest in continuous assurance and monitoring of business processes • Emails = Organization’s DNA • Evidential matter on: • Employee & management fraud (overrides) • Compliance (e.g., HIPAA) • Loss of intellectual property • Corporate policies

  4. Enron Email Archive • Released by Federal Energy Regulatory Commission • 500K emails • 151 Enron employees • Cleaned version at Carnegie Mellon: www.cs.cmu.edu/~enron/ • Relational DB version at USC: www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf

  5. Email Mining Targets

  6. Content Analysis

  7. Key Word Queries • Yes, people do say self-incriminating things in their emails • Fraud • Corporate dysfunction • Overwhelming false positives • Need “smart” compound queries • Good continuous auditing (CA) candidate • Already scanning for spam, porn, etc.
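One way to read the "smart" compound query idea on this slide is as a proximity test: a risky word counts only when it appears near a word that gives it business context. The sketch below is a minimal illustration under that assumption. The term lists, the `compound_hit` name, and the window size are hypothetical and do not come from the presentation.

```python
import re

# Hypothetical term lists, chosen only for illustration.
RISK_TERMS = {"override", "backdate", "shred"}
CONTEXT_TERMS = {"invoice", "journal", "audit", "reserve"}


def tokens(text):
    """Return the lowercase word tokens of an email body."""
    return re.findall(r"[a-z']+", text.lower())


def compound_hit(text, window=8):
    """True if a risk term occurs within `window` tokens of a context term."""
    words = tokens(text)
    risk_pos = [i for i, w in enumerate(words) if w in RISK_TERMS]
    ctx_pos = [i for i, w in enumerate(words) if w in CONTEXT_TERMS]
    return any(abs(r - c) <= window for r in risk_pos for c in ctx_pos)
```

A single-keyword scan would flag every use of "override"; requiring a nearby context term is one simple way to cut the false positives the slide warns about, at the cost of missing messages that use different wording.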

  8. Sender Deception -- Content • Deceptive emails include: • Fewer first-person pronouns to dissociate themselves from their own words • Fewer exclusive words, such as but and except, to indicate a less complex story • More negative emotion words because of the sender’s underlying feeling of guilt • More action verbs to, again, indicate a less complex story
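The cues on this slide can be approximated as per-word rates over small vocabularies. The sketch below covers three of the four cues. The word lists are illustrative assumptions (only "but" and "except" come from the slide), and action verbs are left out because counting them would need a part-of-speech tagger.

```python
import re
from collections import Counter

# Illustrative vocabularies; only "but" and "except" appear on the slide.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
EXCLUSIVE = {"but", "except", "without"}
NEGATIVE_EMOTION = {"hate", "worried", "afraid", "angry"}


def cue_rates(text):
    """Return the share of words falling in each cue vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty text
    counts = Counter(words)

    def rate(vocab):
        return sum(counts[w] for w in vocab) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "exclusive": rate(EXCLUSIVE),
        "negative_emotion": rate(NEGATIVE_EMOTION),
    }
```

These rates only become meaningful when compared against a baseline, for example the same sender's earlier emails. A single value says little on its own.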

  9. Sender Deception -- Identification • Writeprint features • Lexical -- characters & words • Function words • Root words • Syntactic -- sentences • Structural -- paragraphs • Content-specific
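As a rough illustration of the feature families listed here, the sketch below computes a few lexical, function-word, syntactic, and structural features for one message. The feature names, the function-word subset, and the regular expressions are assumptions made for this example. Root-word and content-specific features are omitted.

```python
import re

# Illustrative subset of function words; a real feature set would be larger.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def writeprint(text):
    """Return a small dictionary of stylistic features for one message."""
    words = re.findall(r"[A-Za-z']+", text)
    lower = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty text

    features = {
        "chars": len(text),                                    # lexical
        "words": len(words),                                   # lexical
        "avg_word_len": sum(len(w) for w in words) / n_words,  # lexical
        "sentences": len(sentences),                           # syntactic
        "paragraphs": len(paragraphs),                         # structural
    }
    for fw in FUNCTION_WORDS:                                  # function words
        features["fw_" + fw] = lower.count(fw) / n_words
    return features
```

Comparing a new message's feature values with a stored per-user profile is the step the next slide flags as challenging for real-time CA.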

  10. Sender Deception -- Identification • Number of potential features unlimited • Optimum number can vary by context and language • Developing user profiles and comparing new emails to profiles would be challenging for real-time CA

  11. Temporal/Log Analysis

  12. Volume & Velocity • Volume = number of emails a person sends and/or receives over a period of time. • Velocity = how quickly the volume changes. • Many external factors (e.g., vacations, seasonal activities, etc.) impact these numbers • Need “rolling histogram”
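The two definitions above translate directly into counts over a sliding window. The sketch below assumes that a "rolling histogram" means a fixed-length window moved forward day by day. That reading, the function names, and the seven-day default are assumptions and are not taken from the slides.

```python
from datetime import date, timedelta


def rolling_volume(sent_dates, end, window_days=7):
    """Count emails dated within the `window_days` days ending on `end` (inclusive)."""
    start = end - timedelta(days=window_days - 1)
    return sum(1 for d in sent_dates if start <= d <= end)


def velocity(sent_dates, end, window_days=7):
    """Change in rolling volume relative to the immediately preceding window."""
    current = rolling_volume(sent_dates, end, window_days)
    previous = rolling_volume(sent_dates, end - timedelta(days=window_days), window_days)
    return current - previous
```

For example, with two emails on `date(2007, 10, 1)` and five on `date(2007, 10, 9)`, calling `velocity(dates, date(2007, 10, 14))` compares the window of 8 to 14 October against 1 to 7 October. A longer window smooths out vacations and seasonal swings but reacts more slowly to real changes.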

  13. Volume & Velocity • Key issue -- determining the optimum time intervals to sample the data • Continuous monitoring cannot be continuous in terms of sampling in real time • Comparing hourly, daily, and even weekly volumes and velocities will result in many false positives • Optimum time interval could vary by job title

  14. Social Network Analysis

  15. Social Network Analysis • Social relationships as an undirected graph • Importance of understanding relationships within the flow of email exchanges

  16. Social Network Analysis in Emails • Emails semi-structured data • sender • primary recipient(s) • copied recipient(s) • date • subject line • Social groups and cliques • CA = who doesn’t belong?
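Because the header fields listed here are structured, a social graph can be built from them without reading message bodies. The sketch below builds an undirected, weighted adjacency map from sender and recipient fields. It then answers a narrow version of "who doesn't belong?" by listing recipients the sender has never exchanged mail with. The message dictionary layout, the function names, and the addresses in the usage note are hypothetical.

```python
from collections import defaultdict


def build_graph(messages):
    """Build an undirected adjacency map: address -> {neighbor: message count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for msg in messages:
        for rcpt in msg["to"] + msg.get("cc", []):
            graph[msg["sender"]][rcpt] += 1
            graph[rcpt][msg["sender"]] += 1
    return graph


def unfamiliar_recipients(graph, msg):
    """List recipients of `msg` that have no existing edge with its sender."""
    known = graph.get(msg["sender"], {})
    return [r for r in msg["to"] + msg.get("cc", []) if r not in known]
```

Given a history containing one message from `a@x` to `b@x` with `c@x` copied, a new message from `a@x` to `b@x` and `z@y` would return only `z@y`. Detecting full cliques or groups would need a graph library or a clustering step, which this sketch does not attempt.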

  17. Thread Analysis – This? [Figure: email thread laid out along a Time axis, with nodes labeled S, R, and C]

  18. Thread Analysis – Or this? [Figure: alternative email thread laid out along a Time axis, with nodes labeled S, R, and C]

  19. Integrating Content Analysis and Social Network Analysis

  20. Challenges of Email Mining • Textual • Inconsistent use of abbreviations • Misspelled words • Smileys etc. etc. • Replies, replies, and more replies… • Inability to identify: • Identities of email participants • anon@anon.mail.sender.net • Roles and responsibilities

  21. What Do the Enron Emails Show? • People do say the darndest things • What did he know and when did he know it? • Verified numerous bodies of email data mining research • Content analysis • Social network analysis

  22. Tools • Content monitoring • eSoft Corporation’s ThreatWall • Symantec’s Mail Security 8x00 Series • Vericept Corporation’s Vericept Content 360° • Reconnex Corporation’s iGuard Appliance • InBoxer, Inc. Anti-Risk Appliance • Social networks • Microsoft SNARF • Heer’s Vizster

  23. Research Opportunities

  24. Research Questions • Role of email monitoring in overall CA environment? • Join SNA with examination of textual patterns. • Link SNA with control environment • Frauds/control overrides footprint? • What email cleaning is required for CA purposes? • Privacy and policy issues? • Lessons from existing commercial products?

  25. Your Questions • Thank You • glen.gray@csun.edu • rogersd@hawaii.edu
