Reducing E-Discovery Cost by Filtering Included Emails

Tsuen-Wan “Johnny” Ngan, Symantec Research Labs

The E-Discovery Problem • Email has become a core part of communications • Storage is a pain • Problem worsened by legislation such as the Sarbanes-Oxley Act (SOX) • E-discovery: discovery of evidence from electronic data in civil litigation • Emails are manually reviewed by lawyers • Time-consuming and expensive • Reduce this cost by reviewing fewer emails

The E-Discovery Process Electronic Discovery Reference Model • Identification • Preservation • Collection • Processing • Review • Analysis • Production • Presentation

The E-Discovery Process • Some stages are done once • Others are done once per litigation

The E-Discovery Process • Moving through the stages: volume decreases, relevance increases


How to Filter? • Must be careful not to remove valuable evidence • The last email in an email thread often contains the whole conversation • (More?) likely in a corporate environment between executives • Other emails can be ignored without affecting accuracy • Grouping "similar" emails can also expedite review

Basic Unit to Compare Emails • When is an email included? • The whole email verbatim? • All sentences in any order? • The paragraph is chosen as a midpoint • The "idea" is usually preserved • Usually unmodified after quotation • Fewer of them, so comparisons are efficient
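One way to read the paragraph-level definition above is as a set-containment test: an email is included in another if every one of its paragraphs also appears in the other. The following is a minimal sketch only, assuming paragraphs are separated by blank lines and compared after trimming whitespace; the names `paragraphs` and `is_included` are hypothetical and not taken from the slides.

```python
import re

def paragraphs(body: str) -> set[str]:
    """Split an email body into a set of non-empty, trimmed paragraphs.

    Assumption: a paragraph boundary is one or more blank lines.
    """
    parts = re.split(r"\n\s*\n", body)
    return {p.strip() for p in parts if p.strip()}

def is_included(inner: str, outer: str) -> bool:
    """True if every paragraph of `inner` also appears in `outer`."""
    return paragraphs(inner) <= paragraphs(outer)

first = "Can we meet Monday?"
reply = "Monday works for me.\n\nCan we meet Monday?"

print(is_included(first, reply))  # the reply carries the first email's only paragraph
print(is_included(reply, first))  # the reply has a paragraph the first email lacks
```

Because order is ignored, this is looser than verbatim inclusion of the whole email but coarser than sentence-level matching, which matches the "midpoint" role the slide gives to paragraphs.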

System Overview • Target use in a live email archive system • Emails arrive over time • Need to find inclusion in both directions • Does it include any other email? • Is it included by any other email? • Given an email: • Find candidate emails by finding shared paragraphs • Bottleneck: Some paragraphs are shared by many emails • "Hi", "Thanks", "John", ads, disclaimers

Popular vs. Unpopular Paragraphs • Build two inverted indices • Unpopular paragraphs to emails • Popular paragraphs to emails • For emails with unpopular paragraphs • Only use these unpopular paragraphs to find candidates • For emails with only popular paragraphs • Need to compare with many candidates • But this is extremely rare!
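The two-index idea above can be sketched as follows. This is an illustration under stated assumptions, not the system's actual implementation: the class name `ParagraphIndex`, the fixed popularity threshold, and the rule for promoting a paragraph from unpopular to popular are all hypothetical. It covers only candidate lookup; each candidate would still need a full inclusion check.

```python
from collections import defaultdict

POPULAR_THRESHOLD = 3  # hypothetical cutoff: shared by this many emails => popular

class ParagraphIndex:
    """Two inverted indices: unpopular and popular paragraphs -> email ids."""

    def __init__(self, threshold: int = POPULAR_THRESHOLD):
        self.threshold = threshold
        self.unpopular = defaultdict(set)
        self.popular = defaultdict(set)

    def add(self, email_id, paras):
        for p in set(paras):
            if p in self.popular:
                self.popular[p].add(email_id)
                continue
            self.unpopular[p].add(email_id)
            # Promote a paragraph once enough emails share it (assumed rule).
            if len(self.unpopular[p]) >= self.threshold:
                self.popular[p] = self.unpopular.pop(p)

    def candidates(self, email_id, paras):
        """Emails sharing a paragraph with this one, preferring unpopular ones."""
        paras = set(paras)
        rare = [p for p in paras if p not in self.popular]
        if rare:
            # Common case: look up only the unpopular paragraphs.
            hits = [self.unpopular.get(p, set()) for p in rare]
        else:
            # Rare case: only popular paragraphs, so many candidates.
            hits = [self.popular[p] for p in paras]
        found = set().union(*hits)
        found.discard(email_id)
        return found

idx = ParagraphIndex(threshold=3)
idx.add("e1", ["Thanks", "Budget draft attached"])
idx.add("e2", ["Thanks", "Budget draft attached", "Looks good to me"])
idx.add("e3", ["Thanks", "Lunch on Friday?"])
print(idx.candidates("e2", ["Thanks", "Budget draft attached", "Looks good to me"]))
```

In this toy data "Thanks" becomes popular after the third email, so the lookup for "e2" skips it and never considers "e3" as a candidate.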

Bloom Filters to Compare Subsets • A space-efficient data structure to test set membership • Extended to test for subsets • Fast way to filter false positives
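The slides do not spell out how the Bloom filter is extended to subsets; one common construction, shown here as a sketch, builds one filter per email over its paragraphs and checks that every bit set for the smaller email is also set for the larger one. The filter size, number of hash functions, and the function names `bloom` and `maybe_subset` are assumptions for illustration.

```python
import hashlib

M_BITS = 256   # filter size in bits (hypothetical)
K_HASHES = 3   # bit positions set per item (hypothetical)

def bloom(items, m: int = M_BITS, k: int = K_HASHES) -> int:
    """Build a Bloom filter over `items`, stored as a Python int bit vector."""
    bits = 0
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(k):
            # Derive k positions from disjoint 4-byte slices of one digest.
            pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % m
            bits |= 1 << pos
    return bits

def maybe_subset(a_bits: int, b_bits: int) -> bool:
    """False means A is definitely not a subset of B.

    True means A may be a subset of B and needs an exact check.
    """
    return (a_bits & ~b_bits) == 0

small = bloom({"Can we meet Monday?"})
large = bloom({"Can we meet Monday?", "Monday works for me."})
print(maybe_subset(small, large))  # always True for a real subset
```

The test is one bitwise operation per candidate pair, so it can cheaply discard most non-including candidates before any paragraph-by-paragraph comparison. A pass is not proof of inclusion, since unrelated paragraphs can collide on the same bits.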

Experiment Result Highlights • Data Sets: • Enron email trace (517k emails at 961MB) • Mailing list discussion groups (487k emails at 680MB) • Duplicated emails are removed in advance • ~20% of emails can be filtered • Processing speed: 2 to 4MB/s on commodity hardware • Scales reasonably well • Processing the last 1% of emails is only 40 to 50% slower than the first 1%

Summary • Observation: Emails usually contain unpopular paragraphs • Experiments showed a ~20% reduction in emails • Huge cost savings for reviews • Computation is fast enough for practical usage • Dividing popular and unpopular paragraphs is a special case • Could potentially divide into more levels • Benefit from finer granularity left as future work

Thank You!

Backup slides

Email Threads • Cannot simply use thread ids to find all threads • They may not always be available • They may not be compatible • Threads != Inclusion • Emails in the same thread may not include each other • Emails in different threads may include each other • Still need to do all comparisons

Implementation Highlights • Remove text generated by email software • Divide email into paragraphs • Hash the alphanumeric characters in each paragraph • Remove formatting characters • Use Bloom filters for fast approximate subset test • Inverted index built (paragraph -> email) • Popular paragraphs become the bottleneck • Handle popular/unpopular paragraphs differently
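The paragraph-splitting and hashing steps above might look like the following sketch. The quote-marker stripping, the blank-line paragraph boundary, the choice of SHA-1, and the function name `paragraph_hashes` are assumptions for illustration; the slides do not specify them.

```python
import hashlib
import re

def paragraph_hashes(body: str) -> set[str]:
    """Return one hash per paragraph, computed over alphanumeric characters only."""
    # Assumed cleanup step: drop leading quote markers such as "> " on each line,
    # so a quoted blank line (">") still separates paragraphs.
    lines = [re.sub(r"^[>\s]+", "", line) for line in body.splitlines()]
    hashes = set()
    for para in re.split(r"\n\s*\n", "\n".join(lines)):
        # Keep letters and digits only; whitespace, punctuation, and other
        # formatting characters do not affect the hash.
        key = "".join(ch for ch in para if ch.isalnum())
        if key:
            hashes.add(hashlib.sha1(key.encode("utf-8")).hexdigest())
    return hashes

original = "Can we meet Monday?\n\nBring the budget."
reply = "Yes.\n\n> Can we meet Monday?\n>\n> Bring the budget."

# The reply quotes both paragraphs, so the original's hashes are a subset.
print(paragraph_hashes(original) <= paragraph_hashes(reply))
```

Hashing only alphanumeric characters means re-wrapped or re-indented quotations still match, and short paragraphs are kept rather than dropped.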

Cannot Ignore Short Paragraphs • A short paragraph like "No" can carry important meaning • Ignoring them could lose important evidence
