Semantic Analysis for E-Discovery Metadata Generation

Concepts, Semantics and Syntax in E-Discovery • David Eichmann • Institute for Clinical and Translational Science • The University of Iowa

Our Approach • Analyze the human-generated metadata available for document collections for organizational and individual interactions • Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata • Explore the concept space generated by the previous step and its correspondence to Boolean predicate specification in discovery

Our Target Corpus • The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0 • Derived from the tobacco master settlement agreement • Comprises 6,910,192 ‘documents’ • Or more properly the OCR output from those documents • Two merged XML tag sets of metadata, with overlapping content • <A> • <LTDLWOCR>

Metadata Entity Frequencies

Database Schema • We map the XML structure to a set of relational database tables • Non-recurring fields are collected in a table named ‘document’ • docid • title • description • OCR text • Recurring elements each get a table • docid • value
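The mapping described on this slide can be sketched as follows. This is a minimal illustration, not the project's actual loader: the XML tag names (`record`, `ocr`, `author`, `recipient`), the sample record, and the choice of SQLite are all assumptions made for the example. Only the `document` table and the docid/value shape of the recurring-element tables come from the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; the real IIT CDIP tag names and values differ.
SAMPLE = """
<record>
  <docid>doc-0001</docid>
  <title>Example title</title>
  <description>Example description</description>
  <ocr>Example OCR text</ocr>
  <author>Doe,J</author>
  <author>Roe,A</author>
  <recipient>Poe,E</recipient>
</record>
"""

SINGLE = ("docid", "title", "description", "ocr")  # non-recurring fields
RECURRING = ("author", "recipient")               # assumed recurring elements


def create_schema(conn):
    # One table for the non-recurring fields of each document.
    conn.execute(
        "CREATE TABLE document (docid TEXT PRIMARY KEY, title TEXT, "
        "description TEXT, ocr TEXT)"
    )
    # One (docid, value) table per recurring element.
    # Table names come from the fixed tuple above, never from input data.
    for name in RECURRING:
        conn.execute(f"CREATE TABLE {name} (docid TEXT, value TEXT)")


def load_record(conn, xml_text):
    root = ET.fromstring(xml_text.strip())
    row = [root.findtext(tag) for tag in SINGLE]
    conn.execute("INSERT INTO document VALUES (?, ?, ?, ?)", row)
    docid = row[0]
    for name in RECURRING:
        for elem in root.findall(name):
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (docid, elem.text))


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
authors = [r[0] for r in conn.execute("SELECT value FROM author ORDER BY value")]
```

Keeping each recurring element in its own narrow table makes frequency counts a single GROUP BY over the value column.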

Identifying an Individual

How Many Reininghaus? • Reininghaus,R • Reininghaus,W
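One way to approach the question on this slide is to group the name strings found in the metadata by surname and list the distinct initials seen for each. The sketch below assumes names arrive in a "Surname,Initials" form like the two shown above; the third input variant is hypothetical and only illustrates normalisation. It counts surface variants and does not decide whether two variants are the same person.

```python
from collections import defaultdict


def split_name(raw):
    """Split 'Surname,Initials' into (surname, initials), normalising case."""
    surname, _, initials = raw.partition(",")
    return surname.strip().lower(), initials.strip().upper().replace(".", "")


def group_by_surname(names):
    """Map each surname to the sorted list of distinct initials seen with it."""
    groups = defaultdict(set)
    for raw in names:
        surname, initials = split_name(raw)
        groups[surname].add(initials)
    return {surname: sorted(inits) for surname, inits in groups.items()}


# The last entry is a made-up spelling variant, not taken from the corpus.
names = ["Reininghaus,R", "Reininghaus,W", "REININGHAUS, W."]
variants = group_by_surname(names)
```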

Co-mention Connections
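The slide does not define co-mention precisely. A simple reading, assumed here, is that two names are connected whenever they appear in the metadata of the same document. The sketch below counts such pairs; the document ids and names are placeholders.

```python
from collections import Counter
from itertools import combinations


def co_mentions(doc_names):
    """Count how many documents mention each unordered pair of names.

    doc_names maps a document id to the names found in its metadata.
    """
    counts = Counter()
    for names in doc_names.values():
        # Deduplicate within a document and sort so (a, b) is a stable key.
        for a, b in combinations(sorted(set(names)), 2):
            counts[(a, b)] += 1
    return counts


# Placeholder data, not drawn from the corpus.
docs = {
    "d1": ["Doe,J", "Roe,A"],
    "d2": ["Roe,A", "Doe,J", "Poe,E"],
    "d3": ["Poe,E"],
}
pairs = co_mentions(docs)
```

The resulting counts can serve as edge weights in a connection graph; the same function applies to organization names for affiliation links.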

Co-mention Affiliations

Semantics and Structure • Our analysis of content involves the following phases: • Lexical analysis • Sentence boundary detection • Named entity recognition • Sentence parsing • Relationship extraction • The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
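To illustrate how OCR noise propagates through the early phases, here is a deliberately naive sketch of lexical analysis and sentence boundary detection using regular expressions. It is not the analyzer used in this work, and the example sentences are invented. When OCR drops a single period, the boundary detector merges two sentences, and every later phase then sees a longer, harder input.

```python
import re

TOKEN = re.compile(r"\w+|[^\w\s]")  # runs of word characters, or one punctuation mark


def tokenize(text):
    """Lexical analysis: split text into word and punctuation tokens."""
    return TOKEN.findall(text)


def split_sentences(text):
    """Naive boundary detection: break after . ! ? when a capital letter follows."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


clean = "The memo was sent on Monday. It was filed on Tuesday."
noisy = "The memo was sent on Monday It was filed on Tuesday."  # period lost in OCR

clean_sentences = split_sentences(clean)
noisy_sentences = split_sentences(noisy)
```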

CDIP Parse Tree Complexity

Clean Text Parse Tree Complexity

Next Steps • Experiment with custom lexical analysis of the OCR • Start with simple white space detection • Construct a lexicon and look for out-of-vocabulary tokens as OCR error candidates • Rewrite the analyzer to support OCR error correction • Run sentence boundary detection and parse the full corpus • Generate entity relationships using our question answering framework
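The first two steps above (whitespace tokenization, then a lexicon check) might look roughly like the following. This is a sketch under stated assumptions: the lexicon is built from a small invented reference text, and the OCR line with its errors is made up for illustration. A real lexicon would come from a much larger clean source.

```python
import re
from collections import Counter


def normalise(token):
    """Strip leading and trailing punctuation and lowercase the token."""
    return re.sub(r"^\W+|\W+$", "", token).lower()


def build_lexicon(reference_texts):
    """Collect the vocabulary of known-good text, tokenized on whitespace."""
    words = {normalise(t) for text in reference_texts for t in text.split()}
    words.discard("")
    return words


def oov_candidates(ocr_text, lexicon):
    """Count tokens absent from the lexicon; these are OCR error candidates."""
    seen = Counter()
    for tok in ocr_text.split():
        word = normalise(tok)
        if word and word not in lexicon:
            seen[word] += 1
    return seen


# Invented examples: 'tobacc0' and 'settlernent' mimic typical OCR confusions.
reference = ["the tobacco settlement agreement was signed", "the report was filed"]
ocr = "The tobacc0 settlernent agreement was signed."
lexicon = build_lexicon(reference)
candidates = oov_candidates(ocr, lexicon)
```

Tokens flagged this way are only candidates: proper names and rare terms will also fall outside the lexicon, so a correction step needs further evidence before rewriting them.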

And Beyond That… • Return to the document images and analyze document layout • Regenerate OCR to include token coordinates • Use our PDF structure extraction framework to generate logical document structure • Generate a set of document models based upon similar layout • Use the document models to map OCR text to metadata elements
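As a rough illustration of the last step, a document model could be represented as a set of named page regions, with each OCR token assigned to the region that contains its coordinates. Everything here is hypothetical: the `Token` and `Region` types, the coordinate system, the memo layout, and the sample tokens are assumptions for the sketch, not the framework referred to above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    x: float  # horizontal position of the token on the page
    y: float  # vertical position, increasing down the page


@dataclass(frozen=True)
class Region:
    name: str  # metadata element this region maps to
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, tok):
        return self.x0 <= tok.x < self.x1 and self.y0 <= tok.y < self.y1


def map_tokens(tokens, model):
    """Assign each token to the first model region containing it, in reading order."""
    out = {region.name: [] for region in model}
    for tok in sorted(tokens, key=lambda t: (t.y, t.x)):
        for region in model:
            if region.contains(tok):
                out[region.name].append(tok.text)
                break
    return {name: " ".join(words) for name, words in out.items()}


# Hypothetical layout for a memo: 'to' line above 'from' line.
memo_model = [Region("to", 0, 0, 100, 10), Region("from", 0, 10, 100, 20)]
tokens = [Token("Doe,J", 20, 12), Token("Roe,A", 20, 2), Token("Body", 20, 50)]
fields = map_tokens(tokens, memo_model)
```

Tokens outside every region, such as body text, are left unmapped.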

For Example
