Semantic Analysis for E-Discovery Metadata Generation
Explore human-generated document metadata for organizational insights. Analyze semantic and syntactic document content to automate metadata generation. Utilize the Illinois Institute of Technology CDIP dataset for testing and refining the process.
Presentation Transcript
Concepts, Semantics and Syntax in E-Discovery • David Eichmann • Institute for Clinical and Translational Science • The University of Iowa
Our Approach • Analyze the human-generated metadata available for document collections for organizational and individual interactions • Explore the syntactic and semantic nature of document content and the potential for automatic generation of metadata • Explore the concept space generated by the previous step and its correspondence to boolean predicate specification in discovery
Our Target Corpus • The Illinois Institute of Technology Complex Document Information Processing Test Collection (IIT CDIP), v. 1.0 • Derived from the Tobacco Master Settlement Agreement • Comprises 6,910,192 ‘documents’ • Or, more properly, the OCR output from those documents • Two merged XML tag sets of metadata, with overlapping content • <A> • <LTDLWOCR>
Database Schema • We map the XML structure to a set of relational database tables • Non-recurring fields are collected in a table named ‘document’ • docid • title • description • OCR text • Recurring elements each get a table • docid • value
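As a rough illustration of the mapping described on this slide, the sketch below loads one XML record into SQLite: single-valued fields go into a `document` table, and each recurring element gets its own (docid, value) table. The sample record, the element names (`ocr`, `author`) and the helper functions are assumptions made for the example; they are not the actual IIT CDIP tag names or the authors' loader.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical record; element names are illustrative, not the real CDIP tags.
SAMPLE = """<record>
  <docid>doc001</docid>
  <title>Memo</title>
  <description>Internal memo</description>
  <ocr>Some OCR text</ocr>
  <author>Reininghaus,R</author>
  <author>Reininghaus,W</author>
</record>"""

SINGLE_FIELDS = ("docid", "title", "description", "ocr")  # non-recurring
RECURRING_FIELDS = ("author",)                            # one table each


def create_schema(conn):
    """Create the 'document' table plus one (docid, value) table per recurring field."""
    conn.execute(
        "CREATE TABLE document ("
        "docid TEXT PRIMARY KEY, title TEXT, description TEXT, ocr TEXT)"
    )
    for name in RECURRING_FIELDS:
        conn.execute(
            f"CREATE TABLE {name} ("
            "docid TEXT REFERENCES document(docid), value TEXT)"
        )


def load_record(conn, xml_text):
    """Insert one XML record into the relational tables."""
    root = ET.fromstring(xml_text)
    row = [root.findtext(field) for field in SINGLE_FIELDS]
    conn.execute("INSERT INTO document VALUES (?, ?, ?, ?)", row)
    docid = row[0]
    for name in RECURRING_FIELDS:
        for elem in root.findall(name):
            conn.execute(f"INSERT INTO {name} VALUES (?, ?)", (docid, elem.text))


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_record(conn, SAMPLE)
```

Keeping recurring elements in their own tables means a document with many values for one field needs no schema change, and a query such as `SELECT docid FROM author WHERE value LIKE 'Reininghaus%'` finds every document that mentions a given name.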
How Many Reininghaus? • Reininghaus,R • Reininghaus,W
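One simple way to start answering this question is to group the metadata name strings by surname and collect the distinct initials seen for each. The sketch below assumes names follow the "Surname,Initial" pattern shown on the slide; the function name and the third, differently formatted input string are hypothetical additions for illustration.

```python
from collections import defaultdict


def group_by_surname(names):
    """Map each lowercased surname to the set of initials seen with it."""
    groups = defaultdict(set)
    for name in names:
        surname, _, initials = name.partition(",")
        groups[surname.strip().lower()].add(initials.strip().upper())
    return groups


# The first two values come from the slide; the third is a made-up variant
# showing that case and spacing differences collapse to the same key.
groups = group_by_surname(["Reininghaus,R", "Reininghaus,W", "reininghaus, r"])
```

Here `groups["reininghaus"]` holds the initials `R` and `W`, which suggests at least two candidate individuals. Whether those are really two people still has to be decided from other evidence in the collection.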
Semantics and Structure • Our analysis of content involves the following phases: • Lexical analysis • Sentence boundary detection • Named entity recognition • Sentence parsing • Relationship extraction • The nature of the OCR data seriously impacts each of the phases (sometimes in different ways)
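To make the effect of OCR noise on one of these phases concrete, the sketch below uses a deliberately naive sentence splitter that breaks on terminal punctuation followed by white space. Both the splitter and the two sample strings are assumptions for illustration, not the analysis tools used in this work.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by white space."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


clean = "The memo was sent. It arrived."
noisy = "The memo was sent, It arrived."  # hypothetical OCR misread of '.' as ','

clean_sentences = split_sentences(clean)
noisy_sentences = split_sentences(noisy)
```

The clean text yields two sentences, while the noisy version yields one, because a single misread character removes the boundary cue. Errors of this kind then propagate to every later phase that consumes sentences.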
Next Steps • Experiment with custom lexical analysis of the OCR • Start with simple white space detection • Construct a lexicon and look for out-of-band vocabulary as OCR error candidates • Rewrite the analyzer to support OCR error correction • Run sentence boundary detection and parsing over the full corpus • Generate entity relationships using our question answering framework
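The first few steps above might be sketched as follows: split on white space, build a lexicon from tokens that occur often enough across the corpus, and flag everything outside the lexicon as an OCR error candidate. The frequency threshold `min_count`, the punctuation-stripping rule and the three sample strings are assumptions for the example, not part of the planned analyzer.

```python
from collections import Counter


def tokenize(text):
    """Split on white space, strip surrounding punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:!?()[]\"'").lower()
        if tok:
            tokens.append(tok)
    return tokens


def build_lexicon(texts, min_count=2):
    """Keep tokens seen at least min_count times across all texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok for tok, n in counts.items() if n >= min_count}


def error_candidates(text, lexicon):
    """Return tokens of text that are not in the lexicon, in order."""
    return [tok for tok in tokenize(text) if tok not in lexicon]


# Made-up miniature corpus; the third string simulates OCR digit/letter swaps.
docs = [
    "the tobacco company memo",
    "the tobacco industry report",
    "the t0bacco company rep0rt memo",
]
lexicon = build_lexicon(docs)
candidates = error_candidates(docs[2], lexicon)
```

With this toy corpus the candidates for the third string are `t0bacco` and `rep0rt`. A frequency cutoff also flags legitimate rare words (here `industry` and `report` fall outside the lexicon), so candidates would still need a correction or verification step before being treated as errors.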
And Beyond That… • Return to the document images and analyze document layout • Regenerate OCR to include token coordinates • Use our PDF structure extraction framework to generate logical document structure • Generate a set of document models based upon similar layout • Use the document models to map OCR text to metadata elements