How data mining techniques were applied to a document imaging project: problem definition, solution and methodology, tools used, and management outcomes, along with lessons learned about the data mart and the classification model.
Data Mining Applied to Document Imaging Jeff Rekoske
Agenda • Introduction • Problem Definition • Solution and Methodology • Progress Report • Tools • Techniques Applied from CSC-288 • Lessons Learned/Reinforced • Summary
Introduction • Employed as SW Developer and DBA on document imaging project • Access to OCR statistics • Management staff has a few questions that can be answered by analysis of existing data
Problem Definition • Two Parts • Management questions • Data mining demonstration
Management Questions • Result of interviews • Fairly basic • What forms are processed the most? • What are the recognition rates for the top forms? • What is the percentage of forms that were presented to an operator for keying?
Data Mining Demonstration • Purpose is to show the usefulness of data mining techniques. • Prediction of rates for new forms • Characteristics of highly recognized forms • Use mined data to develop new forms
Solution • Data mart • Answer management questions • Provide data for mining activities
Methodology • Choose a small timeframe to sample data • September – October 2004 • Use ETL to load data • Relatively “clean” process due to data location • Apply SQL statements to data mart to answer management questions
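The three management questions map naturally onto aggregate SQL queries against the data mart. The following is a minimal sketch only: it uses Python's built-in `sqlite3` module in place of Oracle, and the table names (`document`, `ocr_field`), column names, and sample rows are all hypothetical, since the slides do not show the actual schema.

```python
import sqlite3

# Hypothetical data mart tables; names, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (
    doc_id           INTEGER PRIMARY KEY,
    form_type        TEXT,
    sent_to_operator INTEGER   -- 1 if the form was presented to an operator for keying
);
CREATE TABLE ocr_field (
    doc_id     INTEGER,
    field_name TEXT,
    recognized INTEGER         -- 1 if OCR recognized the field
);
""")
conn.executemany("INSERT INTO document VALUES (?, ?, ?)", [
    (1, "FORM_A", 0), (2, "FORM_A", 1), (3, "FORM_A", 0), (4, "FORM_B", 1),
])
conn.executemany("INSERT INTO ocr_field VALUES (?, ?, ?)", [
    (1, "name", 1), (1, "date", 1),
    (2, "name", 0), (2, "date", 1),
    (3, "name", 1), (3, "date", 1),
    (4, "name", 0), (4, "date", 0),
])

# Question 1: which forms are processed the most?
form_counts = conn.execute("""
    SELECT form_type, COUNT(*) AS n_docs
    FROM document
    GROUP BY form_type
    ORDER BY n_docs DESC, form_type
""").fetchall()

# Question 2: what is the field recognition rate for each form?
recognition_rates = conn.execute("""
    SELECT d.form_type, AVG(f.recognized) AS rate
    FROM document d
    JOIN ocr_field f ON f.doc_id = d.doc_id
    GROUP BY d.form_type
    ORDER BY d.form_type
""").fetchall()

# Question 3: what percentage of forms went to an operator for keying?
pct_keyed = conn.execute(
    "SELECT 100.0 * SUM(sent_to_operator) / COUNT(*) FROM document"
).fetchone()[0]

print(form_counts, recognition_rates, pct_keyed)
```

Restricting question 2 to the top forms would only need a `WHERE` clause or a join against the result of question 1.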
Methodology (continued) • Extract data from data mart to create WEKA files • Attribute-Relation File Format (ARFF) • Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition) • Validate model with 10-fold cross validation
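An ARFF file is plain text: a `@relation` line, one `@attribute` line per column (either `numeric` or a brace-enclosed list of nominal values), then a `@data` section of comma-separated rows. The project generated its ARFF files with PL/SQL scripts; the sketch below shows the same idea in Python. The function name `to_arff` and the attributes `form_type`, `field_length`, and `recognition` are assumptions made for illustration, and the sketch does not quote values that contain spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, spec) pairs, where spec is either a type
    string such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            # Nominal attribute: enumerate the allowed values in braces.
            spec = "{" + ",".join(spec) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes; the last one is the pass/fail class label.
arff_text = to_arff(
    "ocr_fields",
    [
        ("form_type", ["FORM_A", "FORM_B"]),
        ("field_length", "numeric"),
        ("recognition", ["pass", "fail"]),
    ],
    [("FORM_A", 12, "pass"), ("FORM_B", 30, "fail")],
)
print(arff_text)
```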
Progress Report • First part (management questions) complete • 14,210 imaged documents • 865,409 OCR fields • View created that joins tables • Allows non-technical personnel to create basic queries • Management is pleased with results
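A view that pre-joins the tables lets a non-technical user query a single flat name instead of writing joins. The sketch below illustrates the idea with Python's `sqlite3` module rather than Oracle; the tables `form`, `document`, and `ocr_field` and the view name `ocr_field_v` are hypothetical, not the project's actual objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical normalized tables (names are illustrative only).
CREATE TABLE form (form_id INTEGER PRIMARY KEY, form_name TEXT);
CREATE TABLE document (
    doc_id  INTEGER PRIMARY KEY,
    form_id INTEGER REFERENCES form(form_id)
);
CREATE TABLE ocr_field (
    field_id   INTEGER PRIMARY KEY,
    doc_id     INTEGER REFERENCES document(doc_id),
    field_name TEXT,
    recognized INTEGER
);

-- The view hides the joins behind one flat, queryable name.
CREATE VIEW ocr_field_v AS
SELECT fm.form_name, d.doc_id, f.field_name, f.recognized
FROM ocr_field f
JOIN document d ON d.doc_id = f.doc_id
JOIN form fm ON fm.form_id = d.form_id;

INSERT INTO form VALUES (1, 'FORM_A'), (2, 'FORM_B');
INSERT INTO document VALUES (10, 1), (11, 2);
INSERT INTO ocr_field VALUES
    (100, 10, 'name', 1), (101, 10, 'date', 0), (102, 11, 'name', 1);
""")

# A basic query against the view, with no joins required of the user.
fields_per_form = conn.execute("""
    SELECT form_name, COUNT(*) AS n_fields
    FROM ocr_field_v
    GROUP BY form_name
    ORDER BY form_name
""").fetchall()
print(fields_per_form)
```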
Progress Report (continued) • Part two (WEKA classifier) in progress • ARFF generation scripts complete • Need to run ARFF files through WEKA • Need to cross-validate results
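For readers unfamiliar with the remaining validation step, k-fold cross-validation splits the data into k folds, holds each fold out once as a test set, trains on the rest, and pools the test results. The sketch below shows the mechanics in plain Python. It is not WEKA and not C4.5: a majority-class baseline stands in for the classifier, and the function names and the 70/30 label mix are invented for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_majority(labels, k=10, seed=0):
    """Pooled k-fold accuracy of a majority-class baseline.

    The baseline is a stand-in for a real classifier: it "trains" by
    finding the most common label in the training folds and predicts
    that label for every held-out instance.
    """
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        correct += sum(1 for i in test_idx if labels[i] == majority)
    return correct / len(labels)


# Invented sample: 70 fields recognized ("pass"), 30 not ("fail").
labels = ["pass"] * 70 + ["fail"] * 30
folds = k_fold_indices(len(labels))
accuracy = cross_validate_majority(labels)
print(accuracy)
```

A baseline like this is a useful reference point: a trained classifier is only informative if its cross-validated accuracy beats the majority-class rate.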
Tools • Oracle 8i RDBMS • Oracle PL/SQL scripting language • WEKA implementation of C4.5 classifier • WEKA cross validation
Techniques Applied from CSC-288 • Data Mart • Snowflake Schema • ETL • OLAP Operations
Techniques Applied (continued) • Classification • C4.5 Algorithm • Supervised Learning • Credibility • Cross-Validation
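C4.5 chooses the attribute to split on by gain ratio: the information gain of the split divided by the entropy of the split itself, which penalizes attributes with many distinct values. The sketch below computes that criterion for a nominal attribute. It is a simplified illustration rather than WEKA's implementation (no numeric thresholds, missing values, or pruning), and the attribute values `typed` and `hand` are made up.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy, in bits, of the value distribution in items."""
    total = len(items)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(items).values()
    )


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a nominal attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class label after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own distribution.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["pass", "pass", "fail", "fail"]
# Hypothetical attribute that separates the classes perfectly.
perfect = gain_ratio(["typed", "typed", "hand", "hand"], labels)
# Hypothetical attribute that carries no information about the class.
useless = gain_ratio(["typed", "hand", "typed", "hand"], labels)
print(perfect, useless)
```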
Lessons Learned/Reinforced • Get firm requirements (if possible) • Data marts can get large quickly • OLAP operations should be performed offline, away from the OLTP system • Demonstrations are useful for explaining concepts
Summary • Application of knowledge from CSC-288 to my work • Data mart can be used to answer multiple questions without affecting OLTP processing • Hope to demonstrate using the data mart to create a classification model
References • "Data Mining: Concepts and Techniques," by Jiawei Han and Micheline Kamber, Morgan Kaufmann, San Francisco, 2001. • "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.