Knowledge Engineering for Document Analysis and Classification. Al Klein, BWXT Y-12, LLC; Charles Wilson, AreteQ, Inc. October 28, 2004. Outline for Presentation. Introduction and Background. Capabilities and use of Ferret expert system for classification support
Knowledge Engineering for Document Analysis and Classification Al Klein, BWXT Y-12, LLC Charles Wilson, AreteQ, Inc. October 28, 2004
Outline for Presentation • Introduction and Background • Capabilities and use of Ferret expert system for classification support • Introduce basic knowledge engineering techniques used in developing the Ferret knowledgebase • Describe and demonstrate the added capabilities and enhancements for the commercial version of Ferret (“Q”)
Y-12 Classification Support Background • The Oak Ridge Y-12 National Security Complex (NSC) is a DOE nuclear weapons design and production facility. • Because nuclear weapons design information comprises some of the country’s most vital secrets, Y-12 has been in the information protection business for a long time. • We developed a rule-based inference engine, Ferret, to assist the Authorized Derivative Classifiers (ADCs) in analyzing and classifying documents.
Ferret Background • Ferret development at Y-12 began about 4 years ago as an offshoot of DOE’s automated expert classification project, Reviewer Assistant System (RAS) • Developed as a software assist for Authorized Derivative Classifiers at Y-12 • Y-12 classification guide knowledge base is being used by ADCs • Comprises about 4500 concepts and over 1000 rules • Ferret development has been funded by the Y-12 National Security Complex, through support from DOE and NNSA • Ferret technology is patented and is being commercialized by the commercial licensee, AreteQ, Inc. • Refinements and enhancements to the Ferret code and functionality, released under the name “Q”, are being made to improve speed and performance
“Ferret” (Y-12 Classification Application) • Expert system to find classified information within U.S. Nuclear Weapons Complex • Identifies occurrences (“hits”) of information of value • Finds relevant data excerpts (information) within long streams • Mimics what a human analyst would surmise in given context • Reviews plain text, no formatting required • Increases reviewer productivity & review accuracy [Figure label: Information of Value]
Classification System Comparisons • Ferret is a rule-based expert system • Other classification software uses neural net, statistical analysis, or algorithm-driven pattern matching for categorization or classification • Rule-based systems require a larger upfront investment but yield greater accuracy within a specific knowledge domain • A simple example follows, showing the distinction between a key word or “dirty word” search and Ferret
Consider the following passage: “It was a dark street at the corner of Bush and Stockton streets. It was a long street, a dark street, a mysterious street, a deserted street. I noticed that the street lights were growing dimmer as I approached the end of the street and casting dark shadows against the brick walls. When I got to the end of the street I was attacked by a mugger.”
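The contrast can be sketched in a few lines of Python. This is only an illustration, not Ferret's implementation: the choice of “street” as the dirty word, the pairing of “street” with “mugger” as the concepts of interest, and the use of one sentence as one chunk are all assumptions made for the example.

```python
import re

PASSAGE = (
    "It was a dark street at the corner of Bush and Stockton streets. "
    "It was a long street, a dark street, a mysterious street, a deserted street. "
    "I noticed that the street lights were growing dimmer as I approached the end "
    "of the street and casting dark shadows against the brick walls. "
    "When I got to the end of the street I was attacked by a mugger."
)

def dirty_word_hits(text, word):
    """Keyword search: every occurrence of the word counts as a hit."""
    return len(re.findall(re.escape(word), text.lower()))

def rule_hits(text, concepts):
    """Rule-style search: a sentence is a hit only if all concepts co-occur in it."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if all(c in s.lower() for c in concepts)]

# Hypothetical choices of dirty word and concept pair (assumptions, see lead-in).
keyword_count = dirty_word_hits(PASSAGE, "street")
contextual = rule_hits(PASSAGE, ["street", "mugger"])

print("keyword hits:", keyword_count)
print("rule hits:", contextual)
```

The keyword search flags the word wherever it appears, while the rule returns only the sentence in which the concepts occur together.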
Role of Classification [Figure: diagram with labels Information (Expert), Policy, Classification, topic, Keys A, B, C, and “Treasure”.]
Classification and Ferret • Classifiers identify the presence of sensitive information (information of value to an adversary) that can readily arise in “Enterprise” information • Information of value (IOV) is characterized by classification topics, essentially all of which can be cast in the form: “ID# If (A, B, C, ...) then classified” • A, B, C are Trigger Concepts • (A, B, C, ...) prescribes a logical relationship between the concepts • If (A, B, C, ...) is present in a “chunk” of text, then the topic with ID# is relevant • Much of the art of classification is to know when sensitive collections of these trigger concepts (or their indicators) are present in a chunk of text • Ferret is an automated classifier that simulates the capabilities of a classification expert in identifying instances of IOV • Ferret provides fast and accurate classification by applying live expert knowledge with the speed and thoroughness of an electronic scan • Ferret has been most extensively used for guidance covering nuclear weapons design information, but it has applications in several other areas
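The topic form “ID# If (A, B, C, ...) then classified” can be sketched as follows. This is a minimal illustration under stated assumptions, not Ferret's engine: the topic ID, the trigger words, the sentence-sized chunks, and the use of a plain logical AND as the relationship between concepts are all hypothetical.

```python
import re

# Hypothetical topic: the ID and trigger concepts are made up for illustration.
TOPIC = {"id": "T-001", "triggers": ["alpha", "bravo", "charlie"]}

def split_into_chunks(text):
    """Treat each sentence as one chunk (an assumed chunking rule)."""
    return [c.strip() for c in re.split(r"(?<=[.!?])\s+", text) if c.strip()]

def topic_is_relevant(topic, chunk):
    """Assumed relationship: logical AND, every trigger must occur in the chunk."""
    words = set(re.findall(r"[a-z0-9-]+", chunk.lower()))
    return all(t in words for t in topic["triggers"])

def find_hits(topic, text):
    """Return (topic ID, chunk) for every chunk in which the topic is relevant."""
    return [(topic["id"], c) for c in split_into_chunks(text)
            if topic_is_relevant(topic, c)]

text = "Alpha and bravo met charlie. Alpha met bravo alone."
hits = find_hits(TOPIC, text)
print(hits)
```

Only the first sentence contains all three triggers, so only it is reported as a hit; the second sentence mentions two of the three and is passed over.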
Aunt Granny’s Pound Cake Topic 23: The fact that baking powder, eggs, and hominy grits are used in Aunt Granny’s pound cake manufacture is SRD. • Keywords are Baking-powder, Egg(s), Hominy-grits, and Aunt-Granny’s-pound-cake. • Of these, only “Egg(s)” is a single-word concept. • “Baking-powder,” “Hominy-grits,” and “Aunt-Granny’s-pound-cake” are all compound words. • Since baking powder is always referred to as “baking powder” and not “powder used for baking” or in some other odd semantic combination, one can simply create a compound word “Baking-powder” to take care of that problem. Likewise, “hominy grits” is also a semantic unit although, just to be safe, “Grits” might be treated as an equivalent expression. • Thus, Network Topic 23: (Baking-powder, Egg, Hominy-grits, Aunt-Granny’s-pound-cake) → SRD
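Topic 23 can be written out as a small Python sketch. The keyword names and the “grits” equivalence come from the slide; the phrase table, the matching by whole-phrase regular expression, and the function names are assumptions made for illustration only.

```python
import re

# Surface phrases mapped to the keyword they stand for.
# "grits" as an equivalent of Hominy-grits follows the slide's suggestion.
PHRASES = {
    "aunt granny's pound cake": "Aunt-Granny's-pound-cake",
    "baking powder": "Baking-powder",
    "hominy grits": "Hominy-grits",
    "grits": "Hominy-grits",
    "eggs": "Egg",
    "egg": "Egg",
}

# Network Topic 23: all four keywords must be present.
TOPIC_23 = {"Baking-powder", "Egg", "Hominy-grits", "Aunt-Granny's-pound-cake"}

def keywords_in(chunk):
    """Return the set of keywords whose surface phrases appear in the chunk."""
    text = chunk.lower().replace("\u2019", "'")  # normalize curly apostrophes
    found = set()
    for phrase, keyword in PHRASES.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.add(keyword)
    return found

def classify(chunk):
    """Return 'SRD' when every Topic 23 keyword is present, else None."""
    return "SRD" if TOPIC_23 <= keywords_in(chunk) else None

print(classify("Aunt Granny's pound cake uses baking powder, two eggs, and grits."))
print(classify("The cake uses baking powder and eggs."))
```

Treating each compound word as one phrase means “powder used for baking” would not match, which mirrors the slide's argument for defining “Baking-powder” as a single unit.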
Automating Information Detection [Figure labels: Ferret Q Engine; Domain Expert.]
Ferret Analysis Engine • Platform-independent, small-footprint component • Ferret Performance • Speed: up to 15 pages / second • Accurate: • Effectiveness: 90+% important content is detected* (<10% miss rate) • Relevancy: 80+% identified content is relevant* (<20% “false alarms”) • Flexible: Can be deployed in a variety of program types and operating environments for a wide range of purposes • Enables application of relevant human intelligence with the speed and thoroughness of electronic scans * when supported by a mature knowledgebase
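The two accuracy figures correspond to what are commonly called recall (effectiveness) and precision (relevancy). The sketch below shows how they could be computed from review counts; the function names and the example counts are hypothetical and are not measurements from Ferret.

```python
def effectiveness(true_hits, missed):
    """Share of important content that was detected (1 - miss rate)."""
    return true_hits / (true_hits + missed)

def relevancy(true_hits, false_alarms):
    """Share of identified content that is relevant (1 - false-alarm rate)."""
    return true_hits / (true_hits + false_alarms)

# Hypothetical counts, chosen only to illustrate the arithmetic.
true_hits, missed, false_alarms = 90, 10, 20

print(f"effectiveness: {effectiveness(true_hits, missed):.1%}")
print(f"relevancy:     {relevancy(true_hits, false_alarms):.1%}")
```

Note that the two rates use different denominators: the miss rate is measured against all important content, while the false-alarm rate is measured against everything the engine flagged.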
Ferret KnowledgeBase • Defines (and clarifies) enterprise information needs • Developed by subject matter experts • Specific to a “focused” domain of knowledge (limited range of concerns) • Can be imported in whole or part from other forms of community knowledge (KM systems, thesauri, etc) • Rule- and semantic network-based. Requires knowledge engineering investment to develop the knowledgebase • As opposed to neural-net, pattern-matching, or statistical/probability algorithms for categorization and classification • Preserves and diffuses knowledge • Does NOT replace the expert classifier, only makes him/her better
KnowledgeBase Components • Topics (defines classified information) • Triggers (Keywords) • Actions & Metadata • Keywords (field vernacular) • Relationships • Synonyms • Implies • Associations • Patterns • Regular Expressions (in commercial version of Ferret, “Q”) • Formulae (in commercial version of Ferret, “Q”) Forms a semantic network of related terms, a kind of thesaurus. [Figure labels: context; expert.]
Semantic network connects keywords to field vernacular, and hence to occurrences of classified information. Example topics: Location of a specific force level at any given time. -- Secret • Time when any unit designation is engaged. -- Confidential [Figure labels: Enterprise Info Goals / Policies; Subject Matter Expert Knowledge.]
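One way to picture how these components fit together is the sketch below. It is a hypothetical data structure, not Ferret's knowledgebase format: the class, its field names, the sample vocabulary, and the sample topic are all invented for illustration, and “Associations” and “Formulae” are left out.

```python
import re
from dataclasses import dataclass, field

@dataclass
class KnowledgeBase:
    synonyms: dict = field(default_factory=dict)  # vernacular term -> keyword
    implies: dict = field(default_factory=dict)   # keyword -> implied keywords
    patterns: dict = field(default_factory=dict)  # keyword -> regular expression
    topics: dict = field(default_factory=dict)    # topic id -> trigger keywords

    def concepts_in(self, chunk):
        """Collect keywords found via synonyms and patterns, then follow 'implies'."""
        text = chunk.lower()
        found = set()
        for term, keyword in self.synonyms.items():
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                found.add(keyword)
        for keyword, pattern in self.patterns.items():
            if re.search(pattern, text):
                found.add(keyword)
        # Follow "implies" links until no new keyword is added.
        frontier = list(found)
        while frontier:
            for implied in self.implies.get(frontier.pop(), ()):
                if implied not in found:
                    found.add(implied)
                    frontier.append(implied)
        return found

    def relevant_topics(self, chunk):
        """Topics whose trigger keywords are all present in the chunk."""
        found = self.concepts_in(chunk)
        return sorted(t for t, triggers in self.topics.items() if triggers <= found)

# Hypothetical vocabulary and topic, for illustration only.
kb = KnowledgeBase(
    synonyms={"battalion": "unit", "regiment": "unit", "engaged": "engagement"},
    implies={"unit": {"force"}},
    patterns={"time": r"\b\d{4} hours\b"},
    topics={"unit-engagement-time": {"unit", "engagement", "time"}},
)
print(kb.relevant_topics("The battalion was engaged at 0600 hours."))
```

Keeping synonyms, implications, and patterns as separate tables lets a subject matter expert extend the vernacular without touching the topic rules themselves.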
Ferret “Anatomy” A Ferret solution consists of: • End-User Application: user and I/O interface + business rules • Ferret Inference Engine: performs the analysis using SME knowledge • Expert (Context) Knowledge Base: contains SME knowledge + relevant patterns + IOV rules [Figure labels: End-User Application; Service Requests; Info; Ferret Engine; Data; SME Knowledgebase.]
Ferret Uses • Classification • Categorization • Email Check and Classification Marking • Content Monitor • Enhanced Search of Web Sites and Disk Files • Content-Based NTK • Mosaic/Aggregation Determination • Redaction [Figure labels: Classification GUI; Monitoring and Notification Application; Email Filtering Application (SMTP/MIME); Multi-File Analysis Application.]
Observations on Ferret Classification Technology • Ferret Concept: Master classifier for a “community of interest” • Can be used to cost-effectively capture and apply enterprise knowledge/understanding • Can enhance the productivity of classification • Can provide a community-wide standard for classification excellence • Fundamental Assumption: Knowledge base approach is capable of representing, cost-effectively, knowledge/understanding at the Master level • Strategy for Ferret Development: “You and other members of the classification community are developing a new, but very promising, individual who has the potential to become ‘the best’ classifier in the classification community--the ‘Master Classifier.’” The community is creating a truly master classifier.
Summary-- Capabilities and Nature of Ferret • Ferret identifies information of value. • Fast and accurate. • In use at the Y-12 National Security Complex to protect national defense information. • Enhances productivity of ADCs in the review and analysis of documents containing potentially sensitive information. • Simulates the expertise of the SME and classification community. “For Community, by Community” • Effective tool for knowledge preservation. • Applicable to other information protection areas.