Knowledge Engineering for Document Analysis and Classification

Al Klein, BWXT Y-12, LLC; Charles Wilson, AreteQ, Inc. October 28, 2004.


Presentation Transcript


  1. Knowledge Engineering for Document Analysis and Classification • Al Klein, BWXT Y-12, LLC • Charles Wilson, AreteQ, Inc. • October 28, 2004

  2. Outline for Presentation • Introduction and Background • Capabilities and use of the Ferret expert system for classification support • Introduce the basic knowledge engineering techniques used in developing the Ferret knowledgebase • Describe and demonstrate the added capabilities and enhancements in the commercial version of Ferret (“Q”)

  3. Y-12 Classification Support Background • The Oak Ridge Y-12 National Security Complex (NSC) is a DOE nuclear weapons design and production facility. • Because nuclear weapons design information comprises some of the country’s most vital secrets, Y-12 has been in the information protection business for a long time. • We developed a rule-based inference engine, Ferret, to assist the Authorized Derivative Classifiers (ADCs) in analyzing and classifying documents.

  4. Ferret Background • Ferret development at Y-12 began about 4 years ago as an offshoot of DOE’s automated expert classification project, Reviewer Assistant System (RAS) • Developed as a software assist for Authorized Derivative Classifiers at Y-12 • The Y-12 classification guide knowledge base is being used by ADCs • It comprises about 4500 concepts and over 1000 rules • Ferret development has been funded by the Y-12 National Security Complex, through support from DOE and NNSA • Ferret technology is patented and is being commercialized by the commercial licensee, AreteQ, Inc. • Refinements and enhancements to the Ferret code and functionality are being made in the commercial version, called “Q”, to improve speed and performance

  5. “Ferret” (Y-12 Classification Application) • Expert system to find classified information within the U.S. Nuclear Weapons Complex • Identifies occurrences (“hits”) of information of value • Finds relevant data excerpts (information) within long streams • Mimics what a human analyst would surmise in a given context • Reviews plain text, no formatting required • Increases reviewer productivity and review accuracy [Diagram label: Information of Value]

  6. Classification System Comparisons • Ferret is a rule-based expert system • Other classification software uses neural nets, statistical analysis, or algorithm-driven pattern matching for categorization or classification • Rule-based systems require a larger upfront investment but deliver greater accuracy within a specific knowledge domain • A simple example shows the distinction between a keyword (“dirty word”) search and Ferret

  7. Consider the following dialogue: “It was a dark street at the corner of Bush and Stockton streets. It was a long street, a dark street, a mysterious street, a deserted street. I noticed that the street lights were growing dimmer as I approached the end of the street and casting dark shadows against the brick walls. When I got to the end of the street I was attacked by a mugger.”

  8. Role of Classification [Diagram labels: Information (Expert); Policy; Classification; topic (three instances); Keys A, B, C; “Treasure”]

  9. Classification and Ferret • Classifiers identify the presence of sensitive information (information of value to an adversary, IOV) that can readily arise in “Enterprise” information • Information of value is characterized by classification topics, essentially all of which can be cast in the form: “ID# If (A, B, C, ...) then classified” • A, B, C are Trigger Concepts • (A, B, C, ...) prescribes a logical relationship between the concepts • If (A, B, C, ...) is present in a “chunk” of text, then the topic with ID# is relevant • Much of the art of classification is knowing when sensitive collections of these trigger concepts (or their indicators) are present in a chunk of text • Ferret is an automated classifier that simulates the capabilities of a classification expert in identifying instances of IOV • Ferret provides fast and accurate classification by applying live expert knowledge with the speed and thoroughness of an electronic scan • Ferret has been used most extensively for guidance covering nuclear weapons design information, but it has applications in several other areas
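
The “ID# If (A, B, C, ...) then classified” form can be illustrated with a minimal Python sketch. This is not Ferret's implementation: the topic IDs, the trigger words, the blank-line chunking, and the plain AND relationship between triggers are all assumptions made for illustration.

```python
import re

# Hypothetical topic rules: each topic ID maps to the trigger concepts that
# must all appear in the same chunk of text (a simple AND relationship).
TOPICS = {
    "T-1": {"alpha", "bravo", "charlie"},
    "T-2": {"alpha", "delta"},
}

def chunks(text):
    """Split text into paragraph-sized chunks on blank lines (an assumed chunking rule)."""
    return [c.strip() for c in re.split(r"\n\s*\n", text) if c.strip()]

def concepts_in(chunk):
    """Return the set of lowercase words found in a chunk."""
    return set(re.findall(r"[a-z0-9-]+", chunk.lower()))

def relevant_topics(text):
    """Return (chunk_index, topic_id) for every topic whose triggers all co-occur in a chunk."""
    hits = []
    for i, chunk in enumerate(chunks(text)):
        present = concepts_in(chunk)
        for topic_id, triggers in TOPICS.items():
            if triggers <= present:  # every trigger concept is present
                hits.append((i, topic_id))
    return hits
```

A keyword search would flag every chunk containing “alpha”; the rule form flags a chunk only when the whole collection of trigger concepts occurs together.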

  10. Aunt Granny’s Pound Cake Topic 23: The fact that baking powder, eggs, and hominy grits are used in Aunt Granny’s pound cake manufacture is SRD. • Keywords are Baking-powder, Egg(s), Hominy-grits, and Aunt-granny’s-pound-cake. • First of all, there is only one single-word concept, “Egg(s)”. • “Baking-powder,” “Hominy-grits,” and “Aunt-granny’s-pound-cake” are all compound words. • Since baking powder is always referred to as “baking powder” and not “powder used for baking” or in some other odd semantic combination, one can simply create a compound word “Baking-powder” to take care of that problem. Likewise “Hominy-grits” is also a semantic unit although, just to be safe, “Grits” might be an equivalent expression. • Thus, Network Topic 23: (Baking-powder, Egg, Hominy-grits, Aunt-granny’s-pound-cake) → SRD
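
The compound-word idea in Topic 23 can be sketched in Python as follows. The phrase table, the function names, and the whole-text matching are assumptions for illustration only; the slide does not describe how Ferret normalizes compound words internally.

```python
import re

# Assumed mapping from surface phrases to canonical concept tokens.
# "grits" is treated as an equivalent expression for "hominy grits".
PHRASES = {
    "aunt granny's pound cake": "Aunt-granny's-pound-cake",
    "baking powder": "Baking-powder",
    "hominy grits": "Hominy-grits",
    "grits": "Hominy-grits",
    "eggs": "Egg",
    "egg": "Egg",
}

# Topic 23 from the slide: all four concepts together mark the text as SRD.
TOPIC_23 = {"Baking-powder", "Egg", "Hominy-grits", "Aunt-granny's-pound-cake"}

def concepts(text):
    """Return the canonical concepts whose phrases occur as whole words in text."""
    normalized = text.lower().replace("\u2019", "'")  # fold curly apostrophes
    found = set()
    for phrase, concept in PHRASES.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", normalized):
            found.add(concept)
    return found

def classify(text):
    """Return "SRD" when every Topic 23 concept is present, otherwise None."""
    return "SRD" if TOPIC_23 <= concepts(text) else None
```

For example, `classify("Aunt Granny's pound cake uses baking powder, two eggs, and grits.")` returns `"SRD"`, while a sentence mentioning only eggs and baking powder returns `None`.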

  11. Automating Information Detection [Diagram labels: Ferret Q Engine; Domain Expert]

  12. Ferret Analysis Engine • Platform-independent, small-footprint component • Ferret Performance • Speed: up to 15 pages / second • Accuracy: • Effectiveness: 90+% of important content is detected* (<10% miss rate) • Relevancy: 80+% of identified content is relevant* (<20% “false alarms”) • Flexible: Can be deployed in a variety of program types and operating environments for a wide range of purposes • Enables application of relevant human intelligence with the speed and thoroughness of electronic scans (* when supported by a mature knowledgebase)
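
The two accuracy figures read like what information retrieval calls recall (effectiveness) and precision (relevancy); that mapping is an interpretation of the slide, not something it states. A small Python sketch of the arithmetic, with review counts that are invented purely for illustration:

```python
def effectiveness(detected_important, total_important):
    """Share of the important content that was detected (recall)."""
    return detected_important / total_important

def relevancy(relevant_hits, total_hits):
    """Share of the identified content that is actually relevant (precision)."""
    return relevant_hits / total_hits

# Hypothetical review: 100 important passages exist, the engine reports
# 115 hits, and 92 of those hits are important passages.
eff = effectiveness(92, 100)   # miss rate is 1 - eff
rel = relevancy(92, 115)       # "false alarm" rate is 1 - rel
```

With these invented counts, the miss rate is 8% and the false-alarm rate is 20%, which sits at the edge of the slide's stated thresholds.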

  13. Ferret KnowledgeBase • Defines (and clarifies) enterprise information needs • Developed by subject matter experts • Specific to a “focused” domain of knowledge (limited range of concerns) • Can be imported in whole or in part from other forms of community knowledge (KM systems, thesauri, etc.) • Rule- and semantic-network-based; requires a knowledge engineering investment to develop the knowledgebase • This contrasts with neural-net, pattern-matching, or statistical/probability algorithms for categorization and classification • Preserves and diffuses knowledge • Does NOT replace the expert classifier, only makes him/her better

  14. KnowledgeBase Components • Topics (define classified information) • Triggers (Keywords) • Actions & Metadata • Keywords (field vernacular) • Relationships • Synonyms • Implies • Associations • Patterns • Regular Expressions (in commercial version of Ferret, “Q”) • Formulae (in commercial version of Ferret, “Q”) • Forms a semantic network of related terms, a kind of thesaurus [Diagram labels: context; expert]
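
One way to picture the “Synonyms” and “Implies” relationships is as edges in a small graph that is walked from a term found in text to the concepts it stands for. The Python sketch below is a guess at the general idea, not Ferret's data model; every term and relationship in it is invented for illustration.

```python
from collections import deque

# Hypothetical relationships, invented for illustration.
SYNONYMS = {"spud": "potato", "tater": "potato"}          # variant -> canonical term
IMPLIES = {"french-fries": ["potato", "oil"],             # term -> concepts it implies
           "potato": ["vegetable"]}

def expand(term):
    """Return every concept reachable from term via synonym and implies links."""
    start = SYNONYMS.get(term, term)      # resolve the term to its canonical form
    seen = {start}
    queue = deque([start])
    while queue:                          # breadth-first walk over implies edges
        current = queue.popleft()
        for nxt in IMPLIES.get(current, []):
            nxt = SYNONYMS.get(nxt, nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Here `expand("spud")` yields `{"potato", "vegetable"}`, so a topic triggered by “vegetable” could fire on text that only ever says “spud”.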

  15. Semantic network connects keywords to field vernacular, and hence to occurrences of classified information. Example topics: • Location of a specific force level at any given time: Secret • Time when any unit designation is engaged: Confidential [Diagram labels: Enterprise Info Goals / Policies; Subject Matter Expert Knowledge]

  16. Ferret “Anatomy” A Ferret solution consists of: • End-User Application: user and I/O interface + business rules • Ferret Inference Engine: performs the analysis using SME knowledge • Expert (Context) Knowledge Base: contains SME knowledge + relevant patterns + IOV rules [Diagram labels: End-User Application; Service Requests; Info; Ferret Engine; Data; SME Knowledgebase]

  17. Ferret Uses • Classification • Categorization • Email Check and Classification Marking • Content Monitor • Enhanced Search of Web Sites and Disk Files • Content-Based NTK • Mosaic/Aggregation Determination • Redaction [Diagram labels: Classification GUI; Monitoring and Notification Application; Email Filtering Application (SMTP/MIME); Multi-File Analysis Application]

  18. Observations on Ferret Classification Technology • Ferret Concept: Master classifier for a “community of interest” • Can be used to cost-effectively capture and apply enterprise knowledge/understanding • Can enhance the productivity of classification • Can provide a community-wide standard for classification excellence • Fundamental Assumption: The knowledge base approach is capable of representing, cost-effectively, knowledge/understanding at the Master level • Strategy for Ferret Development: “You and other members of the classification community are developing a new, but very promising, individual who has the potential to become ‘the best’ classifier in the classification community: the ‘Master Classifier.’” The community is creating a true master classifier.

  19. Summary: Capabilities and Nature of Ferret • Ferret identifies information of value. • Fast and accurate. • In use at the Y-12 National Security Complex to protect national defense information. • Enhances the productivity of ADCs in the review and analysis of documents containing potentially sensitive information. • Simulates the expertise of the SME and classification community: “For Community, by Community.” • Effective tool for knowledge preservation. • Applicable to other information protection areas.
