Indexing Knowledge. Daniel Vasicek 2014 March 27. Introduction. Basic topic is : All Human Knowledge Who Cares? Simple Examples. Basic Ideas. Concepts instead of key words Thesauri instead of key words Recognize Emerging concepts Classification
Indexing Knowledge Daniel Vasicek 2014 March 27
Introduction • Basic topic: All Human Knowledge • Who Cares? • Simple Examples
Basic Ideas • Concepts instead of keywords • Thesauri instead of keywords • Recognize emerging concepts • Classification • Facilitate communication between environments (data translation) • Metadata for publications (XML, SQL, TXT) • Indexing information
Topics to Cover • Programming language constructs needed: what functionality do we need? • What do people pay Access Innovations to do? • Typical programming problems that I encounter
Input Data • Formats: XML-tagged metadata for publications; SQL database; raw text; pictures of text • Quantities • AIP: 304,910 authors as XML files; 807,005 XML files containing title, abstract, and metadata • NICEM (National Information Center for Educational Media): 503,534 XML files describing available educational media; 26,144 XML files describing suppliers of educational media
Programming Languages Used • Visual Basic (1990s) • C++ • Java (currently)
Who Cares? • AIP – American Institute of Physics (17 journals + conference proceedings) • IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …) • SPIE – International Society for Optics and Photonics • ACM – Association for Computing Machinery • Wolters Kluwer • PubMed
More Clients • Parliament of Victoria (5000 articles per day) • JSTOR (~10 million documents, some journals back to 1665) • PLOS (quick path to electronic publication) • DuPont • Dow • Council of Europe • Triumph Learning • ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …
Useful Tools • Controlled Vocabulary: an organizational tool for capturing concepts • Proximity: a tool for capturing context • Hash Table (Content Addressable Array): convenience, uniqueness, fast access • Regular Expressions
What’s a taxonomy? • Knowledge organization system • Words • Controlled vocabulary for a subject area • Descriptive labels • Hierarchy • Simple hierarchical view of a thesaurus • Storage and retrieval aid
Thesaurus Elements • Hierarchy • Broader and Narrower concepts • Multiply connected “treelike” structure • Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts • Subject specific?
Structure of Controlled Vocabularies • [Figure: Flat List → Synonym Ring → Taxonomy → Thesaurus → Ontology, arranged in order of INCREASING MEANING and CONTROL. Labels accumulated along the scale: Ambiguity, Synonym, Hierarchy, Relationships, Additional Types of Relationships.] • After ANSI/NISO Z39.19-2005, Figure 5
Thesaurus Node (Term) • Science: Broader Term • Biology: Narrower Term • Science of Life: Synonym
Thesaurus Implementation • Terms (Concepts, Preferred Terms) • Broader Terms • Narrower Terms • Related Terms • Other Concepts • Synonyms • History • Responsibility • Backup • Rules to help identify the concept in text • Methods for maintaining the thesaurus
Thesaurus Text Representation
<TermInfo>
  <T>Biology</T>
  <BT>Science</BT>
  <UF>Science of Life</UF>
</TermInfo>
<TermInfo>
  <T>Science</T>
  <NT>Biology</NT>
</TermInfo>
<TermInfo>
  <T>Science of Life</T>
</TermInfo>
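A minimal sketch of how the TermInfo records above might be read into memory. It assumes the records are wrapped in a single root element (the `<Thesaurus>` wrapper and the `load_terms` helper are hypothetical, not part of the slides) and uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

# The three TermInfo records from the slide, wrapped in a hypothetical
# <Thesaurus> root so that the fragment is well-formed XML.
RECORDS = """<Thesaurus>
<TermInfo><T>Biology</T><BT>Science</BT><UF>Science of Life</UF></TermInfo>
<TermInfo><T>Science</T><NT>Biology</NT></TermInfo>
<TermInfo><T>Science of Life</T></TermInfo>
</Thesaurus>"""


def load_terms(xml_text):
    """Return {term: {"BT": [...], "NT": [...], "UF": [...]}}."""
    terms = {}
    for info in ET.fromstring(xml_text).iter("TermInfo"):
        name = info.findtext("T")
        # A term may carry several links of each kind, so keep lists.
        terms[name] = {
            tag: [element.text for element in info.findall(tag)]
            for tag in ("BT", "NT", "UF")
        }
    return terms


terms = load_terms(RECORDS)
print(terms["Biology"])
```

Keeping each relationship as a list leaves room for the "multiply connected" structure mentioned earlier, where a term can have more than one broader term.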
Thesaurus Problems • Missing Terms - pointer links to a term that is not present • Broken loops • Narrower term without matching broader term • Broader term without matching narrower term • Related term without a matching return relationship
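Several of the problems listed above can be detected mechanically. The following is a sketch only: the dictionary layout, the `check_thesaurus` name, and the extra terms "Physics" and "Chemistry" are assumptions made for illustration.

```python
def check_thesaurus(terms):
    """terms: {term: {"BT": [...], "NT": [...], "RT": [...]}}.

    Returns a list of human-readable problem descriptions.
    """
    problems = []
    # Each relationship and the relationship expected to point back.
    reciprocal = {"BT": "NT", "NT": "BT", "RT": "RT"}
    for term, links in terms.items():
        for rel, back in reciprocal.items():
            for target in links.get(rel, []):
                if target not in terms:
                    # Link points to a term that is not present.
                    problems.append(f"missing term: {term} {rel} {target}")
                elif term not in terms[target].get(back, []):
                    # Link exists but the return relationship does not.
                    problems.append(f"no matching {back}: {term} {rel} {target}")
    return problems


# Hypothetical sample data with two deliberate defects.
sample = {
    "Science": {"NT": ["Biology", "Physics"]},       # "Physics" is not defined
    "Biology": {"BT": ["Science"], "RT": ["Chemistry"]},
    "Chemistry": {},                                  # no RT back to "Biology"
}
problems = check_thesaurus(sample)
for line in problems:
    print(line)
```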
Proximity of Words • Adjacent • Before • After • Same sentence • Same Paragraph • Within 50 words • Phrases (n-Grams)
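A rough sketch of two of the proximity tests above: "within N words" and n-grams. The tokenizer is deliberately naive and the function names are hypothetical; adjacency is the special case `max_gap=1`.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def within(text, a, b, max_gap=50):
    """True if the single words a and b occur within max_gap words of each other."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def ngrams(text, n):
    """All runs of n consecutive words (sentence boundaries are ignored here)."""
    words = tokenize(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "Biology is the science of life. Physics is another science."
print(within(text, "biology", "science", max_gap=3))
print(within(text, "biology", "physics", max_gap=3))
print(ngrams(text, 3))
```

A sentence- or paragraph-level test would need the text split on those boundaries first, which this sketch does not attempt.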
Content Addressable Array
T["Science"] = 1;
T["Biology"] = 1;
T["Science of Life"] = 1;
BT["Biology"] = "Science";
NT["Science"] = "Biology";
UF["Science of Life"] = "Biology";
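The same assignments can be written with Python dictionaries, which are hash tables. This is a sketch under the slide's simplification that each term has one broader term; the helper names `preferred` and `broader_chain` are hypothetical.

```python
# Known terms: a set gives uniqueness and fast membership tests.
T = {"Science", "Biology", "Science of Life"}

# Single-valued links, exactly as on the slide.
BT = {"Biology": "Science"}            # term -> broader term
NT = {"Science": "Biology"}            # term -> narrower term
UF = {"Science of Life": "Biology"}    # synonym -> preferred term


def preferred(term):
    """Resolve a synonym to its preferred term; other terms pass through."""
    return UF.get(term, term)


def broader_chain(term):
    """Follow broader-term links upward from a term or one of its synonyms."""
    chain = []
    term = preferred(term)
    # The second condition stops the walk if the links form a loop.
    while term in BT and BT[term] not in chain:
        term = BT[term]
        chain.append(term)
    return chain


print(preferred("Science of Life"))
print(broader_chain("Science of Life"))
```

A production thesaurus with multiple broader or narrower terms per node would map each key to a list or set instead of a single string.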
Regular Expressions • /^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/ : Email addresses? • / [A-Z][a-z]* / : Capitalized words • /[A-Z][a-zA-Z0-9,\"\- ]*\. / : Sentence? • Paragraph?
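The three patterns above can be tried with Python's `re` module. The sketch below copies the patterns from the slide; the sample strings are made up for illustration, and the results show why the slide puts question marks after "Email addresses" and "Sentence".

```python
import re

# Patterns copied from the slide, without the surrounding slashes.
EMAIL = r'^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$'
CAPITALIZED = r' [A-Z][a-z]* '
SENTENCE = r'[A-Z][a-zA-Z0-9,\"\- ]*\. '

# The final label is limited to 2-4 letters, so longer top-level
# domains are rejected even though they are valid.
ok = re.match(EMAIL, "jane.doe@example.org") is not None
long_tld = re.match(EMAIL, "user@example.museum") is not None

# A leading space is required, so the first word of the text is missed.
caps = re.findall(CAPITALIZED, "The study of Biology and Physics is old.")

# A period followed by a space ends a match, so abbreviations split sentences.
sentences = re.findall(SENTENCE, "Biology is a science. Dr. Smith agrees. ")

print(ok, long_tld)
print(caps)
print(sentences)
```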