This document, authored by Jim Nisbet, Senior Vice President of Technology at Semio Corporation, delves into the intricate relationship between quality, taxonomy, and ontology. It explores definitions of quality, including "best value for the money" and "nominal conformance," while emphasizing the need for adherence to standards such as ISO and ANSI. The text outlines the methodologies for taxonomy generation, quality evaluation, and the importance of precise lexicon reviews. It serves as a vital guide for professionals involved in markup, classification, and management of information architectures.
Quality Taxonomies • Jim Nisbet, Senior Vice President of Technology, Semio Corporation • Knowledge Technologies 2001, March 5th, 2001
Ontology / Taxonomy • Static Discovery • Root Ontology • Taxonomy Generation • Dynamic Discovery
What is Quality? • “Best value for the money” • By this definition, a costly product is expected to deliver high performance, while a low-cost product or service is expected to deliver less. For example, a loose demo delivery is both predictable and acceptable, since its quality is low conformance at low cost.
What is Quality? • “Good Quality is Nominal Conformance” • Taxonomy Quality is defined as Taxonomy Conformance to: • Valid requirements; • Explicitly documented development standards; and • Implicit characteristics that are expected of all professionally developed taxonomies, such as good maintainability.
Standards • ISO 2788-1986 • International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, 1986. (ISO 2788-1986(E)). (Available in the U.S. from American National Standards Institute) • ISO 5964-1985 • International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, 1985. (ISO 5964-1985(E)). (Available in the U.S. from American National Standards Institute) • ANSI/NISO Z39.19-1993 • National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. 69p. (ANSI/NISO Z39.19-1993) • SEMIO Quality Plan v1 2000 • ISO/IEC 13250 Topic Maps • RDF • Please refer to RDF at http://www.w3.org/RDF and XML at http://www.w3.org/XML
Project Plan • Kick-off • Requirements Review • Lexicon Review • Taxonomy Review • Tags Review • Final Review
1. Kick-off • Objectives • Purpose • Scope • Scale • Users • Conditions of receipt • Roles • Supplier • Customer • Admin • KE • Experts • Users • Planning • Training and Transfer
2. Requirements Review • Sources • Lexicon • Ontology • Install
Sources • Dispersion (Multiplicity, Size, Homogeneity) • Refresh • Access
Typical Patterns • Disparity • Adjust sources • Adjust crawl strategy • Isolate communities / taxonomies
Lexicon • Vocabularies, etc. • Substitutions: Acronyms, Synonyms, etc. • Preferred Keywords: Brand Names, etc. • Banned Keywords
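The lexicon requirements on the slide above could be applied to extracted phrases roughly as follows. This is a minimal sketch assuming plain dictionary and set lookups; the sample entries and the function name `normalize_terms` are hypothetical and do not describe Semio's tooling.

```python
# Hypothetical lexicon requirements; every entry below is illustrative only.
SUBSTITUTIONS = {"irs": "internal revenue service", "levy": "tax"}  # acronyms, synonyms
PREFERRED = {"taxsource"}                 # e.g. brand names that should be flagged
BANNED = {"click here", "home page"}      # keywords that must never become terms


def normalize_terms(phrases):
    """Apply substitutions, drop banned keywords, and flag preferred keywords."""
    result = []
    for phrase in phrases:
        key = phrase.strip().lower()
        key = SUBSTITUTIONS.get(key, key)   # replace acronym/synonym if listed
        if key in BANNED:
            continue                        # banned keywords are discarded
        result.append((key, key in PREFERRED))
    return result


print(normalize_terms(["IRS", "Click here", "TaxSource", "levy"]))
```

Keeping the three lists as separate structures mirrors the slide: each can be reviewed and signed off independently during the Requirements Review.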
Typical Patterns • Lack of requirements • Use Librarian Resources
Ontology • Thesaurus? • Is the information domain analysis complete, consistent, and accurate? • Is the partitioning of the problem complete?
Typical Patterns • Directory versus Taxonomy • Isolate “directory” branches • Thesaurus versus Taxonomy • Put an ontology on top of the thesaurus • Check as early as possible that the thesaurus generics match the extracted lexicon • Very high-level design for top-category requirements • Plan to work bottom-up • See also Taxonomy (functions, combinations, etc.)
Install • Implementation / Integration: • Are external and internal interfaces properly defined? • Are all requirements traceable to the system level? • Has prototyping been conducted for the user/customer? • Is performance achievable within the constraints imposed by other system elements? • Are requirements consistent with schedule, resources, and budget?
Typical Patterns • Scale • Security • Missing Documents
3. Lexicon Review • Coverage • Extracted words / Words • (Extracted Index / Index) • Sources bench-marking • Coverage • Extraction quality • Topic distribution • Structure • Most Frequent Phrases • Most Productive Generics • Substitutions • Exceptions
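The "Extracted words / Words" ratio on the slide above can be sketched as a simple set computation. The slide does not say whether the ratio counts distinct words or tokens; this sketch assumes distinct, case-insensitive words, and the function name `lexical_coverage` and the sample data are hypothetical.

```python
def lexical_coverage(extracted_words, corpus_words):
    """Share of distinct corpus words that also appear in the extracted lexicon."""
    corpus = {w.lower() for w in corpus_words}
    if not corpus:
        return 0.0                      # avoid dividing by zero on an empty corpus
    extracted = {w.lower() for w in extracted_words}
    return len(extracted & corpus) / len(corpus)


# Illustrative data: three of the five distinct corpus words were extracted.
corpus = ["tax", "liability", "loan", "the", "of"]
lexicon = ["Tax", "loan", "liability"]
print(lexical_coverage(lexicon, corpus))
```

The same function could be run per source to support the sources bench-marking item, comparing coverage across collections.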
Typical Patterns • Low level of frequency / quality for the most meaningful content • Increase size of value corpus • Filter and re-import lexicon
4. Taxonomy Review • Taxonomy Operation • Correctness • Reliability • Usability • Integrity • Efficiency • Taxonomy Revision • Maintainability • Flexibility • Testability • Taxonomy Transition • Portability • Reusability • Interoperability
Folk Taxonomies Design • The Berlin and Kay model: Taxonomy = Nomenclature + Terminology • Ranks: Unique Beginner, Life Form, Generic, Specific, Varietal • Example: Tax, Liability, Loan, Term loan, Short-term loan
Correctness • Accuracy • Completeness • Consistency
Accuracy • Precision • Recall
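Precision and recall for a single category can be sketched as below, using the standard information-retrieval definitions: precision is the share of assigned documents that are relevant, and recall is the share of relevant documents that were assigned. The function name and the document identifiers are hypothetical; the relevant set is assumed to come from a manual review.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one category.

    assigned: documents the taxonomy placed in the category.
    relevant: documents a reviewer says belong in the category.
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: 4 documents assigned, 3 truly relevant, 2 in common.
p, r = precision_recall({"doc1", "doc2", "doc3", "doc4"}, {"doc2", "doc3", "doc5"})
print(f"precision={p:.2f} recall={r:.2f}")
```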
Completeness • Taxonomy • Maps • Lexicon • Collection
Concentration Works Against Quality • Layers: Tagging, Taxonomy, Maps, Lexicon, Document Collection • Tagging Coverage • Ontology Coverage • Hook Coverage • Map Coverage • Lexical Coverage • Collection Coverage
Consistency: Typical Patterns • Objectivization • Hyperonymy • Speciation • Necessity
Objectivization • Example: Employment (Firing, Hiring, Salaries) • Avoid functional categories • Don’t mix functions / objects • Exhaust scripts • Match idiomatic phrases
Genericity • Example: Parts (Air Conditioning, Belts and Hoses, Body, Brake System, Chassis, Engine, Exhaust System, Fuel System, Glass, Ignition) • Avoid meronymy • Don’t mix meronymy / hyperonymy • Exhaust prototypes
Speciation (WordNet) • Example: Person > Unwelcome person > Unpleasant person > Selfish person > Opportunist > Backscratcher • Avoid “strings” of categories • Avoid (non-idioms) properties for categories
Necessity • Avoid non-productive categories • Avoid combinations of categories
Nomenclature (Design Structure) Quality Index • Depth • Width • Balance
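The three structure measures on the slide above are not defined in the deck, so the sketch below fixes one plausible reading: depth is the longest root-to-leaf path, width is the largest number of categories on any level, and balance is the ratio of the shallowest to the deepest leaf (1.0 means perfectly even). These definitions, the function name `structure_indices`, and the sample tree are assumptions made for illustration.

```python
from collections import deque

# Illustrative category tree, stored as parent -> list of children.
TAXONOMY = {
    "Tax": ["Liability", "Federal Funds"],
    "Liability": ["Loan"],
    "Loan": ["Term loan"],
    "Term loan": ["Short-term loan"],
}


def structure_indices(tree, root):
    """Return (depth, width, balance) for a tree given as parent -> children."""
    level_sizes = {}     # level number -> how many categories sit on it
    leaf_depths = []     # depth of every leaf category
    queue = deque([(root, 0)])
    while queue:         # breadth-first walk; assumes the structure has no cycles
        node, level = queue.popleft()
        level_sizes[level] = level_sizes.get(level, 0) + 1
        children = tree.get(node, [])
        if not children:
            leaf_depths.append(level)
        for child in children:
            queue.append((child, level + 1))
    depth = max(leaf_depths)
    width = max(level_sizes.values())
    balance = min(leaf_depths) / depth if depth else 1.0
    return depth, width, balance


print(structure_indices(TAXONOMY, "Tax"))
```

In this sample the low balance value reflects one long chain next to a leaf directly under the root, which is the kind of "string" of categories the consistency patterns warn about.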
Complexity Index • Cyclomatic complexity increases with the number of cross-references within the taxonomy, giving an indication of complexity and difficulty of testing. • Taxonomy Complexity Index combines: • autonomy • closure • similarity • typicality • commonality • redundancy • stability
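One way to read the cyclomatic complexity bullet above is to treat the taxonomy as a graph and apply McCabe's formula M = E - N + 2P, where E is the number of links, N the number of categories, and P the number of connected components. Applying the formula to a taxonomy in this way is an assumption, and the function name is hypothetical. The combined Taxonomy Complexity Index is not sketched because the slides do not define how its components are measured.

```python
def cyclomatic_complexity(num_categories, hierarchy_links, cross_references, components=1):
    """McCabe's M = E - N + 2P, counting cross-references as extra edges."""
    edges = hierarchy_links + cross_references
    return edges - num_categories + 2 * components


# A single tree of 10 categories has 9 parent-child links.
print(cyclomatic_complexity(10, 9, cross_references=0))
print(cyclomatic_complexity(10, 9, cross_references=3))
```

For a single tree, N categories always have N - 1 parent-child links, so under this reading the measure reduces to the number of cross-references plus one.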
Maturity Index • IEEE Std 982.1-1988 suggests a maturity index that can serve as an indication of the stability of the taxonomy. • Maturity Index combines: • number of modules in the current ontology / taxonomy. • number of modules in the current ontology / taxonomy that have been changed. • number of modules added to the current ontology / taxonomy. • number of modules deleted from the previous version of the ontology / taxonomy.
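The four counts listed above match the inputs of the software maturity index from IEEE Std 982.1-1988, SMI = (MT - (Fa + Fc + Fd)) / MT. The sketch below applies that formula, assuming a "module" corresponds to a category or branch of the taxonomy; that mapping, the function name, and the sample numbers are illustrative assumptions.

```python
def maturity_index(total, changed, added, deleted):
    """SMI = (MT - (Fa + Fc + Fd)) / MT; values near 1.0 indicate a stable release.

    total:   modules in the current version (MT)
    changed: modules in the current version that were changed (Fc)
    added:   modules added in the current version (Fa)
    deleted: modules from the previous version that were deleted (Fd)
    """
    if total <= 0:
        raise ValueError("total must be a positive number of modules")
    return (total - (added + changed + deleted)) / total


# Illustrative release: 200 categories, 10 changed, 6 added, 4 deleted.
print(maturity_index(total=200, changed=10, added=6, deleted=4))
```

Tracking the value across successive taxonomy releases shows whether revisions are settling down as the index approaches 1.0.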
5. Tags Review • Document coverage • Concepts coverage
<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag>
      <tagname>Liability</tagname>
      <weight>1.289</weight>
    </tag>
    <tag>
      <tagname>Federal Funds</tagname>
      <weight>0.746</weight>
    </tag>
  </document>
</tagset>
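Document coverage and concepts coverage could be computed from a tagset like the one on the slide above. The sketch below parses that XML with Python's standard `xml.etree.ElementTree` and assumes coverage means the share of known documents that carry at least one tag and the share of taxonomy concepts used at least once. The function name, the second URL, and the concept list are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tagset taken from the slide; element names follow that sample.
TAGSET = """<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag><tagname>Liability</tagname><weight>1.289</weight></tag>
    <tag><tagname>Federal Funds</tagname><weight>0.746</weight></tag>
  </document>
</tagset>"""


def tag_coverage(tagset_xml, all_urls, all_concepts):
    """Return (document coverage, concept coverage) as fractions between 0 and 1."""
    root = ET.fromstring(tagset_xml)
    tagged_docs, used_concepts = set(), set()
    for doc in root.findall("document"):
        names = [tag.findtext("tagname") for tag in doc.findall("tag")]
        if names:                                   # count only documents with tags
            tagged_docs.add(doc.findtext("docurl"))
            used_concepts.update(names)
    doc_cov = len(tagged_docs & set(all_urls)) / len(all_urls) if all_urls else 0.0
    concept_cov = (
        len(used_concepts & set(all_concepts)) / len(all_concepts) if all_concepts else 0.0
    )
    return doc_cov, concept_cov


# Hypothetical collection and taxonomy used only to exercise the function.
urls = ["http://www.TaxSource.com", "http://example.com/untagged"]
concepts = ["Liability", "Federal Funds", "Loan", "Term loan"]
print(tag_coverage(TAGSET, urls, concepts))
```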
6. Final Review • Receipt • Maintenance
Quality Taxonomies • Jim Nisbet, niz@semio.com • Knowledge Technologies 2001