Quality Taxonomies. Jim Nisbet, Senior Vice President of Technology, Semio Corporation. Knowledge Technologies 2001, March 5th, 2001.
Ontology / Taxonomy • Static Discovery • Root Ontology • Taxonomy Generation • Dynamic Discovery
What is Quality? • “Best value for the money” • Under this definition, a costly product is expected to deliver high performance; likewise, a low-cost product or service is expected to deliver less. For example, a rough demo delivery is both predictable and acceptable, since its quality profile is low conformance at low cost.
What is Quality? • “Good Quality is Nominal Conformance” • Taxonomy quality is defined as the taxonomy's conformance to: • Valid requirements • Explicitly documented development standards • Implicit characteristics expected of all professionally developed taxonomies, such as good maintainability
Standards • ISO 2788-1986 • International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, 1986. (ISO 2788-1986(E)). (Available in the U.S. from American National Standards Institute) • ISO 5964-1985 • International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, 1985. (ISO 5964-1985(E)). (Available in the U.S. from American National Standards Institute) • ANSI/NISO Z39.19-1993 • National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. 69p. (ANSI/NISO Z39.19-1993) • SEMIO Quality Plan v1 2000 • ISO/IEC 13250 Topic Maps • RDF • Please refer to RDF at http://www.w3.org/RDF and XML at http://www.w3.org/XML
Project Plan • Kick-off • Requirements Review • Lexicon Review • Taxonomy Review • Tags Review • Final Review
1. Kick-off • Objectives • Purpose • Scope • Scale • Users • Conditions of receipt • Roles • Supplier • Customer • Admin • KE • Experts • Users • Planning • Training and Transfer
2. Requirements Review • Sources • Lexicon • Ontology • Install
Sources • Dispersion (Multiplicity, Size, Homogeneity) • Refresh • Access
Typical Patterns • Disparity • Adjust sources • Adjust crawl strategy • Isolate communities / taxonomies
Lexicon • Vocabularies, etc. • Substitutions: Acronyms, Synonyms, etc. • Preferred Keywords: Brand Names, etc. • Banned Keywords
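The lexicon requirements above can be sketched as a small normalization step. This is only an illustration under stated assumptions: the rule tables, their sample entries, and the function name `normalize` are hypothetical and do not describe Semio's actual lexicon format.

```python
# Hypothetical lexicon rules; every entry below is illustrative only.
SUBSTITUTIONS = {"irs": "internal revenue service", "fed funds": "federal funds"}
PREFERRED = {"federal funds"}
BANNED = {"click here", "home page"}


def normalize(phrases):
    """Apply substitutions, drop banned keywords, list preferred keywords first."""
    seen = []
    for phrase in phrases:
        key = " ".join(phrase.lower().split())  # case-fold, collapse whitespace
        key = SUBSTITUTIONS.get(key, key)        # acronym / synonym substitution
        if key in BANNED or key in seen:
            continue                             # skip banned keywords and duplicates
        seen.append(key)
    # Stable sort: preferred keywords move to the front, order otherwise kept.
    return sorted(seen, key=lambda k: k not in PREFERRED)


result = normalize(["IRS", "Click here", "Fed  Funds", "Term loan", "irs"])
print(result)
```

Keeping the three rule sets as separate tables mirrors the slide: each can be reviewed with the customer on its own.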
Typical Patterns • Lack of requirements • Use Librarian Resources
Ontology • Thesaurus? • Is the information domain analysis complete, consistent, and accurate? • Is the partitioning of the problem complete?
Typical Patterns • Directory versus Taxonomy • Isolate “directory” branches • Thesaurus versus Taxonomy • Put an ontology on top of the thesaurus • Check as early as possible that the thesaurus generics match the extracted lexicon • Very high-level design for top-category requirements • Plan to work bottom-up • See also Taxonomy (functions, combinations, etc.)
Install • Implementation / Integration: • Are external and internal interfaces properly defined? • Are all requirements traceable to the system level? • Has prototyping been conducted for the user/customer? • Is performance achievable within the constraints imposed by other system elements? • Are requirements consistent with schedule, resources, and budget?
Typical Patterns • Scale • Security • Missing Documents
3. Lexicon Review • Coverage • Extracted words / Words • (Extracted Index / Index) • Source benchmarking • Coverage • Extraction quality • Topic distribution • Structure • Most Frequent Phrases • Most Productive Generics • Substitutions • Exceptions
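A minimal sketch of two of the measures listed above: the "Extracted words / Words" coverage ratio and a most-frequent-phrases count. The function names and the sample data are assumptions made for illustration; the slides do not define how the ratio is tokenized or counted.

```python
from collections import Counter


def lexical_coverage(document_words, lexicon):
    """Share of word occurrences in the documents that the lexicon covers."""
    words = [w.lower() for w in document_words]
    if not words:
        return 0.0
    covered = sum(1 for w in words if w in lexicon)
    return covered / len(words)


def most_frequent(phrases, n):
    """Return the n most frequent extracted phrases with their counts."""
    return Counter(phrases).most_common(n)


# Hypothetical sample data.
words = "tax liability on a short-term loan".split()
lexicon = {"tax", "liability", "loan"}
coverage = lexical_coverage(words, lexicon)

phrases = ["term loan", "tax liability", "term loan",
           "federal funds", "tax liability", "term loan"]
top = most_frequent(phrases, 2)
print(coverage, top)
```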
Typical Patterns • Low level of frequency / quality for the most meaningful content • Increase size of value corpus • Filter and re-import lexicon
4. Taxonomy Review • Taxonomy Operation • Correctness • Reliability • Usability • Integrity • Efficiency • Taxonomy Revision • Maintainability • Flexibility • Testability • Taxonomy Transition • Portability • Reusability • Interoperability
Folk Taxonomies Design • The Berlin and Kay model: Taxonomy = Nomenclature + Terminology • Levels: Unique Beginner > Life Form > Generic > Specific > Varietal • Example: Tax > Liability > Loan > Term loan > Short-term loan
Correctness • Accuracy • Completeness • Consistency
Accuracy • Precision • Recall
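Precision and recall for one category can be sketched as follows, assuming a reviewer supplies the set of documents that truly belong to the category. The function name and the document identifiers are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Precision and recall of the documents assigned to one category.

    assigned: documents the taxonomy placed in the category.
    relevant: documents a reviewer judged to belong there.
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical review sample for a single category.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(p, r)
```

Here two of the four assigned documents are correct (precision 0.5) and two of the three relevant documents were found (recall 2/3).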
Completeness • Taxonomy • Maps • Lexicon • Collection
Concentration Works Against Quality • Layers: Tagging, Taxonomy, Maps, Lexicon, Document Collection • Tagging Coverage • Ontology Coverage • Hook Coverage • Map Coverage • Lexical Coverage • Collection Coverage
Consistency: Typical Patterns • Objectivization • Hyperonymy • Speciation • Necessity
Objectivization • Example: Employment (Firing, Hiring, Salaries) • Avoid functional categories • Don’t mix functions / objects • Exhaust scripts • Match idiomatic phrases
Genericity • Example: Parts (Air Conditioning, Belts and Hoses, Body, Brake System, Chassis, Engine, Exhaust System, Fuel System, Glass, Ignition) • Avoid meronymy • Don’t mix meronymy / hyperonymy • Exhaust prototypes
Speciation (WordNet) • Example: Person > Unwelcome person > Unpleasant person > Selfish person > Opportunist > Backscratcher • Avoid “strings” of categories • Avoid (non-idiom) properties for categories
Necessity • Avoid non-productive categories • Avoid combinations of categories
Nomenclature (Design Structure) Quality Index • Depth • Width • Balance
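One way to compute the three structural measures, sketched on a taxonomy stored as nested dictionaries. The slides do not define the measures, so the definitions here are assumptions: depth is the deepest leaf level, width is the largest number of categories on any level, and balance is the ratio of the shallowest to the deepest leaf. The sample tree is hypothetical.

```python
# Hypothetical taxonomy: each key is a category, each value its subcategories.
taxonomy = {
    "Tax": {
        "Liability": {"Loan": {"Term loan": {"Short-term loan": {}}}},
        "Federal Funds": {},
    }
}


def leaf_depths(tree, level=1):
    """Yield the depth of every leaf category (top level = 1)."""
    for children in tree.values():
        if children:
            yield from leaf_depths(children, level + 1)
        else:
            yield level


def level_widths(tree):
    """Return the number of categories on each level, top level first."""
    widths = []
    level = [tree]  # dictionaries whose keys form the current level
    while True:
        count = sum(len(t) for t in level)
        if count == 0:
            return widths
        widths.append(count)
        level = [child for t in level for child in t.values()]


depths = list(leaf_depths(taxonomy))
depth = max(depths)                  # deepest branch
width = max(level_widths(taxonomy))  # widest level
balance = min(depths) / max(depths)  # 1.0 means all leaves sit at the same depth
print(depth, width, balance)
```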
Complexity Index • Cyclomatic complexity increases with the number of cross references within the taxonomy, giving an indication of its complexity and of the difficulty of testing it. • Taxonomy Complexity Index combines: • autonomy • closure • similarity • typicality • commonality • redundancy • stability
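The cross-reference effect can be illustrated with McCabe's formula V(G) = E - N + 2P, treating categories as nodes and both parent-child links and cross references as edges. Applying the formula to a taxonomy graph in this way is an assumption made for this sketch, as are the function name and the sample links; the graph is assumed to be connected (P = 1).

```python
def cyclomatic_complexity(parent_links, cross_references):
    """V(G) = E - N + 2P for a connected taxonomy graph (P = 1 assumed)."""
    edges = list(parent_links) + list(cross_references)
    nodes = {name for edge in edges for name in edge}
    return len(edges) - len(nodes) + 2


# Hypothetical sample: four parent-child links and one cross reference.
parent_links = [
    ("Tax", "Liability"),
    ("Tax", "Federal Funds"),
    ("Liability", "Loan"),
    ("Loan", "Term loan"),
]
cross_references = [("Federal Funds", "Loan")]

plain_tree = cyclomatic_complexity(parent_links, [])
with_xref = cyclomatic_complexity(parent_links, cross_references)
print(plain_tree, with_xref)
```

A pure tree has N - 1 edges and therefore scores 1; under these assumptions each cross reference between existing categories adds exactly 1.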
Maturity Index • The IEEE standard 982.1-1988 suggests a maturity index, used here to indicate the stability of the taxonomy. • Maturity Index combines: • number of modules in the current ontology / taxonomy • number of modules in the current ontology / taxonomy that have been changed • number of modules added to the current ontology / taxonomy • number of modules deleted from the previous version of the ontology / taxonomy
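The four counts above combine in the usual form of the IEEE 982.1 software maturity index, SMI = (M_T - (F_a + F_c + F_d)) / M_T. A minimal sketch follows; reading "modules" as taxonomy categories or branches is an assumption, and the release figures are hypothetical.

```python
def maturity_index(total, changed, added, deleted):
    """Maturity index (M_T - (F_a + F_c + F_d)) / M_T.

    total:   modules in the current ontology / taxonomy (M_T)
    changed: current modules that have been changed (F_c)
    added:   modules added to the current version (F_a)
    deleted: modules deleted from the previous version (F_d)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - (added + changed + deleted)) / total


# Hypothetical release: 200 categories, 10 changed, 6 added, 4 deleted.
smi = maturity_index(total=200, changed=10, added=6, deleted=4)
print(smi)
```

The index approaches 1.0 as the taxonomy stabilizes between releases.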
5. Tags Review • Document coverage • Concepts coverage
<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag>
      <tagname>Liability</tagname>
      <weight>1.289</weight>
    </tag>
    <tag>
      <tagname>Federal Funds</tagname>
      <weight>0.746</weight>
    </tag>
  </document>
</tagset>
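A sketch of how document coverage and concept coverage might be computed from a tagset like the one above, using Python's standard `xml.etree.ElementTree`. The coverage definitions (share of documents with at least one tag, share of taxonomy concepts used as a tag), the function names, the second document URL, and the concept list are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TAGSET = """<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag><tagname>Liability</tagname><weight>1.289</weight></tag>
    <tag><tagname>Federal Funds</tagname><weight>0.746</weight></tag>
  </document>
</tagset>"""


def read_tags(xml_text):
    """Map each document URL to a {tag name: weight} dictionary."""
    root = ET.fromstring(xml_text)
    result = {}
    for doc in root.findall("document"):
        url = doc.findtext("docurl")
        result[url] = {
            tag.findtext("tagname"): float(tag.findtext("weight"))
            for tag in doc.findall("tag")
        }
    return result


def coverage(tags_by_doc, all_docs, all_concepts):
    """Return (document coverage, concept coverage) as fractions."""
    tagged = {url for url, tags in tags_by_doc.items() if tags}
    used = {name for tags in tags_by_doc.values() for name in tags}
    doc_cov = len(tagged & set(all_docs)) / len(all_docs)
    concept_cov = len(used & set(all_concepts)) / len(all_concepts)
    return doc_cov, concept_cov


tags = read_tags(TAGSET)
# Hypothetical collection and concept list for the example.
docs = ["http://www.TaxSource.com", "http://example.com/untagged"]
concepts = ["Liability", "Federal Funds", "Loan", "Term loan"]
doc_cov, concept_cov = coverage(tags, docs, concepts)
print(doc_cov, concept_cov)
```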
6. Final Review • Receipt • Maintenance
Quality Taxonomies Jim Nisbet niz@semio.com Knowledge Technologies 2001