tOKo: from TOKens to Ontologies
Anjo Anjewierden
Human-Computer Studies laboratory, University of Amsterdam
http://staff.science.uva.nl/~anjo
http://anjo.blogs.com/metis/
Overview
tOKo for end-users (this presentation)
• Help intelligent users develop ontologies from documents
• Approach is to offer useful functionality (possibly smart) that applies to all kinds of documents
• Demonstration: imagine you are an end-user who is given the task to develop an ontology for the cooking domain
tOKo for researchers (second presentation)
• Accessing tOKo using HTTP
• Infrastructure
• Information extraction and ontology-based search
Infrastructure (1)
• Dictionaries (English, Dutch, German)
  • Used for word classes, inflections and spelling
• Document representation (= corpus)
  • Low-level representation, highly indexed, fast access
• Prolog primitives to access the corpus
  • corpus_pattern([word(Word), integer(Int)], Doc, From, To)
  • Searches for a Word immediately followed by an Int. For example, “room is A306” unifies Word with “A” and Int with “306”. Doc, From and To are unified with the document and the position of the match within that document.
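To make the behaviour of a primitive like corpus_pattern concrete, here is a minimal Python sketch of the same idea: slide a window over a tokenised document and report where a sequence of token tests succeeds. It is not tOKo's implementation. The tokeniser, the dictionary-based corpus and the function names are assumptions made for illustration, chosen so that “A306” splits into “A” and “306” as in the example above.

```python
import re

def tokenize(text):
    # Assumption: runs of letters and runs of digits are separate tokens,
    # so "A306" becomes "A", "306" as in the slide's example.
    return re.findall(r"[A-Za-z]+|\d+", text)

def is_word(token):
    return token.isalpha()

def is_integer(token):
    return token.isdigit()

def corpus_pattern(corpus, predicates):
    """Yield (doc_id, start, end, tokens) for every run of consecutive
    tokens in which the i-th token satisfies the i-th predicate."""
    n = len(predicates)
    for doc_id, tokens in corpus.items():
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            if all(test(tok) for test, tok in zip(predicates, window)):
                yield doc_id, start, start + n, window

# Hypothetical one-document corpus.
corpus = {"doc1": tokenize("room is A306")}
matches = list(corpus_pattern(corpus, [is_word, is_integer]))
print(matches)
```

The match is reported as a document identifier plus a start and end token offset, which plays the role of Doc, From and To in the Prolog call.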
Infrastructure (2)
• Many higher-level primitives are also available. The one below is used in the HTTP demo (note: little knowledge of Prolog is required).

  word_frequencies_corpus(WFs,
      [ cases(alpha),
        case(plain),
        documents(all),
        language(Language),
        number_chars(2, infinite),
        lemmatize(delete)
      ]).
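As a rough illustration of what an option-driven word-frequency primitive computes, the following Python sketch counts words across a set of documents. The parameter names are hypothetical: min_chars loosely mirrors number_chars(2, infinite), while alpha_only and fold_case are guesses at the kind of filtering such options control. The slides do not define the exact meaning of cases(alpha), case(plain) or lemmatize(delete), so none of those is modelled here.

```python
import re
from collections import Counter

def word_frequencies(documents, min_chars=2, alpha_only=True, fold_case=True):
    """Count word occurrences across all documents (illustrative only)."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"\w+", text):
            if alpha_only and not token.isalpha():
                continue  # skip numbers and mixed tokens
            if len(token) < min_chars:
                continue  # skip tokens that are too short
            counts[token.lower() if fold_case else token] += 1
    return counts

# Hypothetical two-document corpus.
docs = ["Add 6 tbsp of sugar.", "Sugar and a pinch of salt."]
freqs = word_frequencies(docs)
print(freqs.most_common(3))
```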
Information Extraction
• Phrases that may be concepts or attributes
  • “6 tbsp of sugar” could be part of a recipe
  • “1089 WB” could be an instance of the concept postal code
• Such phrases don’t follow the rules of “language”
• See demonstration for examples
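Phrases like these can be picked out with surface patterns rather than grammar. The Python sketch below uses two regular expressions as a stand-in; it is not how tOKo does it. The unit list and the postal code shape (four digits followed by two capital letters) are assumptions made for the example.

```python
import re

# Assumed pattern: "<number> <unit> of <ingredient>"; the unit list is illustrative.
QUANTITY = re.compile(r"\b(\d+)\s+(tbsp|tsp|cups?|g|kg|ml)\s+of\s+([a-z]+)",
                      re.IGNORECASE)
# Assumed pattern: four digits, optional space, two capital letters.
POSTAL_CODE = re.compile(r"\b(\d{4})\s?([A-Z]{2})\b")

def extract(text):
    """Return (label, phrase) pairs for every pattern match in the text."""
    found = []
    for m in QUANTITY.finditer(text):
        found.append(("quantity", m.group(0)))
    for m in POSTAL_CODE.finditer(text):
        found.append(("postal_code", m.group(0)))
    return found

results = extract("Stir in 6 tbsp of sugar. Send it to 1089 WB.")
print(results)
```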
Ontology-Based Corpus Searches
• Query corpus with a combination of ontology constructs and language elements
• Example:
  • [fruit] and [fruit]
  • Matches: “I bought some apples and pears”
  • Because [apple] is-a [fruit] (according to the ontology) and “apples” is the plural of “apple” (according to the dictionary)
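The example above combines two lookups: the dictionary maps an inflected form to its lemma, and the ontology says whether that lemma falls under the queried concept. A minimal Python sketch of that combination follows. The tiny IS_A and LEMMAS tables and all function names are hypothetical stand-ins for the ontology and the dictionary.

```python
# Hypothetical ontology: child concept -> parent concept.
IS_A = {"apple": "fruit", "pear": "fruit", "fruit": "food"}
# Hypothetical dictionary: inflected form -> lemma.
LEMMAS = {"apples": "apple", "pears": "pear"}

def is_kind_of(concept, ancestor):
    """Follow is-a links upwards from concept until ancestor is found."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

def element_matches(token, element):
    kind, value = element
    if kind == "concept":
        lemma = LEMMAS.get(token.lower(), token.lower())
        return is_kind_of(lemma, value)
    return token == value  # a literal must match exactly

def search(tokens, query):
    """Return every run of tokens that matches the query elements in order."""
    n = len(query)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(element_matches(t, e) for t, e in zip(window, query)):
            hits.append(" ".join(window))
    return hits

# The query "[fruit] and [fruit]" written as already-parsed elements.
query = [("concept", "fruit"), ("literal", "and"), ("concept", "fruit")]
hits = search("I bought some apples and pears".split(), query)
print(hits)
```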
Ontology-Based Text QL
• Language constructs (provisional):
  • [concept] matches a concept (and sub-concepts) including inflections, synonyms, etc. in the corpus
  • (word) matches a word (incl. inflections, etc.)
  • <word class> matches all members of the word class
  • @20 matches all (compound) terms that appear at least 20 times in the corpus
  • integer matches any integer
  • literal matches precisely that literal
• Demonstration
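Since the constructs are marked provisional, the following Python sketch only shows how a query string using them might be split into typed elements. The grammar is inferred from the bullet list above and is an assumption, as are the element names; it does not attempt to evaluate the query against a corpus.

```python
import re

# One alternative per construct; anything else is treated as a bare token.
ELEMENT = re.compile(r"""
    \[(?P<concept>[^\]]+)\]      # [concept]
  | \((?P<word>[^)]+)\)          # (word)
  | <(?P<word_class>[^>]+)>      # <word class>
  | @(?P<min_freq>\d+)           # @20
  | (?P<bare>\S+)                # the keyword "integer" or a literal
""", re.VERBOSE)

def parse_query(query):
    """Split a query string into (kind, value) pairs."""
    elements = []
    for m in ELEMENT.finditer(query):
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "min_freq":
            value = int(value)
        elif kind == "bare":
            kind = "integer" if value == "integer" else "literal"
        elements.append((kind, value))
    return elements

parsed = parse_query("[fruit] and (buy) <noun> @20 integer")
print(parsed)
```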
Status
• Usage
  • Ontology development (both research and contracted)
  • Document indexing (by Jan Jacobs and colleagues at Oce)
  • Finding inconsistencies in documents (has just started)
  • Research on top of tOKo (mostly using weblogs as a source, see my website for papers)
• Caveats
  • Dictionary used is not “public” (CELEX)
  • Creating a corpus from an “arbitrary” set of documents may involve some programming (templates exist for HTML and plain text document sets)
Plan
• Open Source?
  • Perhaps it is an idea to create an Open Source version
• To do (for Open Source version)
  • Documentation (although lack of documentation has so far not been a problem for end-users)
  • Make infrastructure / external interfaces consistent
  • Some performance issues
• Conclusion
  • Listen to users for good ideas!