
Concept Space Construction



  1. Concept Space Construction Todd Littell 27SEP06

  2. Roadmap
     • Last Year:
        • High quality concept space construction performed offline.
        • First version of inference engine via graph operations.
     • This Year:
        • High quality concept space construction performed online.
        • Second implementation of reasoning engine or graph mining application?
     • Future:
        • Online semantic net construction.
        • Sophisticated reasoning engine.

  3. Semantic Net
     • Goal: Efficient, high-quality learning of a semantic network from text.
     • A Semantic Net is a kind of Knowledge Representation (KR) structure that is typically represented by a graph model.
     • What does a Semantic Net comprise?
        • Concepts
        • Entities
        • Types
        • Relationships: associative, categorical, functional, structural, mechanical, temporal, spatial…
     • A Concept Space (or Association Network) is a simplified Semantic Net that captures concepts and concept-associations.
     Ref: www.jfsowa.com: “A cat is on the mat”.
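As a concrete illustration of the "graph model" view, here is a minimal sketch (not from the presentation) of how the Sowa example "A cat is on the mat" could be stored as a typed, labeled graph; the class name, relation labels, and concept names are all illustrative assumptions.

```python
# Illustrative sketch only: a tiny typed, labeled graph for "A cat is on the mat".
from collections import defaultdict

class SemanticNet:
    def __init__(self):
        # adjacency: source concept -> list of (relation, target concept)
        self.edges = defaultdict(list)

    def add_relation(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbors(self, concept):
        return self.edges.get(concept, [])

net = SemanticNet()
net.add_relation("cat", "agent-of", "sit")   # the cat is the agent of the sitting
net.add_relation("sit", "location", "mat")   # the sitting is located on the mat
net.add_relation("cat", "is-a", "animal")    # a categorical (type) relationship

print(net.neighbors("cat"))   # [('agent-of', 'sit'), ('is-a', 'animal')]
```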

  4. Applications
     • Uses of Semantic Nets:
        • Knowledge Base for Reasoning and Inference: Fuzzy ER Model, Markov Net, Bayesian Net, Causal Net, etc.
        • Information Browsing: navigation through the space, drill-up, drill-down, drill-across.
        • Domain Modeling: communication, Information System construction, etc.
        • Query Formulation for Retrieval: feedback, user modeling, etc.

  5. Related Knowledge Models
     • Other kinds of knowledge models:
        • Concept Graphs: see John Sowa’s site, http://www.jfsowa.com/cg/index.htm; defines rules for assertions, composition and limited inference. Also OWL/RDF.
        • Graphical Models: Belief Nets, Causal Diagrams, Signed Directed Graphs, Lattices.
        • Concept Lattices: FCA – a mathematical theory of objects, attributes & mappings.
        • UML: formal modeling notation & semantics defined specifically for the software industry.
        • Express-G: formal modeling notation & semantics defined for engineering.
        • Moore/Mealy/Petri Nets: actionable semantics for modeling & simulation.
     • Aspects for comparison: domain, expressiveness, ad hoc vs. well-defined semantics, discrete vs. continuous, typing, purpose/application, underlying theory, etc.

  6. Why Bother?

  7. Levels of Complexity

  8. Example Semantic Net

  9. Associative Net (1)

  10. Associative Net (2)

  11. Associative Net (3) Ref: Mark Steyvers, Joshua B. Tenenbaum, “The Large-Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth”, Cognitive Science 29, 2005.

  12. Basic Algorithm
     • Calculate a measure of association between each pair of terms using a similarity measure, and output the best associations.
       Let T = set of terms, D = set of documents, f(t,d) = frequency of t in d.
       Let adj(t) := { d | f(t,d) > 0 } be the term adjacency list.
       Let adj(d) := { t | f(t,d) > 0 } be the document adjacency list.

       For each t1 in T:
           For each d in adj(t1):
               For each t2 in adj(d):
                   s1[t2] += g(f(t1,d), f(t2,d))
           For each t2 with s1[t2] > 0:
               s := h(s1[t2], params)
               if s >= thresh: output (t1, t2, s)
     • Notes:
        • Term vectors are sparse – no need to iterate through all dimensions.
        • Many similarity/distance measures exist, as well as other kinds of measures.
        • The characteristics of an “ideal similarity metric” are tied to the application.
        • All calculations are independent, hence easily parallelizable.
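A minimal runnable sketch of the loop above, assuming a toy in-memory term-by-document frequency table; the choices of g (product of frequencies) and h (normalization by document count), the threshold, and the sample terms are placeholders, not the measures actually used in BeeSpace.

```python
# Toy version of the basic term-association loop from slide 12.
from collections import defaultdict

freq = {                       # f(t, d): term -> {doc -> frequency}
    "queen":   {"d1": 3, "d2": 1},
    "hive":    {"d1": 2, "d3": 4},
    "forager": {"d3": 1},
}

# Invert to document adjacency lists: adj(d) = {t | f(t,d) > 0}
doc_adj = defaultdict(dict)
for t, docs in freq.items():
    for d, f in docs.items():
        doc_adj[d][t] = f

def g(f1, f2):                 # placeholder pairwise contribution
    return f1 * f2

def h(s1, num_docs):           # placeholder normalization
    return s1 / num_docs

thresh = 0.5
num_docs = len(doc_adj)
for t1, docs in freq.items():
    s1 = defaultdict(float)               # accumulator over co-occurring terms
    for d in docs:                        # d in adj(t1)
        for t2, f2 in doc_adj[d].items(): # t2 in adj(d)
            if t2 != t1:
                s1[t2] += g(docs[d], f2)
    for t2, acc in s1.items():
        s = h(acc, num_docs)
        if s >= thresh:
            print(t1, t2, round(s, 2))
```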

  13. BeeSpace Variations
     • Only interested in calculating for representative terms with thresh1 < freq < thresh2.
     • In some cases, only interested in calculating for user-specified documents.
     • Together these imply working with the restricted matrix F' = D_R F D_C, where D_R, D_C are selection matrices and F is the term-by-document frequency matrix.
     • Only need to output the top K similar terms.
     • Only need to output terms with sim > thresh3.
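A sketch of these restrictions using a small dense matrix for clarity; the thresholds, K, the chosen documents, and the inner-product score are illustrative assumptions, and a real term-by-document matrix would be stored sparsely.

```python
# Illustrative restriction to representative terms and user-selected documents.
import numpy as np

F = np.array([[3, 1, 0],      # term-by-document frequency matrix (toy data)
              [2, 0, 4],
              [0, 0, 1],
              [9, 9, 9]])     # an overly common term we want to drop

thresh1, thresh2, K = 1, 20, 2
term_freq = F.sum(axis=1)
rep_rows = np.where((term_freq > thresh1) & (term_freq < thresh2))[0]
sel_cols = np.array([0, 2])                # user-specified documents

# Equivalent to D_R * F * D_C with 0/1 selection matrices.
F_sub = F[np.ix_(rep_rows, sel_cols)]

# Simple co-occurrence score: inner products of term rows over the selected docs.
S = F_sub @ F_sub.T
np.fill_diagonal(S, 0)

for i, row in enumerate(S):
    top = np.argsort(row)[::-1][:K]        # top-K most associated terms
    print(int(rep_rows[i]), [(int(rep_rows[j]), int(row[j])) for j in top if row[j] > 0])
```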

  14. Co-occurrence Metrics
     • Many extensions are possible for incorporating weighting functions, and features such as POS tags, context, word distance, window size, etc.
     • Ref: Terra & Clarke, “Frequency Estimates for Statistical Word Similarity Measures”.
     • Similarity requirements: s(x,y) >= 0; s(x,y) > s(x,z) implies y is more similar to x than z; optionally s(x,y) = s(y,x).
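One concrete measure satisfying the stated requirements is the Dice coefficient on document adjacency sets; the sketch below (toy data, not from the slides) shows it is non-negative, symmetric, and larger when more documents are shared.

```python
# Dice coefficient on document adjacency sets as an example similarity measure.
def dice(adj_x, adj_y):
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|), in [0, 1]."""
    if not adj_x and not adj_y:
        return 0.0
    return 2.0 * len(adj_x & adj_y) / (len(adj_x) + len(adj_y))

adj = {
    "queen":   {"d1", "d2", "d5"},
    "hive":    {"d1", "d5", "d7"},
    "forager": {"d7"},
}

print(dice(adj["queen"], adj["hive"]))     # 0.666... -> strongly associated
print(dice(adj["queen"], adj["forager"]))  # 0.0      -> no shared documents
```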

  15. MI & PWMI
     • Assuming MLE, with co-occurrence counts
       f(x,y) := |adj(x) ∩ adj(y)| = Σ_d 1[ f(x,d) > 0 and f(y,d) > 0 ].
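A sketch of pointwise mutual information (and the corresponding mutual information sum) computed from MLE document-level estimates built on the count above; N and the counts are made-up numbers for illustration.

```python
# PWMI and MI from document-level MLE estimates (toy numbers).
import math

N = 1000                       # total number of documents
n_x, n_y, n_xy = 40, 50, 10    # |adj(x)|, |adj(y)|, |adj(x) ∩ adj(y)|

p_x, p_y, p_xy = n_x / N, n_y / N, n_xy / N
pwmi = math.log2(p_xy / (p_x * p_y))       # pointwise mutual information
print(round(pwmi, 3))                      # 2.322: co-occur ~5x more than chance

# Mutual information sums the same quantity over all four joint outcomes,
# weighted by their probabilities (occur/occur, occur/absent, ...).
def term(p_joint, p_a, p_b):
    return 0.0 if p_joint == 0 else p_joint * math.log2(p_joint / (p_a * p_b))

mi = (term(p_xy, p_x, p_y)
      + term(p_x - p_xy, p_x, 1 - p_y)
      + term(p_y - p_xy, 1 - p_x, p_y)
      + term(1 - p_x - p_y + p_xy, 1 - p_x, 1 - p_y))
print(round(mi, 4))
```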

  16. MI & PWMI Generalizations
     • Obvious generalizations: utilize document weight parameters p(d).
     • Non-obvious generalizations:
       Ref: Barry Robson, “Clinical and Pharmacogenomic Data Mining…”, Journal of Proteome Research, 2003.
       Ref: Jonathan Wren, “Extending the mutual information measure to rank inferred literature relationships”, BMC Bioinformatics, 2004.

  17. Results • Look at results in spreadsheet…

  18. “Metric Closeness”

  19. “Make a Faster Wheel”
     • Optimized I/O
     • Parallelize
     • Better Algorithm
     • Better Code
     • Smarter Data Structures

  20. Optimize I/O
     • Only use formatted I/O for human consumption; use binary I/O for all other cases.
     • Use buffered I/O if reading/writing small chunks at a time.
     • See handout.
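A sketch contrasting formatted text output with buffered binary output for a large list of term-pair scores; the file names and the fixed record layout (two uint32 ids plus a float32 score) are illustrative assumptions, not the BeeSpace file format.

```python
# Formatted vs. buffered binary I/O for term-pair score records.
import struct

pairs = [(1, 2, 0.83), (1, 7, 0.41), (5, 9, 0.77)] * 100000

# Formatted I/O: human-readable, but slower to write/parse and larger on disk.
with open("scores.txt", "w") as out:
    for t1, t2, s in pairs:
        out.write(f"{t1}\t{t2}\t{s:.6f}\n")

# Binary I/O: fixed-size records, written through a buffered file object.
rec = struct.Struct("<IIf")
with open("scores.bin", "wb") as out:          # open() is buffered by default
    for t1, t2, s in pairs:
        out.write(rec.pack(t1, t2, s))

# Reading the binary file back one record at a time.
with open("scores.bin", "rb") as inp:
    t1, t2, s = rec.unpack(inp.read(rec.size))
    print(t1, t2, round(s, 2))                 # 1 2 0.83
```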

  21. Parallelize
     • Do the problems split naturally? Does divide-and-conquer apply?
     • Levels of parallelization:
        • Very coarse grained: distributed agents.
        • Coarse grained: parallel jobs.
        • Medium grained: forked processes.
        • Fine grained: multi-threaded.
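Since each source term's associations can be computed independently (slide 12), a medium-grained option is to partition the term set across forked worker processes; the sketch below uses a Python multiprocessing pool with a stand-in scoring function.

```python
# Partitioning per-term association work across worker processes.
from multiprocessing import Pool

TERMS = ["queen", "hive", "forager", "larva", "drone", "pollen"]

def score_term(t1):
    # Placeholder for the per-term association loop from slide 12.
    return t1, len(t1)                 # pretend "score" = word length

if __name__ == "__main__":
    with Pool(processes=3) as pool:    # each worker handles a slice of terms
        for t1, score in pool.map(score_term, TERMS):
            print(t1, score)
```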

  22. Better Algorithm
     • How to compare algorithms?
        • Time complexity.
        • Space complexity.
        • Parallelizability.
        • Time-to-develop.

  23. Better Code
     • Know your language.
     • Factor invariant expressions outside of loops.
     • Pre-compute whenever possible: cache results.
     • Sacrifice OO-ness.
     • Customized data structures.
     • Optimized I/O.
     • Avoid “long” calls (e.g. network, disk, etc.).
     • Tune to the memory hierarchy.
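A small sketch of two of the bullets: factoring a loop-invariant expression out of an inner loop and caching the result; the idf weighting and the toy data are illustrative, not part of the original code.

```python
# Loop-invariant hoisting and result caching, before and after.
import math

docs = {"d1": 120, "d2": 80, "d3": 400}        # doc -> length
doc_freq = {"queen": 2, "hive": 3}             # term -> number of docs containing it
N = len(docs)

def idf(term):
    return math.log(N / doc_freq[term])

# Before: idf(term) is recomputed for every document in the inner loop.
def score_slow(term):
    return sum(idf(term) / length for length in docs.values())

# After: the loop-invariant idf value is computed once and cached per term.
_idf_cache = {}
def score_fast(term):
    if term not in _idf_cache:
        _idf_cache[term] = idf(term)
    w = _idf_cache[term]                       # hoisted out of the loop
    return sum(w / length for length in docs.values())

print(round(score_slow("queen"), 4), round(score_fast("queen"), 4))  # identical
```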

  24. Smarter Data Structures
     • Understand your language’s built-in collections library.
     • Roll-your-own data structures can often outperform generic libraries. Why?
     • Hybrid techniques.
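A sketch of the roll-your-own idea: a sparse term vector stored as two parallel sorted arrays rather than a general-purpose dictionary, which keeps the data compact and contiguous for merge-style scans; the class and its operations are illustrative assumptions.

```python
# Sparse term vector as parallel sorted arrays (toy roll-your-own structure).
from array import array
from bisect import bisect_left

class SparseVector:
    def __init__(self, items):
        items = sorted(items)                       # [(term_id, freq), ...]
        self.ids = array("I", (i for i, _ in items))
        self.vals = array("f", (v for _, v in items))

    def get(self, term_id):
        k = bisect_left(self.ids, term_id)
        if k < len(self.ids) and self.ids[k] == term_id:
            return self.vals[k]
        return 0.0

    def dot(self, other):
        # Merge-style scan over two sorted id arrays.
        s, i, j = 0.0, 0, 0
        while i < len(self.ids) and j < len(other.ids):
            if self.ids[i] == other.ids[j]:
                s += self.vals[i] * other.vals[j]; i += 1; j += 1
            elif self.ids[i] < other.ids[j]:
                i += 1
            else:
                j += 1
        return s

v1 = SparseVector([(3, 2.0), (17, 1.0), (42, 5.0)])
v2 = SparseVector([(17, 4.0), (42, 1.0)])
print(v1.get(17), v1.dot(v2))                       # 1.0 9.0
```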

  25. To-Do/Unresolved
     • Decide what the complete set of applications will be for this component: browsing, inference, retrieval, etc.
     • Evaluate the metrics using SMEs.
     • Decide what set of mined relations is significant for the applications chosen above.
     • Investigate more advanced methods and compare trade-offs.
