
Project Overview

Discovering Concepts Hidden in the Web. Tsau Young (‘T. Y.’) Lin, Computer Science Department, San Jose State University, San Jose, CA 95192-0249, USA. tylin@cs.sjsu.edu


Presentation Transcript


  1. Project Overview: Discovering Concepts Hidden in the Web. Tsau Young (‘T. Y.’) Lin, Computer Science Department, San Jose State University, San Jose, CA 95192-0249, USA. tylin@cs.sjsu.edu

  2. Main results A set of documents is associated with a matrix, called the Latent Semantic Index (LSI). By treating the row vectors as points in Euclidean space (point = TFIDF), the documents are clustered (categorized). A set of documents is also associated with a polyhedron; the association is believed to be one-to-one. Corollary: A set of English documents and their Chinese translations can be identified automatically via their semantics.

  3. Main results A set of documents is associated with a polyhedron; the association is believed to be nearly one-to-one. Corollary: A set of English documents and their Chinese translations can be identified automatically via their semantics.

  4. Main results The identification is made by semantics, as there is no explicit correspondence between the two sets of documents.

  5. Outline 1. Introduction (Domain: Information Ocean; Methodology: Granular Computing; Results) 2. Intuitive View of Granular Computing 3. A Formal Theory

  6. Current State • Current search engines are syntax-based systems; they often return many meaningless web pages • Cause: inadequate semantic analysis and a lack of semantics-based organization of the information ocean.

  7. Information Ocean • The Internet is an information ocean. • Navigating it requires a methodology. • A new methodology: Granular Computing

  8. Granular Computing: a methodology The term granular computing was first used to label a subset of Zadeh’s granular mathematics as my research area in BISC, 1996-97 (Zadeh, L.A. (1998) Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Computing, 2, 23-25.)

  9. Granular computing Since then, it has grown into an active research area: • books, sessions, workshops (Zhong, Lin was the first independent conference using the name GrC; there have been several in JCIS) • IEEE task force

  10. Granular Computing Granulation seems to be a natural problem-solving methodology deeply rooted in human thinking. The human body, for example, is granulated into head, neck, etc.

  11. Granulating Information Ocean • In this talk, we will explain how we granulate the semantic space of the information ocean, which consists of millions of web pages

  12. Organizing Information Ocean • How should the information ocean be organized? • Consider the semantic space

  13. Latent Semantic Space • A set of documents/web pages carries certain human thoughts. We will call the totality of these thoughts the • Latent Semantic Space (LSS) • (recall the Latent Semantic Index (LSI))

  14. Classification & clustering In data mining, • classification means identifying an unseen object with one of the known classes in a partition • clustering means grouping a set of objects into disjoint classes based on similarity, distance, etc.; the key point here is that the classes are not known a priori.

  15. Categorizing Information • Multiple concepts can exist simultaneously in a single web page, so a powerful clustering method is needed to organize web pages. (The number of concepts cannot be known a priori.)
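The distinction on this slide can be made concrete with a small sketch. The two functions below are illustrative choices only (a nearest-centroid classifier and a distance-threshold clustering); neither is taken from the talk, and the sample points, labels, and radius are hypothetical.

```python
import math

def nearest_class(point, centroids):
    """Classification: assign an unseen point to the closest KNOWN class."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

def threshold_clusters(points, radius):
    """Clustering: classes are NOT known in advance; points closer than
    `radius` (directly or through a chain of points) end up together."""
    clusters = []
    for p in points:
        # Existing clusters that contain a point within `radius` of p.
        linked = [c for c in clusters if any(math.dist(p, q) <= radius for q in c)]
        merged = [p] + [q for c in linked for q in c]
        # Keep the untouched clusters, replace the linked ones by their merge.
        clusters = [c for c in clusters if all(c is not l for l in linked)]
        clusters.append(merged)
    return clusters

# Hypothetical data, for illustration only.
centroids = {"a": (0.0, 0.5), "b": (5.0, 5.5)}
label = nearest_class((4.0, 4.0), centroids)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters = threshold_clusters(points, radius=1.5)
```

Here `nearest_class` needs the classes up front, while `threshold_clusters` discovers two groups from the points alone.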

  16. Latent Semantic Space(LSS) • The simplest representations of LSS? • A Set of Keywords • LSI

  17. Latent Semantic Index

  18. TFIDF Definition 1. Let $T_r$ denote a collection of documents. The significance of a term $t_i$ in a document $d_j$ in $T_r$ is its TFIDF value, calculated by the function $\mathrm{tfidf}(t_i, d_j)$, which is equivalent to the value $\mathrm{tf}(t_i, d_j) \cdot \mathrm{idf}(t_i, d_j)$. It can be calculated as $$\mathrm{tfidf}(t_i, d_j) = \mathrm{tf}(t_i, d_j) \cdot \log \frac{|T_r|}{|T_r(t_i)|}$$

  19. TFIDF where $|T_r(t_i)|$ denotes the number of documents in $T_r$ in which $t_i$ occurs at least once, and $$\mathrm{tf}(t_i, d_j) = \begin{cases} 1 + \log N(t_i, d_j) & \text{if } N(t_i, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$ where $N(t_i, d_j)$ denotes the number of times the term $t_i$ occurs in document $d_j$, counted over all of its non-stop words.
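The definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the logarithm is taken to be the natural log (the slides do not give the base), and the stop-word list, tokenizer, and toy corpus are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "on", "in"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf(count):
    """tf = 1 + log(N) if N > 0, and 0 otherwise (natural log assumed)."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def tfidf_matrix(docs):
    """Return (terms, rows) with rows[j][i] = tfidf(t_i, d_j)."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    # |Tr(t_i)|: number of documents in which t_i occurs at least once.
    doc_freq = {t: sum(1 for c in counts if t in c) for t in terms}
    n_docs = len(docs)
    rows = [
        [tf(c[t]) * math.log(n_docs / doc_freq[t]) for t in terms]
        for c in counts
    ]
    return terms, rows

# Toy corpus (hypothetical).
docs = [
    "wall street is the financial center",
    "the wall of the house",
    "street market on the street",
]
terms, rows = tfidf_matrix(docs)
```

Each row of `rows` is one document expressed as a point whose coordinates are TFIDF values, which is the representation the following slides treat as a point in Euclidean space.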

  21. Latent Semantic Index Treat each row as a point in Euclidean space. Clustering such a set of points (using SVD) is a common approach. Note that the points have very little to do with the semantics of the documents.

  22. Topological Space of LSS Euclidean space has many metrics but only one topology; we will use this topology.

  23. Keywords (0-Association) 1. Given by experts 2. A term with a high TFIDF is a keyword • “Wall”, “Door”, . . ., “Street”, “Ave”

  24. Keyword Pairs (1-Association) • 1-association (“Wall”, “Street”) → a financial notion that has nothing to do with the two vertices, “Wall” and “Street”

  25. Keyword Pairs (1-Association) • 1-association (“White”, “House”) → a notion that has nothing to do with the two vertices, “White” and “House”

  26. Keyword Pairs (1-Association) • 1-association (“Neural”, “Network”) → a notion that has nothing to do with the two vertices, “Neural” and “Network”
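One possible way to find candidate 1-associations is to count how often two keywords occur in the same document and keep the pairs above a support threshold. The slides do not specify the selection criterion, so the co-occurrence rule, the `min_support` parameter, and the sample keyword lists below are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_support(docs_keywords):
    """For each unordered keyword pair, count the documents containing both."""
    support = Counter()
    for keywords in docs_keywords:
        for pair in combinations(sorted(set(keywords)), 2):
            support[pair] += 1
    return support

def one_associations(docs_keywords, min_support):
    """Keep the pairs that co-occur in at least `min_support` documents."""
    support = pair_support(docs_keywords)
    return {pair for pair, n in support.items() if n >= min_support}

# Hypothetical keyword lists, one per document.
docs_keywords = [
    ["wall", "street", "stock"],
    ["wall", "street", "bank"],
    ["wall", "door"],
]
pairs = one_associations(docs_keywords, min_support=2)
```

With this toy input only the pair ("street", "wall") reaches the threshold; the pair is treated as a unit in its own right, independent of what "wall" or "street" mean separately.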

  27. Geometric Analogy: 1-Simplex • (open) 1-simplex: (v0, v1) → open segment; (“Wall”, “Street”) → financial notion • End points (boundaries) are not included

  28. Keywords are abstract vertices • LSS of documents/web pages → Simplicial complex • A special hypergraph • Polyhedron → Simplicial complex

  29. r-Association • Similarly, an r-association represents some semantics generated by a set of r keywords; moreover, the semantics may have nothing to do with the individual keywords. There is a mathematical structure that reflects such properties; see next

  30. Topology: (Open) Simplex • 1-simplex: open segment (v0, v1) • 2-simplex: open triangle (v0, v1, v2) • 3-simplex: open tetrahedron (v0, v1, v2, v3) • No boundaries are included

  31. Topology: (Open) Simplex • An (open) r-simplex is the generalization of these low-dimensional simplexes (segment, triangle and tetrahedron) to their higher-dimensional analogue in r-space (Euclidean space of dimension r) • Theorem. An r-simplex uniquely determines its r+1 linearly independent vertices, and vice versa

  32. Face • The convex hull of any m+1 vertices of the r-simplex is called an m-face. • The 0-faces are the vertices, the 1-faces are the edges, the 2-faces are triangles, and the single r-face is the whole r-simplex itself.

  33. Edge: a line segment where two faces of a polyhedron meet, also called a side.
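As a small combinatorial sketch, the faces of a simplex can be enumerated as subsets of its vertex set. The code uses the convention under which 0-faces are vertices and 1-faces are edges, so an m-face is spanned by m + 1 vertices; the vertex names are placeholders.

```python
from itertools import combinations

def faces(vertices, m):
    """All m-faces of the simplex on `vertices`, each a tuple of m + 1 vertices."""
    return list(combinations(vertices, m + 1))

tetrahedron = ("v0", "v1", "v2", "v3")  # a 3-simplex

# Number of m-faces for m = 0, 1, 2, 3.
counts = [len(faces(tetrahedron, m)) for m in range(4)]
edges = faces(tetrahedron, 1)
```

For the tetrahedron this gives 4 vertices, 6 edges, 4 triangles, and a single 3-face, the tetrahedron itself.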

  34. n-Complex • A simplicial complex C is a finite set of simplices such that: • Any face of a simplex from C is also in C. • The intersection of any two simplices from C is either empty or is a face of both of them. • If the maximal dimension of the constituent simplices is n, then the complex is called an n-complex.
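A minimal sketch of these conditions, assuming simplices are represented abstractly as sets of keyword vertices (an abstract simplicial complex). In that representation the intersection condition follows from closure under faces, so only the first condition is checked. The function names and keyword sets are hypothetical.

```python
from itertools import combinations

def faces_of(simplex):
    """All nonempty faces of one simplex (itself included), as frozensets."""
    verts = sorted(simplex)
    return {
        frozenset(combo)
        for size in range(1, len(verts) + 1)
        for combo in combinations(verts, size)
    }

def generate_complex(maximal_simplices):
    """Smallest face-closed family containing the given simplices."""
    family = set()
    for simplex in maximal_simplices:
        family |= faces_of(simplex)
    return family

def is_closed_under_faces(simplices):
    """Check: any face of a simplex in the family is also in the family."""
    family = {frozenset(s) for s in simplices}
    return all(faces_of(s) <= family for s in family)

def dimension(simplices):
    """n for an n-complex: the largest simplex has n + 1 vertices."""
    return max(len(s) for s in simplices) - 1

# Hypothetical keyword simplices: one triangle and one separate edge.
K = generate_complex([{"wall", "street", "stock"}, {"white", "house"}])
```

Here `K` holds 10 simplices (7 from the triangle, 3 from the edge) and is a 2-complex, whereas a lone edge without its two end points would fail the closure check.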

  35. Upper/Closure approximations Let $B(p)$, $p \in V$, be an elementary granule. $U(X) = \bigcup \{B(p) \mid B(p) \cap X \neq \emptyset\}$ (Pawlak); $C(X) = \{p \mid B(p) \cap X \neq \emptyset\}$ (Lin-topology)

  36. Upper/Closure approximations $Cl(X) = \bigcup_i C^i(X)$ (Sierpiński-topology), where $C^i(X) = C(\ldots(C(X))\ldots)$ (transfinite steps). $Cl(X)$ is closed.
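For a finite universe these operators can be computed directly. The sketch below reads the garbled formulas as U(X) = union of the granules B(p) that meet X, C(X) = set of points p whose granule meets X, and Cl(X) = union of the iterates C(X), C(C(X)), and so on, starting from i = 1; that reading, the function names, and the sample granulation are assumptions. The example granulation is reflexive, so starting the union at i = 0 would give the same result.

```python
def upper(B, X):
    """U(X): union of the granules B(p) that intersect X."""
    result = set()
    for granule in B.values():
        if granule & X:
            result |= granule
    return result

def c_step(B, X):
    """C(X): the points p whose granule B(p) intersects X."""
    return {p for p, granule in B.items() if granule & X}

def closure(B, X):
    """Cl(X): accumulate C(X), C(C(X)), ... until nothing new appears.
    On a finite universe the transfinite iteration stops after finitely many steps."""
    acc = c_step(B, X)
    while True:
        grown = acc | c_step(B, acc)
        if grown == acc:
            return acc
        acc = grown

# Hypothetical granulation B: V -> 2^V on V = {1, ..., 5}.
B = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {4, 5}, 5: {5}}
X = {3}

U_X = upper(B, X)
C_X = c_step(B, X)
Cl_X = closure(B, X)
```

In this example one application of C reaches {2, 3}, a second reaches {1, 2, 3}, and further applications add nothing, so the result is closed under C.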

  37. New View Divide (and Conquer): Partition of a set (generalize) → Partition of B-space (topological partition)

  38. New View: B-space The pair $(V, B)$ is the universe; namely, an object is a pair $(p, B(p))$, where $B: V \to 2^V;\ p \mapsto B(p)$ is a granulation

  39. Derived Partitions The inverse images of $B$ form a partition (an equivalence relation): $C = \{C_p \mid C_p = B^{-1}(B_p),\ p \in V\}$

  40. Derived Partitions • $C_p$ is called the center class of $B_p$ • A member of $C_p$ is called a center.

  41. Derived Partitions • The center class $C_p$ consists of all the points that have the same granule • Center class $C_p = \{q \mid B_q = B_p\}$
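The center classes can be computed by grouping points that share the same granule. This is a minimal sketch for a finite universe; the granulation below is hypothetical and deliberately not reflexive at point 5.

```python
from collections import defaultdict

def center_classes(B):
    """Partition V into center classes: C_p = {q | B(q) = B(p)}."""
    groups = defaultdict(set)
    for p, granule in B.items():
        # Points with identical granules fall into the same group.
        groups[frozenset(granule)].add(p)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical granulation; B(5) = {1, 2} does not contain 5 itself.
B = {1: {1, 2}, 2: {1, 2}, 3: {3, 4}, 4: {3, 4}, 5: {1, 2}}
classes = center_classes(B)
```

The result is a genuine partition of V even though the granules themselves need not be disjoint or reflexive: here point 5 joins the class of 1 and 2 because all three have the granule {1, 2}.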

  42. C-quotient set The set of center classes $C_p$ is a quotient set. [Figure labels: US, UK, . . .; Iran, Iraq, . . .; Russia, Korea]

  43. New Problem Solving Paradigm (Divide and) Conquer: Quotient set → Topological quotient space

  44. Neighborhood of center class C (in the case B is not reflexive) [Figure labels: B-granule/neighborhood; C-classes]

  45. Neighborhood of center class [Figure labels: C-classes; B-granule]

  46. Topological partition [Figure labels: B-granule/neighborhood; Cp-classes]

