
Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search

Mathias Verbeke, Bettina Berendt, Siegfried Nijssen, Dept. Computer Science, KU Leuven


Presentation Transcript


  1. Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search Mathias Verbeke, Bettina Berendt, Siegfried Nijssen Dept. Computer Science, KU Leuven

  2. Agenda • Motivation: Diversity → Diversity-aware tools → (our) Context • Main part: Measures of diversity → Tool • Outlook

  3. Motivation (1): Diversity is ... • Speaking different languages (etc.) → localisation / internationalisation • Having different abilities → accessibility • Liking different things → collaborative filtering • Structuring the world in different ways → ?

  4. Motivation (2): Diversity-aware applications ... • Must have a (formal) notion of diversity • Can follow a • “personalization approach” → adapt to the user’s value on the diversity variable(s) → transparently? Is this paternalistic? • “customization approach” → show the space of diversity → allow choice / semi-automatic!

  5. (Our) Context • Diversity and Web usage: language, culture • Family of tools focussing on interactive sense-making helped by data mining • PORPOISE: global and local analysis of news and blogs + their relations • STORIES: finding + visualisation of “stories” in news • CiteseerCluster: literature search + sense-making • Damilicious: CiteseerCluster + re-use/transfer of semantics + diversity

  6. Measuring grouping diversity • Diversity = 1 − similarity = 1 − normalized mutual information (NMI) • [Figure: example groupings, labelled “By colour & ...”, with NMI = 0 and NMI = 0.35]
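The measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slide does not say how the mutual information is normalized, so the geometric mean of the two entropies used here is an assumption, and the function names and the colour/shape example data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of group labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two groupings of the same documents."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI, normalized by sqrt(H(a) * H(b)) (an assumed normalization)."""
    if len(a) != len(b):
        raise ValueError("both groupings must cover the same documents")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: at least one grouping has a single group.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def grouping_diversity(a, b):
    """Diversity = 1 - similarity = 1 - NMI, as on the slide."""
    return 1.0 - nmi(a, b)


# Four documents grouped by colour and, independently, by shape.
by_colour = ["red", "red", "blue", "blue"]
by_shape = ["square", "circle", "square", "circle"]
print(grouping_diversity(by_colour, by_shape))   # independent groupings
print(grouping_diversity(by_colour, by_colour))  # identical groupings
```

Independent groupings share no information, so their diversity is at its maximum; identical groupings (even with different label names) have diversity zero.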

  7. Measuring user diversity • “How similarly do two users group documents?” • For each query q, compare their groupings gr: gdiv(A,B,q) • For various queries: aggregate: gdiv(A,B)
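The formulas on this slide did not survive in the transcript. One reading that is consistent with the "1 − NMI" definition on the previous slide and with the names gdiv(A,B,q) and gdiv(A,B) used later is sketched below. Both lines are assumptions: in particular, the arithmetic mean over the set Q of queries that both users have structured is only one plausible way to aggregate.

```latex
\[
\mathit{gdiv}(A, B, q) \;=\; 1 - \mathrm{NMI}\bigl(gr(A, q),\; gr(B, q)\bigr),
\qquad
\mathit{gdiv}(A, B) \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathit{gdiv}(A, B, q)
\]
```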

  8. ... and now: the application domain ... that’s only the 1st step!

  9. Workflow • Query • Automatic clustering • Manual regrouping • Re-use • Learn + present way(s) of grouping • Transfer the constructed concepts

  10. Concepts • Extension • the instances in a group • Intension • Ideally: “squares vs. circles” • Pragmatically: defined via a classifier

  11. Step 1: Retrieve • CiteseerX via OAI • Output: set of • document IDs, • document details • their texts
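As a rough sketch of what retrieval over OAI could look like, the code below builds an OAI-PMH `ListRecords` request and parses document IDs and details from a response. The verb, the `oai_dc` metadata prefix, and the XML namespaces are standard OAI-PMH; the base URL, the record identifier, and the sample response are hypothetical placeholders, and the slide does not say which requests the tool actually issues. Only metadata is handled here (the abstract stands in for the text); fetching full texts is not covered.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract document ID, title and abstract from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "abstract": rec.findtext(".//dc:description", namespaces=NS),
        })
    return records


# Hypothetical response, trimmed to the elements used above.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample paper</dc:title>
          <dc:description>A sample abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("https://example.org/oai"))  # placeholder endpoint
print(parse_records(SAMPLE))
```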

  12. Step 2: Cluster • “the classic bibliometric solution” • CiteseerCluster: • Similarity measure: co-citation, bibliographic coupling, word or LSA similarity, combinations • Clustering algorithm: k-means, hierarchical • Damilicious: phrases → Lingo • How to choose the “best”? • Experiments: Lingo better than k-means at reconstruction and extension-over-time
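Two of the similarity measures named on this slide can be illustrated with raw counts. The sketch below is only meant to show what the measures count; the function names and the toy citation data are hypothetical, and the slides do not say how CiteseerCluster weights or normalizes these counts before clustering.

```python
from itertools import combinations


def bibliographic_coupling(references):
    """Pair of citing documents -> number of references they share.

    `references` maps each document ID to the set of IDs it cites.
    """
    return {
        (a, b): len(references[a] & references[b])
        for a, b in combinations(sorted(references), 2)
    }


def co_citation(references):
    """Pair of cited documents -> number of documents citing both."""
    counts = {}
    for cited in references.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


# Toy data: d1..d3 are search results, x..z are the papers they cite.
refs = {
    "d1": {"x", "y"},
    "d2": {"y", "z"},
    "d3": {"x", "y"},
}

print(bibliographic_coupling(refs))
print(co_citation(refs))
```

Here d1 and d3 cite the same two papers, so they are the most strongly coupled pair, while x and y are cited together by two documents and are therefore the most strongly co-cited pair.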

  13. Step 3 (a): Re-organise & work on document groups

  14. Step 3 (b): Visualising document groups

  15. Steps 4+5: Re-use • Basic idea: • learn a classifier from the final grouping (Lingo phrases) • apply the classifier to a new search result → “re-use semantics” • Whose grouping? • One’s own • Somebody else’s • Which search result? • “the same” (same query, structuring by somebody else) • “more of the same” (same query, later time → more docs) • “related” (... measured how? ...) • arbitrary
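The basic idea (learn from a final grouping, then apply to a new search result) can be sketched with a deliberately simple stand-in classifier. The slide only says that a classifier is learned from the final grouping using Lingo phrases; the term-overlap scoring below is a substitute chosen for brevity, single words stand in for phrases, and all names and example documents are hypothetical. Documents that match no group fall into the remainder group "Other topics".

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-cased alphabetic tokens (a stand-in for Lingo phrases)."""
    return [t for t in text.lower().split() if t.isalpha()]


def learn_group_profiles(grouping, texts):
    """Learn one term profile per group from a user's final grouping.

    `grouping` maps document ID -> group label; `texts` maps ID -> text.
    """
    profiles = defaultdict(Counter)
    for doc_id, label in grouping.items():
        profiles[label].update(tokenize(texts[doc_id]))
    return dict(profiles)


def classify(text, profiles, fallback="Other topics"):
    """Assign a new document to the group whose profile it overlaps most."""
    terms = Counter(tokenize(text))
    best_label, best_score = fallback, 0.0
    for label, profile in sorted(profiles.items()):
        total = sum(profile.values())
        if total == 0:
            continue
        score = sum(n * profile[t] / total for t, n in terms.items())
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# A user's final grouping of an earlier search result.
texts = {
    "d1": "frequent itemset mining",
    "d2": "association rule mining",
    "d3": "rfid tag tracking",
    "d4": "rfid sensor data",
}
grouping = {"d1": "Pattern mining", "d2": "Pattern mining",
            "d3": "RFID", "d4": "RFID"}
profiles = learn_group_profiles(grouping, texts)

# Re-use the learned groups on documents from a new search result.
for new_doc in ["mining frequent patterns", "rfid tracking of tags",
                "quantum gravity"]:
    print(new_doc, "->", classify(new_doc, profiles))
```

The same `profiles` object could come from one's own grouping or from somebody else's, which is the distinction the slide draws under "Whose grouping?".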

  16. Visualising user diversity (1) Simulated users with different strategies • U0: did not change anything (“System”) • U1: tried to produce a better fit of the document groups to the cluster intensions; 5 regroupings • U2: attempted to move everything that did not fit well into the remainder group “Other topics”, & better fit; 10 regroupings • U3: attempted to move everything from “Other topics” into matching real groups; 5 regroupings • U4: regrouping by author and institution; 5 regroupings → 5×5 matrix of diversities gdiv(A,B,q) → multidimensional scaling

  17. Visualising user diversity (2) • [Figure: panels labelled “Data mining”, “RFID”, “Web mining”, and “aggregated using gdiv(A,B)”]

  18. Evaluating the application • Clustering only: Does it generate meaningful document groups? • yes (tradition in bibliometrics) – but: data? • Small expert evaluation of CiteseerCluster • Clustering & regrouping • End-user experiment with CiteseerCluster • 5-person formative user study of Damilicious

  19. Summary and (some) open questions • Damilicious: a tool that helps users in sense-making, exploring diversity, and re-using semantics • Diversity measures when queries and result sets are different? • How best to present diversity? • How to integrate into an environment supporting user and community contexts (e.g., Niederée et al. 2005)? • Incentives to use the functionalities? • How to find the best balance between similarity and diversity? • Which measures of grouping diversity are most meaningful? • Extensional? • Intensional? Structure-based? Hybrid? (cf. ontology matching) • Which other sources of user diversity? • Thanks!
