1 / 75

Tutorial : Aim and Scope

Clustering on the Web : Theory and Practices ADBIS’06 Tutorial Thessaloniki, Sept. 3 th , 2006 by Athena Vakali Dpt. of Informatics, Aristotle University, Thessaloniki, Greece. Tutorial : Aim and Scope.

watersj
Télécharger la présentation

Tutorial : Aim and Scope

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Clustering on the Web : Theory and PracticesADBIS’06 TutorialThessaloniki, Sept. 3th, 2006by Athena VakaliDpt. of Informatics, Aristotle University, Thessaloniki, Greece

  2. Tutorial : Aim and Scope clarify, overview and categorize existing algorithms and practices in order to understand and assess the role of clustering on the Web. highlight the most important research and implementation issues raised in clustering on the Web in order to : • identify the types of information sources used for clustering, • emphasize the difficulties in clustering raised by the diversity of Web data structure and representation • understand how increasing in Web information accessibility may be realized by particular clustering techniques, • recognize existing clustering approaches in current applications frameworks and identify the trends ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  3. Presentation structure 14:30-15:30 Part I : Web Sources Clustering : A theoretical overview • Categorization of the types of data used for clustering – identify their main properties. • Web usage clustering - emphasis on web logs, sessions and user identification towards clustering. • Web documents-objects clustering – emphasis on vector models and page grouping 15:30-16:00 Part II :Web Clustering Frameworks and Implementations • Evaluation of the role of clustering in particular Web applications • Searching • Visualization • e-commerce • caching & prefetching • Identification of future trends in clustering on the Web 16:00-16:10 Tutorial Summary and Closure ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  4. Clustering on the Web : Basics ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  5. Web Applications Web User Web Data Web Search Web data clustering - Basics • Organize data circulated over the Web into groups / collections to facilitate data availability & accessing, and at the same time meet user preferences • Grouping Web objects into “classes” so that similar objects are in the same class and dissimilar Web objects are in different classes. • Discover patterns and relationships between data attributes and employ unsupervised learning cluster analysis : the art of finding groups in data ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  6. Some questions … • Why clustering on the Web ? • Which clustering approaches have been used ? • On what types of “Web data” has clustering been applied? • Which applications have been favored ? • What are the trends in Web clustering ? ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  7. Some answers …Clustering is an important way of organizinginformation. It assists in reducing searchspace and decreasing information retrievaltime, some benefits .. • increasing Web information accessibility • decreasing lengths in Web navigation pathways • improving Web users requests servicing • improving information retrieval • improving content delivery on the Web • understanding users’ navigation behavior • integrating various data representation standards • extending current Web information organizational practices ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  8. clustering approaches on the Web • Hierarchical clustering • Partitional clustering • Probabilistic clustering • Graph-based clustering • Fuzzy clustering • Neural Network based clustering • Hybrid approaches ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  9. Which applications have been favored ? • web searching • e-commerce and Web advertisement • caching and proxy management • topic hierarchies and sites organization • web-based transactions • … ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  10. What do we consider as “Web data” ? • Web documents • A collection of Web Pages (set of related Web resources, such as HTML files, XML files, images, applets, multimedia resources etc.) • Users’ logs and sessions • records of users’actions within a Web site are stored in a log file (each log file record has the client’s IP address, the date and time the request is received, the requested object and some additional information -such as protocol of request, size of the object etc.) • sessions : group of activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  11. web users clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  12. web user clustering what is our “data” ? • 4 phases: • data preparation  data representation • similarity-proximity measures • cluster discovery • cluster analysis-assessment What is the cluster content ? ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  13. Users clustering/grouping • Pages clustering/grouping Web usage mining in practice … • mine the log files and discover • User patterns • Sessions • Pathways ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  14. Usage Data : originate from Web logs Web Server Log File: A Web log file is a collection of records of user requests for documents on a Web site : 216.239.46.60 - - [04/Jan/2003:14:56:50 +0200] "GET /~lpis/curriculum/C+Unix/Ergastiria/Week-7/filetype.c.txt HTTP/1.0" 304 - 216.239.46.100 - - [04/Jan/2003:14:57:33 +0200] "GET /~oswinds/top.html HTTP/1.0" 200 869 64.68.82.70 - - [04/Jan/2003:14:58:25 +0200] "GET /~lpis/systems/r-device/r_device_examples.html HTTP/1.0" 200 16792 216.239.46.133 - - [04/Jan/2003:14:58:27 +0200] "GET /~lpis/publications/crc-chapter1.html HTTP/1.0" 304 - 209.237.238.161 - - [04/Jan/2003:14:59:11 +0200] "GET /robots.txt HTTP/1.0" 404 276 209.237.238.161 - - [04/Jan/2003:14:59:12 +0200] "GET /teachers/vakali.html HTTP/1.0" 404 286 216.239.46.43 - - [04/Jan/2003:14:59:45 +0200] "GET /~oswinds/publication who visits ? which page ? when ? ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  15. Problems with the Web logs processing • too many log records due to the visiting of image files, etc • not adequate/detailed info is provided • there is no info about the content of the pages visited, • incomplete log recording due to the request servicing by proxies ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  16. Some Practices in Web logs processing Data Cleaning: removes log entries that are not needed for the mining process • e.g. images, css files etc • Typically, log entries are filtered: • log entries with filename suffixes such as gif,jpeg,jpg • the page requests made by the automated agents and spider programs • the log entries that have a status code of 400 and 500 series • POST data (i.e. CGI request). • Page Visiting Duration : Time difference between consecutive page requests. Why evaluating visiting page time? • the time spent on a page is a good measure of the user's interest in that page, providing an implicit rating for that page • Drawback : some users are left to a page because they have completed a search and they no longer wish to navigate, or because their … phone rang … ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  17. E 2 1 D C 3 D 2 C 1 A 1 B 3 Session 1: 2,1,3 Session 2 : 2, 1, 1, 3 … Pathway 1 : E/D/C Pathway 1 : D/C/A/B … Log Data representation • Sessions • session pathways- trees 216.239.46.60 - - [04/Jan/2003:14:56:50 +0200] "GET /~lpis/curriculum/C+Unix/Ergastiria/Week-7/filetype.c.txt HTTP/1.0" 304 - 216.239.46.60 - - [04/Jan/2003:14:57:33 +0200] "GET /~oswinds/top.html HTTP/1.0" 200 869 216.239.46.60 - - [04/Jan/2003:14:58:25 +0200] "GET /~lpis/systems/r-device/r_device_examples.html HTTP/1.0" 200 16792 216.239.46.133 - - [04/Jan/2003:14:58:27 +0200] "GET /~lpis/publications/crc-chapter1.html HTTP/1.0" 304 - 216.239.46.133 - - [04/Jan/2003:14:59:11 +0200] "GET /robots.txt HTTP/1.0" 404 276 216.239.46.133 - - [04/Jan/2003:14:59:12 +0200] "GET /teachers/vakali.html HTTP/1.0" 404 286 216.239.46.133 - - [04/Jan/2003:14:59:45 +0200] "GET /~oswinds/publication ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  18. User 1 Session 1 Session 2 2 3 3 3 2 3 2 1 3 1 3 1 3 1 1 3 1 3 1 3 User 2 Session 1 7 7 7 7 7 7 7 7 User 3 Session 1 Session 2 Session 3 1 5 1 5 1 3 1 1 3 1 5 1 5 5 1 1 1 1 1 1 1 1 1 1 1 examples for capturing web user browsing patterns • identification of unique users ? • Users with the same client IP are identical • identification of sessions ? • A new session is created when a new IPaddress is encountered or if the visiting page time exceeds a time threshold (i.e. 30 minutes) for the same IP-address • Each individual has a set Di={s1,s2,…,sni} where each s is a sequence that represent the observed record of page requests for individual i and the different sequences represent the different sessions. • Intra-session transactions can be obtained based on a model of user behavior (involves classifying references as “content” or “navigational” for each user) • Weights may be assigned to each Web page based on some measures of user interest (e.g., duration of viewing a Web page) ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  19. User sessions clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  20. Why clustering users navigation sessions ? Clustering users’ navigation sessions : Groups together a set of users’ navigation sessions having similar characteristics benefits • User Grouping • Discover groups of users exhibiting similar browsing patterns • Web Page Grouping • Discover groups of pages having related content • based on how often URL references occur together across user sessions ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  21. A “generic” clustering approach for users’ navigation sessions • Determine the attributes to be used to estimate similarity between users’ sessions, or determine the users’ session representation • Determine the “strength” of the relationships between the attributes, or a similarity measure (correlation distance) • Apply clusteringalgorithms to determine the classes/clusters to which each user session will be assigned ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  22. Site Files Users’ Navigation Sessions Frequent Itemsets Clustering users’ navigation sessions :An overview [Cooley00] Data Preparation Usage Mining Sessions Clustering Usage Communities Usage Profiles Data Cleaning Session Identification Server Logs & Other Click-Stream Data Association-Rule Discovery Domain Knowledge ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  23. Algorithms for Sessions Clustering • Similarity-based clustering • parameters: distance functions-measures, number k of clusters • approaches • Hierarchical – e.g. determine a hierarchy of clustering, merging always the most similar clusters • Partitional - determine a “flat” clustering into k clusters (with minimal costs) • Model-based or Probabilistic clustering • parameters: number k of clusters • task : Determine a probability model for each cluster ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  24. Similarity-based session Clustering (I) • Originally, sessions clustering efforts considered sessions as un-ordered sets of “clicks”, where the number of common pages visited was a similarity indication between sessions (measures used : Euclidean dist., cosine measure, Jaccard coefficient etc). • Later on, it was recognized that the order of visiting pages is important, since for example visiting a page A after a page B is not the same information as knowing that both A and B belong to the same session. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  25. Similarity-based session Clustering (II) measure the likeness between web users based on the information in snapshot web sessions. Most popular approaches are : • Sequence Alignment Method (SAM) [Wang02, Xiao01, Hay01,], where sessions are chronologically ordered sequences of page accesses. • represented web sessions as vectors of encoded page IDs andthen a clustering algorithm handling categorical data was employed. • SAM measures similarities between sessions, taking into account the sequential order of elements in a session. • define : Web pages similarity and then, sessions similarity (by a scoring function-dynamic programming method to match related sessions). SAM distance measure between two sessions is defined as the number of operations that are required in order to equalize the sessions. ROCK, CHAMELEON clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  26. Similarity-based session Clustering (III) Graph-based clustering • Clickstream analysis ([Kothari03,Banerjee01])– how to evaluate similarities between two clickstreams ? • edit (or Levenhstein) distance : cost of transformations that result in two clickstreams to be identical • identify LCS (Largest Common Subsequence) : length of the largest subsequence common between two clickstreams • similarity between 2 clickstreams requires finding similarity/distance between 2 page views. Since semantic analysis is not possible, the degree of similarity between two page views is proportional to their relative frequency of cooccurrence. • Generalization-based clustering[Fu99] • rather than clustering the web users based on web sessions directly, sessions were generalized so that pages representing the similar semantics are collapsed and the dimension of clustering feature can be reduced significantly. • uses page URLs to construct a hierarchy, for categorizing the pages (identify the so-called ”general pages”). • then, the pages in each user session are replaced by the corresponding general pages and clustered BIRCH : Incremental clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  27. Similarity-based session Clustering (IV) Recent work addressed the dynamic nature of web usage data. • A work [Nichele06] addressed the dynamic need to consider content along with usage from log file data : • extends theLCS technique by integrating page similarity to compute session similarity, since it considers the similarity between concepts when computing the LCS between sessions. • contribution is to leveragethe advantages of the domain taxonomy for computing session similarity. Conceptgeneralization is dynamically considered during session similarity computation,as opposed to static, approaches. • Historical web sessions (COWES) [Chen06] and Web Access Sequences (WAS) [Zhao06], have been proposed where web session trees are the basic structure for web log data representation. • similarity measures in the range [0,1] identify the similarity of session subtrees based on the frequently changed subtree patterns. Therefore, the proximity of web users is based on the characteristics of their usage data evolution. Graph-based clustering K-medoid clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  28. Model-based (probabilistic) session clustering (I) • Model-based (or Probabilistic) clustering methods • optimize the “fit” between the given data and a mathematical model • based on the assumption that the data are generated from a probability distribution • Model-based (Probabilistic) clustering problem • Find the model structure • Find the model parameters for the structure that best fit the data in practice … • clustering user sessions according to the amount of time spent on common pages • when a user arrives at a Web site, his/her session is assigned to one of the clusters with some probability • given that a user’s session is in a cluster, his/her next request in that session is generated according to a probability distribution specific to that cluster. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  29. Probabilistic session Clustering (II) Common practice : • assume a model for each cluster (the number of cluster is pre-determined) and find best fit of models to data using the Expectation-Maximization (EM) algorithm. • typically, cluster model : a finite-state Markov model with a number of parameters: • markov models popular to characterize probability of referencing page i after page j • several efforts included : First-order markov models or higher order markov models, HMMs (hidden markov models) etc. [Pallis05, Cadez03, Sen03, Baldi03, Anderson02, Ypma02,Desphpande01, Sarukkai00, Smyth99] • number of clusters is determined by using: • BIC (Bayesian Information Criterion) • Bootstrap methods or cross-validated likelihood • Bayesian approximations ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  30. Probabilistic session Clustering (III) building Markov Models from Web Log Files • Markov model: <S, Q, L> • S - state space containing all the nodes in link graph • Q - matrix of 1-step transition probabilities between nodes • L - initial probability distribution on the states in S • m-order Markov chain • the next page is dependent only on the last visited m pages • m-order n-step Markov chain • the n-th page to be visited in the future is dependent only on the last visited m pages Using Markov Models for Link Prediction • Link prediction is based on a m-order n-step Markov chain • given visited m pages, calculate the probability of visiting page a within the next n steps • the probability: weighted sum of probabilities of visiting page a at 1st to n-th step ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  31. Aristotle University Probabilistic session Clustering (IV) Link Graph [Zhu02]: an example ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  32. Probabilistic session Clustering (V) • Draw an individual i from the overall population. • The individual is assigned to one of K clusters , with probability p(ci=k), where ci indicates the cluster membership. • Each cluster k, , has a data generating model where Qk are the parameters of pk • Consider N individuals each having a data set Di . Let each Di consist of ni observations dij. Each dij represents another smaller data subset. Di now generated for an individual by once cluster membership ci=k is known and given Qk. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  33. Probabilistic session Clustering (VI) • Given the definition of the likelihood function, the EM procedure becomes: • repeat • E step: a straight-forward evaluation of class conditional probability for each individual under each of the K cluster models using values of parameters Θ. • M step: update parameters Θ to obtain maximum likelihood • until a condition is satisfied • // e.g. a certain convergence criterion ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  34. Web documents clustering ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  35. Web-document clustering methods Clustering web documents : Group together a set of documents having similar characteristics pre-clustering methods offline clustering of the entiredocument collection post-clustering methods on-line clustering of the retrieved documentset by Web search engines ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  36. Steps in Web document clustering : • based on a given data representation model, a web document is represented as alogic data structure • similarity between documents is measured by using some similarity measuresthat is depended on the above logic structure. • with a cluster model, a clustering algorithm will build the clusters using thedata model and the similarity measure. Web document clustering in practice .. • Most commonly used is the Vector Space Model (VSM),which represents a web document as a feature vector of the terms thatappear in that document. • Each feature vector contains term weights (usuallyterm-frequencies) of the terms appearing in that document. • Similarity (popular measures : cosine and Jaccard) between webdocuments is measured by distance of the corresponding vectors. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  37. Vector Space Model (VSM) in practice … The aim of computing weight of a selected term is to quantify the term’scontribution to ability to represent the source document topic. The focus of the VSMis how to choose terms from documents and how to weigh the selectedterms. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  38. is VSM suitable for Web docs ? (I) • A problem with representing web documents in vectors is that certain information, such as • the order of term appearance, • term proximity, • term location within the document, and • any web specific information, is lost under the vector model. • Moreover, problems with VSM exist, since vectorrepresentations of documents are frequently very sparse. • Inverted files are used to prevent a tremendous computationaloverload which may be caused in large and diverse document collections (WWW downloaded pages). ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  39. Graph structures used instead of the VSM for Web docs In [Schenker04] web documents arerepresented by graphs which are then used in a classical clustering algorithm;the size of the maximum common subgraphis used tocalculate real-valued distances between pairs of graphs. Example : edges are labeled title (TI),link (L), or text (TX), i.e. the document has the title “YAHOO NEWS”, a link whose text reads “MORE NEWS”, and text containing “REUTERS NEWS SERVICE REPORTS”. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  40. is VSM suitable for Web docs ? (II) • to reduce computation costs and space complexity, many popular methods for clustering webdocuments, including those using inverted files, usually assume a relatively small prefixed number of clusters. • as pointed out in [Friedman06] : • (a) the number of clusters should not be restricted by some relatively prefixed smallnumber, i.e., an arbitrary new incoming vector which is not similar to any of the existing cluster centers necessarily starts anew cluster • (b) a vector with multiple appearance in the training set is counted as n distinct vectors rather than asingle vector. • Several new crisp and fuzzy approaches are proposed based on the cosine similarity principle for clustering documents which are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Also, to devise a new post-clustering another recent work [Im05] have used the notion of “concept vector” to propose an algorithm named “Fuzzy ConceptART(FCART)”. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  41. Clustering documents based on phrases (I) • As recognized by several other earlier works, document clustering should be based not onlyon single word analysis, but on phrases as well, i.e. similarity between documentsshould be based on matching phrases rather than onsingle words . • A novel phrase-based document indexing model is given in [Hammouda04]with the use of Document Index Graph (DIG) that captures thestructure of sentences in the document set, ratherthan single words only. The DIG model is based ongraph theory and utilizes graph properties to matchany-length phrase from a document to any number ofpreviously seen documents in a time nearly proportionalto the number of words of the document. • A phrase-based similarity measure is used for scoring thesimilarity between two documents according to thematching phrases and their significance. • An incremental document clustering methodis proposed basedon maintaining high cluster cohesiveness. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  42. Clustering documents based on phrases (II) ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  43. Web documents clustering : levels of granularity and extensions to typical VSM • In [Huang04]several types of features are used together to perform clustering,inMulti-type Features based Reinforcement Clustering(MFRC) which does not use a unique combine score for all featurespaces, but uses the intermediate clustering result in one feature space as additionalinformation to gradually enhance clustering in other spaces. • Another recent work [Huang06] proposed and an Expanded Vector Space Model (EVSM) for Web document representation basedon granulation. Web documents are represented in many-levelknowledge granularity with sufficiently conceptualsentences to understand valuable relationshidden in data. • With granularity calculation data can be more efficiently andeffectively disposed & knowledge engineers can handle the same dataset indifferent knowledge levels. This provides more reliable soundness forinterpreting results of various data analysis methods. ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  44. Clustering practices in Web documents grouping (I) • Clustering of Web documents helps to discover groups of pages having related content. Moreover, techniques used to recognize and group hypertext nodes into cohesive documents can improve information retrieval results. • Web communities • A set of Web pages that link to more Web pages in the community than to pages outside of the community • A web community enables web crawlers to effectively focus on narrow but topically related subsets of the web. • Logical document • A set of Web pages with similar content • Benefits • Improves Web information retrieval (e.g. search engines) • Improves content delivery on the Web • Compound document … ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  45. Clustering practices in Web documents grouping (II) : Web Communities Web communities were proposed [Greco04, Flake03, Flake00] on the basis of the evolution of an initial set of hubs and authoritative pages, such that the behavior of users is captured with respect to the popularity of existing pages for the topic of interest • HITS & PageRank • Graph cuts and partitions, Maximum Flow and Minimal Cuts • Other methods : • Bibliometric methods: They define a notion of similarity for pages that do not directly link to one another • Bipartite cores: They consist of pages that have high bibliographic metrics with respect to each other ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  46. Clustering practices in Web documents grouping (III) : Logical Info Units • A set of Web pages with similar content. • Then, a data unit for the Web data retrieval should not be a page but a connected subgraph corresponding to one logical document • Introduce the concept of route links [Tajima99], then rank minimal subgraphs under a given query and consider distribution of query keywords within subgraphs ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  47. Clustering practices in Web documents grouping (IV) :Compound Documents • A compound document [Eiron03], [Lara99], is a set of URLs that contains at least a tree embedded within the document. Necessary condition for a set of URLs to form a compound document : their link graph should contain a vertex that has a path to every other part of the document. Typically compound docs identified by clustering are : • linear paths: there is a single ordered path through the document, and navigation to other parts of the document are usually secondary (e.g. news sites with next link at the bottom) • Fully connected: These types of documents have on each page, links to all other pages of the document (e.g. short technical docs and presentations) • Wheel documents: They contain a table of contents and have links from this single table of contents to the individual sections of the document (toc is a kind of “hub” for the document) • Multi-level documents: Complex documents that may contain irregular link structures such as multilevel table of contents ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  48. Part II - Clustering in Web-based Applications ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  49. Clustering users’ navigation sessions – benefits (1) • Users grouping helps to discover groups of users with similar navigation patterns • Provide personalized Web content • Web personalization : any action that adapts the information or servicesprovided by a Web site to the needs of a particular user (or a set of users) • Benefits • discover the preference and needs of individual Web users in order to provide personalized Web info for certain types of users • examine generalized user navigation patterns in order to understand how users use the site ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

  50. Clustering users’ navigation sessions – benefits (2) • Provide information about : • What are the set of pages frequently accessed together by Web users? (frequent itemsets) • What page will be fetched next? (association rules) • What are the paths frequently traversed by Web users? (sequential patterns) • Clustering Web users’ sessions is useful in order to : • improve Web site design and organization • develop prefetching and Web caching policies • recommend related pages • collect business info about Web users behavior Applications : e-commerce, e-learning, e-Gov … ADBIS’06 Tutorial Web clustering, by A. Vakali, Thessaloniki, Sept.3rd 2006

More Related