Mining Click-stream Data With Statistical and Rule-based Methods

Mining Click-stream Data With Statistical and Rule-based Methods Martin Labský, Vladimír Laš, Petr Berka Universityof Economics, Prague

The Clickstream Data • 3 617 171 page requests containing: unix time; IP address; session ID; page request; referer • 522 410 sessions, out of them only 203 887 with length > 1 • 100 000 training set • 60 000 testing set • 43 887 heldout set Discovery Challenge 2005

Clickstream Data Preprocessing unix time ;IP address ; session ID ; page request; referee 1074589200;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=124 ;www.google.cz; 1074589201;194.213.35.234;3995b2c0599f1782e2b40582823b1c94;/dp/?id=182 ; 1074589202;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/ ;www.seznam.cz; 1074589233;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=148 ;/dp/?id=124; 1074589245;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/dp/?id=148; 1074589248;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/contacts/ ; /; 1074589290;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/sb/; • Sequences of page visits in each session (same sessionID) were constructed from the www log data • Sequences of page types [start, dp, dp, sb, sb, end] • Sequences of products [start, 124, 148] Discovery Challenge 2005

Predicting New Page in a Sequence • Problem • Observing a sequence of pages A1A2…An-1 what will be the next page An? • Methods • Markov n-gram models • Decision rules Discovery Challenge 2005

Markov N-gram Predictor (1/5) • Probability of a sequence A1A2….An each term (interpolated k-gram distribution) computed as Discovery Challenge 2005

Markov N-gram Predictor (2/5) where n(xy) is the occurrence of sequence xy in data and Discovery Challenge 2005

Markov N-gram Predictor (3/5) • Building model using EM algorithm 1. compute Pi(i=1,…,k) from counts of sequences observed in the training set DTR 2. assign non-zero initial values to weights i 3. repeat 3.1 compute the probability of the holdout set using the interpolated distribution 3.2 modify the weights Discovery Challenge 2005

Markov N-gram Predictor (4/5) • Results for page types Discovery Challenge 2005

Markov N-gram Predictor (5/5) • Results for product types Discovery Challenge 2005

Rule Induction Algorithms (1/5) • “Classical “ Decision rules in the form Ant => Class (p) where Ant is a conjunction of values of input attributes, p = n(Ant Class)/n(Ant) • Decision rules for clickstreams are in the form Ant => page (p) where Ant is a sequence of pages, p = n(Ant//page)/n(Ant) Discovery Challenge 2005

Rule induction algorithms (2/5) • Compositional algorithm • Set-covering algorithm 1. find a rule that covers some positive examples and no negative example of a given class (concept) 2. remove covered examples from the training set DTR 3. if DTR contains some positive examples (not covered so far) goto 1, else end 1. add empty rule to the rule set KB 2. repeat 2.1 find by rule specialization a rule Ant => Class that fulfils the user given criteria on lengths and validity 2.2 if this rule significantly improves the set of rules KB build so far (we test using the chi2 test the difference between the rule validity and the result of classification of an example covered by Ant) then add the rule to KB Discovery Challenge 2005

Rule induction algorithms (3/5) • Compositional algorithm • Set-covering algorithm rule specialization extends the antecedent sequence by any sequence member from left the decision if the antecedent sequence should be specialized is made by a chi2 test adding a rule CDX to the rule DX changes the rule DX into a rule meaning (D but not CD)X rule specialization extends the antecedent sequence by any sequence member from left the decision if new rule should be added is made by a chi2 test Discovery Challenge 2005

Rule induction algorithms (4/5) Rule-based classification • Set-covering algorithm • apply single rule • Compositional algorithm • combine conritbutions of all applicable rules using pseudo-bayesian formula Discovery Challenge 2005

Rule induction algorithms (5/5) Examples of rules • For the page types sequences • dp, sb -> sb (Ant: 5174; AntSuc: 4801; P: 93%) • ct -> end (Ant: 5502; AntSuc: 1759; P: 32%) • faq -> help (Ant: 594; AntSuc: 127; P: 21%) • For the products sequences • loud-speakers -> video (Ant: 14840, AntSuc: 3785, P: 26%) • data cables -> telephones (Ant: 2560, AntSuc: 565, P: 22%) • PC peripheries -> telephones (Ant: 8671, AntSuc: 1823, P: 21%) Discovery Challenge 2005

Results of testing Discovery Challenge 2005

Conclusions • Comparison of methods • N-gram models as exhaustive sets of compositional rules Pi(c|a…b) ~ a…b  c • Set covering algorithm for exhaustive non comp. rules • Compositional algorithm for non exhaustive comp. rules • Comparison of results • N-gram comparable with set covering (slightly better), worst results for compositional algorithm • All algorithms can be applied by web servers to recommend relevant pages to their users, and to identify interesting patterns in their log files. Discovery Challenge 2005

Thank you for your attention.

Mining Click-stream Data With Statistical and Rule-based Methods

Mining Click-stream Data With Statistical and Rule-based Methods

Presentation Transcript

Data Mining: Association Rule

Association Rule Mining Constraint Based Association Rule Mining

Data Stream Mining

Statistical Data Mining - 3

Data Mining Methods Course

Semantic-based Trajectory Data Mining Methods

statistical analysis and data mining

Entity extraction: rule-based methods

Data Stream Mining and Incremental Discretization

Statistical Methods for Mining Big Text Data

Data Stream Mining

Debellor Data Mining Platform with Stream Architecture

Statistical Data Mining - 2

Data Mining with Naïve Bayesian Methods

Statistical Methods for Text Mining

Data Mining: Association Rule Mining

Data Mining Association Rule Mining

Ceng 714 Data Mining Association Rule Mining

Debellor Data Mining Platform with Stream Architecture

Data Stream Mining

Semantic-based Trajectory Data Mining Methods

Statistical Data Mining - 3