## Modelling of Interaction Dynamics in Internet-based Multi-user Scenarios


Ata Kabán
http://www.cs.bham.ac.uk/~axk
a.kaban@cs.bham.ac.uk
School of Computer Science, The University of Birmingham
15th November 2005

**Overview**
• Introduction
• A dynamic model for online community identification
• Convex linear and nonlinear models of heterogeneous sequence collections
• Prediction & data exploration
• Experiments & Applications
• Conclusions

“The most important goal for theoretical computer science in 1950-2000 was to understand the von Neumann computer. The most important goal for theoretical computer science from 2000 onwards is to understand the Internet.”
Christos H. Papadimitriou

**Introduction**
• Scenarios
  • Direct interaction [online discussion stream]
  • Indirect interactions [browsing]
• Challenges
  • Heterogeneous behaviour
  • Apparently highly entropic
  • Need parsimonious & efficient profiles
  • Both predictive & explanatory
  • Need to provide scalable algorithms

**Why bother?**
• Trying to understand a complex system is scientifically interesting
• The ability to analyse & predict individual activity is practically useful
  • To infer and predict user / customer preferences
  • To understand user behaviour
  • To provide a basis for personalised environments based on history of activity
  • Examples: profiling consumer brand preferences; occupational mobility; web browsing; phone usage; command usage; etc.
• Evidence (data) is cheap to acquire
  • Traces of user activity logs, most often symbolic sequences

**Nothing works…**
• Community identification
  • Long-standing problem
  • Clustering approaches
  • But: what if temporal events are subject to random delays?
  • What if there are no distinct homogeneous groups?
• Prediction
  • Existing methods are global: they assume that all users are the same
  • One cannot observe each user long enough to build a fully personalised predictor

**A Dynamic Bibliometric Model…**
Reference: X Wang & A Kaban, SIAM Data Mining 06, submitted
• Data from one day's worth of discussions: T=25,355 contributions from W=844 chat participants.
<bigboy> … <xxx> … <yy> … <xxx> … <bigboy> … <yy> … <bigboy> … <xxx> … <tinigirl> … <uuu> … <tinigirl> …
The 1st order connectivity graph

**A Dynamic Bibliometric Model…**
The Aggregate Markov model (Saul & Pereira ’97)
• The ‘group’ is a latent variable
AGGREGATE MARKOV

**But:**
• Over 800 people typing concurrently at different terminals around the world
• Contributions arrive in a sequential order
• What are the ‘true’ interactions? What is the ‘real’ connectivity graph?

A Dynamic Bibliometric Model…
**Mixed Memory Markov model** (Raftery ’85, ’94; Saul & Jordan ’99): a parsimonious approximation to the higher order Markov model
• 2 variants of the model
• Limit distribution studied
• Parsimonious use of parameters
• Numerical optimisation algorithms
• No clustering ability

**A Dynamic Bibliometric Model…**
The Aggregate Mixed Memory Markov Model, or the Mixed Memory Aggregate Markov Model
• The ‘true temporal connection’ is a latent variable
AGGREGATE MIXED MEMORY MARKOV
MIXED MEMORY AGGREGATE MARKOV

**ML estimation algorithm**
• Iterate until convergence (equivalent to EM, space-efficient)
• Each iteration scales linearly with the number of observed L-grams

**Model (order) selection using AIC**
[Plots against the number of groups. Panel labels: “Aggregate Markov”. Annotations: “best model for this data”; “(C=9, L=9) Aggregate Mixed Memory Markov model wins”; “Too many parameters…”]

**A Dynamic Bibliometric Model for the Identification of Online Communities**
Reordered according to the inferred state clusters
Inferred connections

**The distribution of influential time lags**

Direct interactions [online chat]
Indirect interactions [browsing]

**Heterogeneous collections of sequences: e.g. browsing traces**
• Individuals all interact with the same media independently
• No clusters of sites found, as defined by the browsing paths globally
• Tools are needed to capture the heterogeneity of individual activity traits together with global activity traits

**Probabilistic Models for Multiple Sequences: State of the Art**
• Global models
  • Cannot capture heterogeneity: they treat all individuals the same
• Individual models
  • Individual activity history is typically too short for obtaining a reliable estimate
  • For N individuals we would need to store N times the number of parameters of the sequence model employed
• Mixtures of sequence models (MMC)
  • Assume homogeneous prototypical behaviour within each group
  • Cannot capture multiple relationships
Ref: Cadez et al ’04

**The need for distributed sequence models**
• Necessary trade-off between the definition of global and individual-specific representations
• Common behavioural patterns are the basis of multiple relationships between individuals
• May yield a more realistic model of the behaviour exhibited by the population as a whole
• Parsimonious representation

**Simplicial Mixtures of Markov Chains**
References: Girolami & Kaban, NIPS ’03. Longer version in Data Mining and Knowledge Discovery, 10:3, 2005.
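
The formula for this slide did not survive extraction. As a hedged sketch only, a simplicial mixture of first order Markov chains can be written as below, up to the initial-state term. The symbols $s_t$ (observed state at step $t$), $N$ (sequence length), $K$ (number of components) and $T_k$ (transition matrix of component $k$), and the Dirichlet form of the prior, are assumptions of this sketch; $x$ and $\alpha$ follow the names used in the VB estimation slide.

$$
P(s_1,\dots,s_N \mid x) \;\propto\; \prod_{t=2}^{N} \sum_{k=1}^{K} x_k \, T_k(s_{t-1}, s_t),
\qquad x_k \ge 0,\quad \sum_{k=1}^{K} x_k = 1,
\qquad x \sim \mathrm{Dir}(\alpha)
$$

Each sequence gets its own mixing vector $x$ on the simplex, rather than a single cluster label as in a mixture of Markov chains.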
• Now x is a continuous latent variable
• Exact estimation becomes intractable
• Approximate estimation techniques employed: MAP, variational Bayes
• In both cases a simple algorithm with linear scaling is obtained
SMMC

**Single cause prior:**
• Multiple cause prior:

**VB estimation**
Solving for T, x, α, Q, then replacing Q in the updates which contain it yields simple multiplicative updates similar to NMF.

**Algorithm**
• Iterate until convergence:
• Linear in the number of observed transitions

**Application: Telephone Usage Modelling**
• 1,172,578 calls in week 1
• 1,753,304 calls in week 2
• Destination numbers mapped to 87 geographic regions & mobile operators
• Week 1 activity employed for estimation
• Week 2 activity used for testing
• Performance measures considered:
  • Predictive perplexity on unseen sequences
  • Percentage of symbols correctly predicted on unseen sequences
  • Out of sample log likelihood
• Parameter interpretability assessed

**Prediction error on transactions in Week 2**
• Solid straight line: global 1st order MC
• Solid line: SMMC (estimated with VB)
• Dashed line: SMMC (estimated with MAP)
• Dash-dot line: MMC

**Explanatory user profiles**
Example of the activity profile of one of the customers over a K=20 component SMMC (one point on a 19-D latent simplex). Each of these components was a 1st order Markov Chain.
Plotted quantity: $E_{P(x \mid \mathrm{Seq}_n)}[x]$

**Application: Web browsing behaviour prediction**
• Dataset previously used in Cadez et al.
• 17 page categories from MSN website form the common state space
• Users who visited at least 9 out of 17 page categories selected for this experiment
• Total 119,667 page requests over 1,480 web browsing sessions (small data set)

**10-fold cross-validated predictive perplexity**
• Solid straight line: global 1st order MC
• Solid line: SMMC (estimated with VB)
• Dashed line: SMMC (estimated with MAP)
• Dash-dot line: MMC

**Complexity of the component MCs measured as the distribution of entropy rates**
Low complexity
• favours predictability
• favours interpretability

**5 selected basis-transitions**
• SMMC component MCs [separates common behaviour into one component]
• MMC cluster-prototypes [common behaviour superimposed on all prototypes]
black=0, white=1

**So far so good…**
• Community finding from direct online interactions by using discrete latent variables to infer the ‘true’ connections and the cluster membership
• Distributed modelling of heterogeneous activity traces by using a continuous latent variable to capture the spread

**Sample size issues**
• The estimation of mixtures needs a large number of sequences
• The estimation of simplicial mixtures needs long (rich) sequences

**Topographic Mixtures of Sequence Models**
Reference: A Kaban, Proc. ITCC’05
• Exact estimation is intractable
• A sampling is employed

**The estimation algorithm**
• Iterate until convergence:
• Each iteration scales linearly with the number of non-zero elements in the data!
• Scalable Generative Topographic Mapping (SGTM)

**Prediction with distributed sequence models**
• Combines basis-wise predictions in proportions specified by the posterior expectation
• User-specific deeper past (w.r.t. the global trait) is embodied in the posterior expectation
• In consequence, neither a simplicial mixture of 1st order MCs nor a topographic mixture of 1st order MCs is a 1st order model

**Illustration of the representation: Prototype vs. aspects view**
It can be shown that the model estimation algorithm minimises a weighted sum of entropies of the parameters.

**Visualisation of large document collections**
10-Newsgroups text collection

**Aspect-level map of the estimated topical components at equidistant locations of the latent space**

**CPU Time**
Computational demand drastically reduced in comparison with existing probabilistic topographic models for discrete data

**Application: Predictive modelling and exploratory analysis of dynamic user behaviour from a large web log collection**
• Using the big msnbc.com web log sequence collection previously used in Cadez et al.
• Training on 100,000 randomly chosen user traces, totalling 801,745 page requests
• Testing on a further 88,181 previously unseen user traces, totalling 714,280 page requests
• Evaluation criteria used:
  • Generalisation (out of sample log likelihood)
  • Prediction (out of sample predictive perplexity), with varying sample size issues studied
  • Visualisation and exploratory analysis

**A summary of 100,000 browsing traces: Lists of the most probable sequences at equal locations of the latent space**

**Model space view**
Map of state transition components estimated from the browsing sequence data set
white=0, black=1

**Explanatory user profiles extracted from the same model**
[Panels: Prototype view, Aspect view]

**Prototype view**
[Panels: Prototype view and Aspect view for User Profile 2 and User Profile 3]

**Common behaviour component**
Different topologies…
Grouping-specific behaviour components

**Conclusions**
• Consistent generative probabilistic framework
  • Discrete latent variables used for inferring state groupings and for inferring influential past states
  • Continuous latent variables used for representing heterogeneous sequence sets in terms of common patterns
• Linear time algorithms obtained
• Tested in real applications
  • Simple structures found behind complex observations
  • Improved prediction for previously unseen individuals
  • Efficient compression / low entropy parameters
  • Interpretable parameters

**References**
• X Wang & A Kabán: A Dynamic Bibliometric Model for the Identification of Online Communities. Submitted to SIAM DM’06.
• A Kabán: A Scalable Generative Topographic Mapping for Sparse Data Sequences. Proc. International Conference on Information Technology: Coding and Computing (ITCC’05).
• M Girolami & A Kabán: Simplicial Mixtures of Markov Chains: Distributed Modelling of Dynamic User Profiles. Advances in Neural Information Processing Systems (NIPS’03). (Extended version in Data Mining and Knowledge Discovery, 10:3, 2005.)
• A Kabán & X Wang: Context-based Identification of Communities from Internet Chat. Proc. IJCNN’04.
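
The slide "Prediction with distributed sequence models" says that basis-wise predictions are combined in proportions given by the posterior expectation, and the applications report predictive perplexity. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the array layout `(K, S, S)` for the component transition matrices, and the toy numbers are all assumptions made for illustration, and the posterior expectation `x_mean` is simply taken as given rather than estimated.

```python
import numpy as np


def predict_next(x_mean, T, prev_state):
    """Next-symbol distribution as a convex combination of K basis chains.

    x_mean     : (K,) assumed posterior expectation of the mixing proportions
    T          : (K, S, S) row-stochastic transition matrices (hypothetical layout)
    prev_state : index of the most recent observed state
    """
    # T[:, prev_state, :] has shape (K, S): one predictive row per basis chain.
    return x_mean @ T[:, prev_state, :]


def predictive_perplexity(x_mean, T, seq):
    """exp of the negative mean log-probability of the observed transitions."""
    log_probs = [
        np.log(predict_next(x_mean, T, a)[b])
        for a, b in zip(seq[:-1], seq[1:])
    ]
    return float(np.exp(-np.mean(log_probs)))


# Toy example: two basis chains over two states (made-up numbers).
T = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # "sticky" chain
    [[0.1, 0.9], [0.9, 0.1]],   # "alternating" chain
])
x_mean = np.array([0.75, 0.25])  # this user is mostly "sticky"

p = predict_next(x_mean, T, prev_state=0)
ppl = predictive_perplexity(x_mean, T, [0, 0, 0])
```

Because `x_mean` would in practice be inferred from the user's whole observed history, the combined predictor depends on more than the previous state, which is the sense in which the slide says the mixture is not itself a 1st order model.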