FALL 2012 DSCI5240 Graduate Presentation By Xxxxxxx
Outline
Web Usage Mining
• Definition and Goal
• Source and Type of Data
• Data Collection and Pre-processing
• Data Modeling
• Discovery and Analysis of Web Usage Patterns
Web Usage Mining
• Definition and Goal
  • Definition: automatic discovery and analysis of patterns in web usage data
  • Goal: capture, model, and analyze the behavior patterns and profiles of users interacting with web sites
• Source and Type of Data
  • Server log files: web server and application access logs
  • Site files and metadata
  • Operational databases
  • Application templates
  • Domain knowledge
  • Data collected by Internet service providers
Data Collection
• Web site and application data
  • Primary source of data in Web Usage Mining
  • Each HTTP request generates a single entry in the server access log
  • Log entry: time and date of request; IP address; resource requested; HTTP method; user agent (browser and operating system); referring web resource; client-side cookies
• Sample log entries:
  2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
  2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf
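
As a rough illustration of the log entry described above, the first sample entry can be split into named fields. This is a minimal sketch: the field names in `FIELDS` are hypothetical labels chosen for this example (the slide does not name the columns), the meaning of the two `-` columns and of `9221` is assumed, and the `+` characters in the user agent are assumed to stand for spaces.

```python
# Hypothetical labels for the 13 space-separated columns of the sample entry.
FIELDS = [
    "date", "time", "client_ip", "username", "method", "uri_stem",
    "uri_query", "status", "bytes_sent", "http_version", "host",
    "user_agent", "referrer",
]

def parse_log_entry(line):
    """Split one space-delimited log entry into a dict keyed by FIELDS."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    entry["status"] = int(entry["status"])
    entry["bytes_sent"] = int(entry["bytes_sent"])
    # Assumption: '+' encodes a space inside the user agent string.
    entry["user_agent"] = entry["user_agent"].replace("+", " ")
    return entry

sample = (
    "2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
    "HTTP/1.1 maya.cs.depaul.edu "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
    "http://dataminingresources.blogspot.com/"
)
entry = parse_log_entry(sample)
print(entry["client_ip"], entry["method"], entry["uri_stem"], entry["status"])
```

A truncated entry, such as the second sample, has fewer columns and is rejected with a `ValueError` rather than being silently misread.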
Data Abstraction
• Pageview: collection of web objects or resources corresponding to a single “user event”. Examples: reading an article, viewing a product page, adding a product to a shopping cart.
• Session: sequence of pageviews by a single user during a single visit.
• Content data: objects and relationships conveyed to the user (text and images).
• User data: information from operational databases (e.g., user profile information, visit histories…)
Web Usage Data Pre-Processing
• Data fusion: merging log files from several web and application servers, based on:
  • shared embedded session IDs
  • heuristic methods based on the “referrer” field in server logs
• Data cleaning: removing irrelevant entries, such as references to style files, graphics, or sound files
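
The data cleaning step can be sketched as a filter on the file extension of each requested resource. The extension list `IGNORED_EXTENSIONS` and the sample URLs are assumptions made for illustration; a real site would choose its own list.

```python
import os
from urllib.parse import urlparse

# Hypothetical set of style, graphics, and sound extensions to discard.
IGNORED_EXTENSIONS = {".css", ".gif", ".jpg", ".jpeg", ".png", ".ico", ".wav", ".mp3"}

def is_relevant_request(uri):
    """Return True unless the requested resource has an ignored extension."""
    path = urlparse(uri).path
    extension = os.path.splitext(path)[1].lower()
    return extension not in IGNORED_EXTENSIONS

requested = [
    "/classes/cs589/papers.html",
    "/styles/site.css",
    "/images/logo.GIF",
    "/classes/cs589/papers/cms-tai.pdf",
]
cleaned = [uri for uri in requested if is_relevant_request(uri)]
print(cleaned)
```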
Web Usage Data Pre-Processing (Continued)
• Pageview identification: attributes recorded for each pageview
  • Pageview ID (URL uniquely representing the pageview)
  • Static pageview type (e.g., information page, product page)
  • Metadata (keywords)
• User identification
  • User authentication mechanisms (user activity record)
  • Client-side cookies
• Sessionization
  • Segmenting each user's activity record into sessions, each representing a single visit to the site
  • An episode is a subset or subsequence of a session comprising semantically or functionally related pageviews
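
One simple way to sessionize is a time-based heuristic: a new session starts whenever the gap between two consecutive requests from the same user exceeds a timeout. The slide does not specify a method, so the 30-minute timeout, the `(user, timestamp, url)` tuple layout, and the sample data below are all assumptions for this sketch.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Group (user, timestamp, url) tuples into per-user sessions.

    A new session starts when the user changes or when the gap since
    the user's previous request is longer than `timeout`.
    """
    sessions = []
    prev_user, prev_time = None, None
    for user, timestamp, url in sorted(requests):
        if user != prev_user or timestamp - prev_time > timeout:
            sessions.append({"user": user, "pageviews": []})
        sessions[-1]["pageviews"].append(url)
        prev_user, prev_time = user, timestamp
    return sessions

start = datetime(2006, 2, 1, 0, 8, 43)
requests = [
    ("u1", start, "/a.html"),
    ("u1", start + timedelta(seconds=3), "/b.html"),
    ("u1", start + timedelta(hours=2), "/a.html"),   # long gap: new session
    ("u2", start + timedelta(minutes=1), "/c.html"),
]
sessions = sessionize(requests)
for session in sessions:
    print(session["user"], session["pageviews"])
```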
Web Usage Data Pre-Processing (Continued)
• Path completion
  • Recovers references that are missing from the log because of client-side or proxy caching. When a user returns to a previously viewed page, the cached copy is shown and no request reaches the server.
• Data integration
  • Integrates user data (e.g., demographics, ratings, and purchase histories) and product attributes and categories from operational databases
• Building content-enhanced transaction data
  • Multiply the user-pageview matrix by the transpose of the term-pageview matrix (see Bamshad Mobasher, ch. 12: Web Usage Mining, pp. 14-18)
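
The matrix product behind content-enhanced transactions can be sketched with small made-up matrices. The values below are illustrative only: rows of `user_pageview` are users, rows of `term_pageview` are terms, and both have one column per pageview, so the product has one row per user and one column per term.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x r)."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]

# Rows: users u1, u2. Columns: pageviews p1, p2, p3 (1 = visited).
user_pageview = [
    [1, 0, 1],
    [0, 1, 1],
]
# Rows: terms t1, t2. Columns: pageviews p1, p2, p3 (1 = term occurs).
term_pageview = [
    [1, 1, 0],
    [0, 1, 1],
]

# Result: rows are users, columns are terms.
content_enhanced = matmul(user_pageview, transpose(term_pageview))
print(content_enhanced)  # [[1, 1], [1, 2]]
```

Each entry counts how many of the pages a user visited contain a given term, so users are described by content features rather than by individual pages.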
Discovery and Analysis of Web Usage Patterns
• Session and Visitor Analysis
  • Data is aggregated by predetermined units such as days, sessions, visitors, or domains
  • Reports on most frequently accessed pages, average view time of a page, average length of a path through a site, and common entry and exit points
  • Useful for improving system performance and supporting marketing decisions
• Online Analytical Processing (OLAP) provides a more integrated framework for analysis with a higher degree of flexibility
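
A few of the session-level reports listed above can be sketched with simple counting. The sessions below are made-up data, each given as the ordered list of pages a visitor viewed.

```python
from collections import Counter

# Hypothetical sessions: ordered pageviews per visit.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
    ["/products", "/cart", "/checkout"],
]

# Most frequently accessed pages.
page_hits = Counter(page for session in sessions for page in session)
# Common entry and exit points (first and last page of each session).
entry_points = Counter(session[0] for session in sessions)
exit_points = Counter(session[-1] for session in sessions)
# Average length of a path through the site, in pageviews.
average_path_length = sum(len(session) for session in sessions) / len(sessions)

print(page_hits.most_common(3))
print(entry_points.most_common(1))
print(round(average_path_length, 2))
```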
Discovery and Analysis of Web Usage Patterns
• Cluster Analysis and Visitor Segmentation
  • Recall that clustering is a data mining technique that groups together a set of items having similar characteristics
  • User clusters (most used)
    • Clustering of user records (sessions or transactions)
    • Establishes groups of users exhibiting similar browsing patterns
    • Useful for providing personalized Web content to similar users
  • Page (or item) clusters
    • Based on usage data (i.e., starting from user sessions or transaction data): items commonly accessed or purchased together are automatically organized into groups
    • Based on content features associated with pages or items (keywords or product attributes): collections of pages or products related to the same topic or category
    • Can also be used to provide permanent or dynamic HTML pages that suggest related hyperlinks to users according to their past history of navigational or purchase activities
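
User clustering can be illustrated by representing each session as a binary vector over pages and grouping similar vectors. The slides do not name a clustering algorithm; the sketch below uses plain k-means as one possible choice, with made-up session vectors and hand-picked starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, centroids, iterations=10):
    """Plain k-means starting from the given centroids."""
    labels = []
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(vector, centroids[i]))
            for vector in vectors
        ]
        # Recompute centroids; keep the old one if a cluster is empty.
        updated = []
        for i, old in enumerate(centroids):
            members = [v for v, label in zip(vectors, labels) if label == i]
            updated.append(mean(members) if members else old)
        centroids = updated
    return labels, centroids

# Columns: /home, /products, /cart, /news, /sports (1 = visited).
session_vectors = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
labels, centroids = kmeans(session_vectors, [session_vectors[0], session_vectors[2]])
print(labels)  # [0, 0, 1, 1]
```

The two resulting groups correspond to shopping-oriented and news-oriented sessions in this toy data.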
Association and Correlation Analysis
• Finds groups of items or pages that are commonly accessed or purchased together
• Enables Web sites to provide effective cross-sale product recommendations
• One problem for association rule recommendation systems is that a system cannot give any recommendations when the dataset is sparse
• Recall that an association rule is an expression of the form X → Y [sup, conf], where X and Y are itemsets; sup is the support of the itemset X ∪ Y, representing the probability that X and Y occur together in a transaction; and conf is the confidence of the rule, defined by sup(X ∪ Y) / sup(X), representing the conditional probability that Y occurs in a transaction given that X has occurred in that transaction
Resource: Bamshad Mobasher, Web Usage Mining; http://maya.cs.depaul.edu/~mobasher/papers/12-web-usage-mining.pdf
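
The support and confidence of an association rule X → Y can be computed directly from a set of transactions. This is a minimal sketch on made-up transactions, where each transaction is the set of pages accessed in one session.

```python
# Hypothetical transactions: pages accessed together in one session.
transactions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(transactions)

def confidence(x, y, transactions):
    """sup(X ∪ Y) / sup(X): how often Y occurs given that X occurs."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Rule {/products} → {/cart}
sup = support({"/products", "/cart"}, transactions)
conf = confidence({"/products"}, {"/cart"}, transactions)
print(sup, round(conf, 3))
```

Here `/products` and `/cart` occur together in 2 of 4 transactions, and `/products` occurs in 3, so the rule has support 0.5 and confidence 2/3.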