
Web Mining: An Overview Of Web Analytics with Examples

Donghui Wu, Ph.D., Oracle Corporation, April 16th, 2003


Presentation Transcript


  1. Web Mining: An Overview Of Web Analytics with Examples Donghui Wu, Ph.D. Oracle Corporation April 16th 2003

  2. Agenda • Web Mining Overview • Basic Web Analysis Problems • Data Warehouse Solutions • Oracle 9iAS Clickstream Intelligence Demo • Site Configure Excerpts • Site Basic Statistics Examples • Business Scenario Examples

  3. Web Mining Web Mining, generally speaking, is the activity of applying data mining principles and processes to the Web domain. It may tackle the World Wide Web as a whole, or focus on a particular group of Web sites (servers). In this talk, we limit the scope to Web usage and pattern analysis or, more specifically, Web Log Mining at the enterprise (Web site) level. In industry, it is also referred to as Web Analytics.

  4. Web Analytics • Web Analytics is the monitoring and reporting of Web site usage so that enterprises can better understand the complex interactions between Web visitor actions and Web site offers, and leverage that insight to optimize the site for increased customer loyalty and sales. • From Web Analytics: Making Business Sense of Online Behavior, Aberdeen Group, June 2002

  5. Web Mining and Privacy • Privacy is always a concern for data mining projects. • When analyzing or mining visitors' online behavior, and visitor/user profiling in particular, privacy is a major concern. • Usually only aggregated information is analyzed, not the data of individual visitors or users.

  6. Web Log Data Sources (1) • Web Server Log • This is the log kept at the Web server; it is easy to obtain and the most widely analyzed. • It is logged at the destination. The analysis is about a particular Web server or group of servers. • One Web server can host many Web sites, and one Web site may be served by multiple Web servers. • Proxy Server Log • If the Web connection goes through a proxy, every request is logged at the proxy server as well. • It is logged at the origin. The analysis is about a group of users, e.g. all users within a company.

  7. Web Log Data Sources (2) • Client Side Browser Log • Embedded client-side collection. It requires sending simple JavaScript with the response to the browser, which collects browser information and client-side visitor activity, e.g. mouse movement, and sends it to a collector server for analysis. • Application Log • A Web application usually has its own logs, at various levels of detail and for various purposes.

  8. Server Log, Proxy Log, and Browser Log

  9. Web Server Log Analysis and Mining • From now on, we limit our subject to Web server log analysis and mining only. • The emphasis is on enterprise Web analytics. • We will use a fictional site, drugdepo.com, as the sample site, and Oracle 9iAS Clickstream Intelligence to produce the sample analysis.

  10. Web Analytics Task Categories • Site Activity and Operation: site traffic, performance, and status • Usage Mining: visitor behavior analysis, referrer analysis, path analysis • User Profiling/Clustering: visitor profiling and segmentation; user profiling and segmentation

  11. Web Analytics Tasks for Business Users • Content effectiveness evaluation • Online marketing campaign analysis • Target marketing analysis • Personalization and recommendation • Cross-sell and up-sell opportunities • Many more…

  12. Data Mining Techniques in Web Analytics The following data mining techniques may be applied to solve those problems: • Association Rule Mining • Clustering / Segmentation (of visitors/users and of pages) • Visitor/User Profiling

  13. Web Mining Difficulties • Data size is huge • For a site with 1 million hits per day, the raw log file can be 500 MB to 1 GB per day, depending on the Web server configuration • Bad records • There are many bad records due to server errors. • Lack of exact information • In many cases, heuristics have to be applied
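As a rough check of the size figures on the slide above, the short sketch below divides the daily log size by the daily hit count. It assumes decimal megabytes and one log line per hit; both are simplifications made here, not statements from the talk.

```python
# Back-of-the-envelope check of the slide's figures.
HITS_PER_DAY = 1_000_000
MB = 10**6  # decimal megabyte, assumed here for simplicity


def bytes_per_hit(log_bytes_per_day: int, hits_per_day: int = HITS_PER_DAY) -> float:
    """Average raw log size per hit, assuming one log line per hit."""
    return log_bytes_per_day / hits_per_day


low = bytes_per_hit(500 * MB)    # lower end of the slide's range
high = bytes_per_hit(1000 * MB)  # upper end (1 GB taken as 1000 MB)
print(low, high)
```

Under these assumptions each hit accounts for roughly 500 to 1000 bytes of raw log, which is consistent with a log line that carries a long user-agent string and referrer.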

  14. Web Server Log Format • NCSA Common Log Format • NCSA Extended Common Log Format • W3C Extended Log File Format For more information, see the W3C website

  15. NCSA Common Log Format The following is a line from an Apache server log. Its first seven fields follow the NCSA Common Log Format; the trailing Refer and Browser fields are the referrer and user-agent additions of the extended (combined) format. The fields are separated by spaces. Host Ident Authuser Time Request Status BytesSent Refer Browser 24.69.48.18 - 709697D0CE694757E034080020CB1B7C [01/Nov/2000:23:59:05 -0800] "GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
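To make the field layout concrete, here is a minimal parsing sketch for the sample line on the slide above. The regular expression, the function name `parse_log_line`, and the dictionary keys are illustrative choices made here, not part of the talk or of any Oracle product. The sketch assumes quoted fields contain no escaped double quotes.

```python
import re
from datetime import datetime

# Seven Common Log Format fields, optionally followed by the quoted
# referrer and browser (user-agent) fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?'
)


def parse_log_line(line: str):
    """Return a dict of fields, or None for a record that does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # treat as a bad record
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the bytes field means no body was sent.
    record["bytes_sent"] = 0 if record["bytes_sent"] == "-" else int(record["bytes_sent"])
    return record


sample = (
    '24.69.48.18 - 709697D0CE694757E034080020CB1B7C '
    '[01/Nov/2000:23:59:05 -0800] '
    '"GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 '
    '"-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"'
)
record = parse_log_line(sample)
print(record["host"], record["status"], record["bytes_sent"])
```

Returning `None` for lines that do not match gives a simple hook for counting and discarding bad records before loading.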

  16. Dynamic Page and Parameters • In the previous example, the requested page is a static page. • For dynamic pages, e.g. ASP, JSP, etc., the request has two parts: the static URL stem and the query string, separated by “?” • The query string consists of “parameter=value” pairs. • Parameters provide detailed information about the request.
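The split into URL stem and parameters can be sketched with Python's standard `urllib.parse` module. The request URL below is a hypothetical example invented for illustration; it does not come from the talk's demo site.

```python
from urllib.parse import urlsplit, parse_qs


def split_request(url: str):
    """Split a requested URL into its static stem and its query parameters."""
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of values, because a
    # parameter may legally appear more than once in a query string.
    params = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, params


# Hypothetical dynamic-page request.
stem, params = split_request("/catalog/item.jsp?category=vitamins&id=42")
print(stem)    # the static URL stem
print(params)  # the parameter=value pairs
```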

  17. Web Log Mining Task Types • Web Log Analyzer • Provides simple statistics, e.g. # of visitors, # of page views, # of sessions, etc. for a given time period • Web Log Mining • Web Usage Mining and Pattern Analysis • E-commerce, Personalization and CRM • Integrate and mine data across the enterprise

  18. Related Terms • Hits • A hit is a URL request in the server log • Page Views (Page Impressions) • A page view may require multiple requests, e.g. several .gif or .jpeg requests plus an .html request • Data Sent • Visitors (identified and unidentified) • Users (Authenticated Visitors) • Sessions

  19. Data Filtering For data analysis purposes, the following data preparation steps are often applied: • Remove .gif, .jpeg, and other non-essential requests from the raw data • Other filtering may also be applied, depending on the task at hand. • Apply page construction rules to consolidate records

  20. Basic Processing • Parse the log and resolve the following: • Client IP address • Visitor ID • User ID • Browser and OS • Request • Session
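A minimal sketch of the first filtering step follows. Only .gif and .jpeg are named on the slide; the other extensions in the set, and the sample request list, are assumptions added here for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# .gif and .jpeg come from the slide; the remaining entries are assumed
# examples of "other non-essential requests".
NON_ESSENTIAL_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_page_request(url: str) -> bool:
    """True if the request should be kept as a candidate page view."""
    path = urlsplit(url).path  # ignore any query string
    extension = posixpath.splitext(path)[1].lower()
    return extension not in NON_ESSENTIAL_EXTENSIONS


# Hypothetical requests: four hits, of which two are page views.
hits = ["/index.html", "/images/logo.gif", "/products/item.jsp?id=7", "/banner.JPEG"]
pages = [url for url in hits if is_page_request(url)]
print(len(hits), "hits ->", len(pages), "page views")
```

The same example illustrates the difference between hits and page views: every request is a hit, but only the requests that survive filtering count toward page views.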

  21. Basic Tasks For any Web analytics work, the following must be resolved before any analysis is possible: • Visitor identification • User identification / matching • Session Construction • Path Completion

  22. Visitor Identification Methods • Client Hostname or IP Address only • IP Address + Browser String • Query String Parameter • Cookie Value • Visitor Field

  23. IP Method Limitations • Single IP / Multiple Users • A single proxy server can serve many users. • Multiple IP / Single User • A single user may use multiple machines over time, or even within one session. For example, AOL dynamically assigns an IP address to every request • Always configure your Web server to use a cookie or a query string parameter if possible

  24. Session Identification • Visitor ID and Timeout Period • Once the Visitor ID is constructed, the requests with the same Visitor ID are sequenced by timestamp, i.e. the time each request was made. If the time difference between two consecutive requests is more than, say, 30 minutes, the sequence is broken into two sessions. • Query String Parameter • In the request query string • Cookie Value • Session Field
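The preference order suggested by the two slides above can be sketched as a single function: use a cookie value when the log carries one, and fall back to IP address plus browser string otherwise. Hashing the fallback key and the `cookie:` / `ipua:` prefixes are conventions invented here, not something the talk prescribes.

```python
import hashlib
from typing import Optional


def visitor_id(ip: str, browser: str, cookie: Optional[str] = None) -> str:
    """Pick the most reliable visitor key available for one log record."""
    if cookie:
        # A cookie survives proxies and changing IP addresses.
        return "cookie:" + cookie
    # Heuristic fallback: different browsers behind one proxy IP are
    # separated, but two users with identical browsers are still merged.
    digest = hashlib.sha1(f"{ip}|{browser}".encode("utf-8")).hexdigest()
    return "ipua:" + digest[:16]


print(visitor_id("24.69.48.18", "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"))
print(visitor_id("24.69.48.18", "Mozilla/4.0", cookie="709697D0CE694757E034080020CB1B7C"))
```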
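The timeout method can be sketched as follows. The function name `build_sessions`, the tuple layout, and the sample requests are illustrative assumptions; the 30-minute threshold is the value mentioned on the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # threshold mentioned on the slide


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, url) records into sessions.

    Returns a list of (visitor_id, [(timestamp, url), ...]) tuples.
    """
    by_visitor = defaultdict(list)
    for vid, timestamp, url in requests:
        by_visitor[vid].append((timestamp, url))

    sessions = []
    for vid, hits in by_visitor.items():
        hits.sort(key=lambda hit: hit[0])  # sequence by timestamp
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                sessions.append((vid, current))  # gap too long: close session
                current = []
            current.append(hit)
        sessions.append((vid, current))
    return sessions


# Hypothetical requests for two visitors on one day.
day = datetime(2000, 11, 1)
requests = [
    ("v1", day.replace(hour=10, minute=0), "/"),
    ("v2", day.replace(hour=10, minute=5), "/"),
    ("v1", day.replace(hour=10, minute=10), "/products"),
    ("v1", day.replace(hour=11, minute=0), "/"),  # 50 minutes after previous hit
]
sessions = build_sessions(requests)
print(len(sessions))
```

In this example visitor v1 ends up with two sessions, because the last request arrives 50 minutes after the previous one, and visitor v2 with one.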

  25. User Identification • Web Server Authentication • Query String Parameter • Cookie Value • A cookie is a small text file that stores information about a visitor on the user’s PC

  26. Web Analytics Solution Types • Simple Web Log Analyzer • Many free ones, simple parsing and counting • WebTrends Web Log Analyzer • Data Warehouse Solutions • WebTrends E-commerce Server • Oracle 9iAS Clickstream Intelligence • Hosting Solutions • Digimine • Consulting Solutions • Many companies specialize in customized Web log and application log analysis

  27. Web Log Analyzer • A Web log analyzer reports simple site usage measures, e.g. # of hits, # of visitors, page sequence, etc. • Methodology: simple parsing and counting • Small and quick, but produces only simple static reports, usually with a large error margin

  28. Data Warehouse Solutions • Load Server Log into Data Warehouse • Integrate with other data, e.g. sales • Support interactive query and OLAP • More accurate analysis and data mining results • Expensive

  29. Simplified DW Schema: Dimensions • Date • Time • Visitor • User • Browser • Client Host

  30. Simplified DW Schema: Dimensions (diagram) • Date • Time of Day • Browser • Client Host • User • Visitor • Page • Server • Site • Event • Referrer • Search

  31. Simplified DW Schema: Facts • Impression (page view) fact: Browser, Client Host, Visitor, User, Page, Time to Serve, Referrer, Status, Event, Server, Session ID • Session fact: Session Date, Session Time, Session Visitor ID, Session User ID, Session Duration, # of Impressions, Data Sent, First Impression ID, Last Impression ID, First Referrer
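To illustrate how the two fact tables relate, the sketch below rolls a set of impression records up into one session record. The class and field names are hypothetical and cover only a subset of the columns listed on the slide; this is not the Oracle 9iAS Clickstream Intelligence schema. Session duration is taken as the time between the first and last impression, which is an assumption, since the log does not show how long the last page was viewed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Impression:
    impression_id: int
    session_id: str
    visitor_id: str
    page: str
    referrer: str
    requested_at: datetime
    bytes_sent: int


@dataclass
class SessionFact:
    session_id: str
    visitor_id: str
    start: datetime
    duration_seconds: float
    impressions: int
    data_sent: int
    first_impression_id: int
    last_impression_id: int
    first_referrer: str


def summarize_session(impressions):
    """Aggregate the impressions of one session into a session fact row."""
    ordered = sorted(impressions, key=lambda imp: imp.requested_at)
    first, last = ordered[0], ordered[-1]
    return SessionFact(
        session_id=first.session_id,
        visitor_id=first.visitor_id,
        start=first.requested_at,
        duration_seconds=(last.requested_at - first.requested_at).total_seconds(),
        impressions=len(ordered),
        data_sent=sum(imp.bytes_sent for imp in ordered),
        first_impression_id=first.impression_id,
        last_impression_id=last.impression_id,
        first_referrer=first.referrer,
    )


# Hypothetical impressions for a single session.
rows = [
    Impression(1, "s1", "v1", "/", "http://search.example/?q=vitamins",
               datetime(2000, 11, 1, 10, 0), 1000),
    Impression(2, "s1", "v1", "/products", "/", datetime(2000, 11, 1, 10, 2), 2000),
    Impression(3, "s1", "v1", "/cart", "/products", datetime(2000, 11, 1, 10, 10), 500),
]
fact = summarize_session(rows)
print(fact.impressions, fact.duration_seconds, fact.data_sent)
```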

  32. Impression Fact

  33. Session Fact

  34. ETL Process and external data The ETL process can be customized to support business analysis according to: • Web server log format • External customer data • External sales data and marketing data • Other external data sources

  35. Demo and Scenarios

  36. Architecture diagram (component labels) • Oracle Warehouse Builder • Collector Server • Oracle 9iAS Clickstream Intelligence • Loader • Staging • Star Schema • Partitioning • Oracle 9i

  37. Agenda • Configuration • Basic Site Statistics • Business Scenarios

  38. DrugDepo Site Configuration

  39. Site Basic Statistics • Site: DrugDepo.com • Start Date: October 1 • End Date: October 10