Web Mining Research: A Survey

April 23rd 2014, CS332 Data Mining. Web Mining Research: A Survey. Authors: Raymond Kosala & Hendrik Blockeel. Presenter: Ryan Patterson. Outline: Introduction, Web Mining, Web Content Mining, Web Structure Mining, Web Usage Mining, Review, Exam Questions.

Presentation Transcript


  1. April 23rd 2014 CS332 Data Mining pg 01 Web Mining Research: A Survey Authors: Raymond Kosala & Hendrik Blockeel Presenter: Ryan Patterson

  2. pg 02 outline • Introduction • Web Mining • Web Content Mining • Web Structure Mining • Web Usage Mining • Review • Exam Questions

  4. pg 04 Introduction “The Web is huge, diverse, and dynamic . . . we are currently drowning in information and facing information overload.” Web users encounter problems: • Finding relevant information • Creating new knowledge out of the information available on the Web • Personalization of the information • Learning about consumers or individual users

  6. pg 06 Web Mining “Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.” Web mining subtasks: • Resource finding • Information selection and pre-processing • Generalization • Analysis

  7. pg 07 Web Mining
     Information Retrieval & Information Extraction
     • Information Retrieval (IR)
       • the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant as possible
     • Information Extraction (IE)
       • transforming a collection of documents into information that is more readily digested and analyzed
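
The IR goal stated on this slide (retrieve all relevant documents, as few non-relevant ones as possible) is commonly quantified with recall and precision. These two measures are not named on the slide. The sketch below is only an illustration: the function name and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: 4 documents returned, 3 documents truly relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

High recall corresponds to "all relevant documents"; high precision corresponds to "as few of the non-relevant as possible".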

  8. pg 08 Live demo

  10. pg 10 Web Content Mining
      Information Retrieval View
      Unstructured Documents
      • Most approaches utilize a “bag of words” representation to generate document features
        • ignores the sequence in which the words occur
      • Document features can be reduced with selection algorithms
        • e.g. information gain
      • Possible alternative document feature representations:
        • word positions in the document
        • phrases/terms (e.g. “annual interest rate”)
      Semi-Structured Documents
      • Utilize additional structural information gleaned from the document
        • HTML markup (intra-document structure)
        • HTML links (inter-document structure)
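
The "bag of words" representation and feature selection by information gain can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system in the survey: the toy corpus, the class labels, and all function names are hypothetical, and tokenization is a plain whitespace split.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Map a document to word counts; word order is discarded."""
    return Counter(text.lower().split())

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder

# Hypothetical toy corpus with two classes.
texts = ["annual interest rate rises", "interest rate falls",
         "team wins match", "team loses match"]
labels = ["finance", "finance", "sports", "sports"]
docs = [bag_of_words(t) for t in texts]

# Word order does not matter in this representation.
same_bag = bag_of_words("rate interest annual") == bag_of_words("annual interest rate")

# "interest" separates the classes perfectly; "rises" occurs in one document only.
gain_interest = information_gain(docs, labels, "interest")
gain_rises = information_gain(docs, labels, "rises")
print(same_bag, gain_interest > gain_rises)
```

Keeping only the words with the highest gain is one way to reduce the number of document features.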

  11. pg 11 Web content mining, IR unstructured documents

  12. pg 12 Web content mining, IR semi structured documents

  13. pg 13 Web Content Mining
      Database View
      “the Database view tries . . . to transform a Web site to become a database so that . . . querying on the Web become[s] possible.”
      • Uses Object Exchange Model (OEM)
        • represents semi-structured data by a labeled graph
      • Database view algorithms typically start from manually selected Web sites
        • site-specific parsers
      • Database view algorithms produce:
        • document-level schemas or DataGuides
          • structural summary of semi-structured data
        • frequent substructures (sub-schema)
        • multi-layered database
          • each layer is obtained by generalizations on lower layers
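
The idea of a labeled graph with a structural summary can be illustrated with a small sketch. It is heavily simplified and rests on assumptions: the data is a tree (no cycles or shared objects, unlike general OEM graphs), the `site` object and its labels are hypothetical, and the summary is just the set of distinct label paths, which is the intuition behind a DataGuide rather than an actual DataGuide construction algorithm.

```python
# OEM-like object: a complex object is a dict mapping each edge label to a
# list of child objects; an atomic object is a plain value.
site = {
    "page": [
        {"title": ["Home"], "link": ["/about", "/news"]},
        {"title": ["About"], "author": ["webmaster"]},
    ]
}

def label_paths(node, prefix=()):
    """Collect every distinct label path reachable from `node`."""
    paths = set()
    if isinstance(node, dict):
        for label, children in node.items():
            path = prefix + (label,)
            paths.add(path)  # each path is recorded once, however often it occurs
            for child in children:
                paths |= label_paths(child, path)
    return paths

summary = sorted(".".join(p) for p in label_paths(site))
print(summary)
```

Although `title` occurs under both pages, the summary lists `page.title` only once, which is what makes it usable as a schema-like description for querying.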

  14. pg 14 Web content mining, Database view

  16. pg 16 Web Structure Mining
      “. . . we are interested in the structure of the hyperlinks within the Web itself”
      • Inspired by the study of social networks and citation analysis
        • based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)
      • Some algorithms calculate the quality/relevancy of each Web page
        • e.g. PageRank
      • Others measure the completeness of a Web site
        • measuring frequency of local links on the same server
        • interpreting the nature of hierarchy of hyperlinks on one domain
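
A link-based quality score in the spirit of PageRank can be sketched with a simple power iteration. This is a simplified textbook-style version, not production code: the damping factor of 0.85, the fixed iteration count, the uniform handling of pages without outgoing links, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank; `links` maps each page to the pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal shares to its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page `c` has the most incoming links and ends up with the highest score, and `a` benefits from being linked by the highly ranked `c`.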

  18. pg 18 Web Usage Mining
      “. . . focuses on techniques that could predict user behavior while the user interacts with the Web.”
      • Web usage is mined by parsing Web server logs
        • mapped into relational tables → data mining techniques applied
        • log data utilized directly
      • Users connecting through proxy servers, and users or ISPs caching Web data, decrease server log accuracy
      • Two applications:
        • personalized - user profile or user modeling in adaptive interfaces
        • impersonalized - learning user navigation patterns
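
Parsing server logs into relational-style rows might look like the sketch below. It assumes the Common Log Format; the sample log lines, the regular expression, and the function names are hypothetical. Grouping requests by client host is used as a stand-in for "user", which is exactly where the proxy and caching problems noted above reduce accuracy.

```python
import re
from collections import defaultdict

# Assumed input: Common Log Format, e.g.
# host ident user [time] "METHOD path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_log(lines):
    """Turn raw log lines into (host, time, path, status) rows."""
    rows = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # lines that do not match the format are skipped
            rows.append((m["host"], m["time"], m["path"], int(m["status"])))
    return rows

def navigation_paths(rows):
    """Group successful requests by host, in log order."""
    paths = defaultdict(list)
    for host, _time, path, status in rows:
        if status == 200:
            paths[host].append(path)
    return dict(paths)

# Made-up log lines for illustration.
sample = [
    '10.0.0.1 - - [23/Apr/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [23/Apr/2014:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.1 - - [23/Apr/2014:10:00:09 +0000] "GET /products.html HTTP/1.1" 200 2210',
    '10.0.0.1 - - [23/Apr/2014:10:00:15 +0000] "GET /missing.html HTTP/1.1" 404 512',
]
rows = parse_log(sample)
paths = navigation_paths(rows)
print(paths)
```

The rows could be loaded into a relational table, and the per-host sequences are the raw material for learning navigation patterns.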

  20. pg 20 Review
      • Web mining
        • 4 subtasks
        • IR & IE
      • Web content mining
        • primarily intra-page analysis
        • IR view vs DB view
      • Web structure mining
        • primarily inter-page analysis
      • Web usage mining
        • primarily analysis of server activity logs

  21. pg 21 Web mining categories

  24. pg 24 Exam Question 1
      Q: Of the following Web mining paradigms:
      • Information Retrieval
      • Information Extraction
      Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.
      A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevance to the search query.

  26. pg 26 Exam Question 2
      Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.
      A: Either of the following will result in decreased server log accuracy:
      • Users connecting to a Web site through a proxy server
      • Users (or their ISPs) utilizing Web data caching
      Accurate server logs are required for accurate Web usage mining.

  28. pg 28 Exam Question 3 Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents? A: “Bag of words” representation.
