
WEB STRUCTURE MINING

WEB STRUCTURE MINING. SUBMITTED BY: BLESSY JOHN, R7A, ROLL NO: 18. INTRODUCTION. Web mining is the application of data mining techniques to web data, for example in search engines.


Presentation Transcript


  1. WEB STRUCTURE MINING • SUBMITTED BY: BLESSY JOHN, R7A, ROLL NO: 18

  2. INTRODUCTION • Web mining is the application of data mining techniques to web data, for example in search engines • Data mining is the process of discovering useful knowledge from data sources • Web mining automatically discovers and extracts information from Web documents • Web structure mining discovers useful information from hyperlinks

  3. WEB MINING • Extraction of useful patterns from WWW resources • The WWW is a widely distributed, global information service centre that constitutes a rich source for data mining • Employs techniques from data mining, information retrieval, etc.

  4. NEED FOR WEB MINING • Aims at finding and extracting relevant information that is hidden in web-related data • The challenge is to recover the semantics of hypertext documents • The goal is to turn web data into web knowledge

  5. CLASSIFICATION

  6. WEB STRUCTURE MINING • Generates a structural summary of a website and its web pages • Uses graph theory to analyse the node and connection structure of a website • Analyses the link structure of the web in order to identify the more preferable documents

  7. WEB STRUCTURE MINING (cont.) • Discovers the nature of the hierarchy of hyperlinks in a website and its structure • A hyperlink signals the author's endorsement of the linked web page • Retrieves information about the relevance and quality of a web page

  8. Page Layout and Link Analysis for Web Images

  9. WEB BASICS • The web is a huge collection of documents linked together by references • References from one document to another are hypertext links embedded in HTML • HTML describes how a document should be displayed in the browser window • Each web document has a web address, called a URL, that identifies it uniquely
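
To make the idea of references embedded in HTML concrete, here is a minimal sketch that pulls the hyperlinks out of a page using Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links`, and the sample HTML and URLs are illustrative assumptions, not part of the original slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the target URL of every <a href="..."> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, for illustration only.
sample_html = (
    '<p>See <a href="/about.html">about</a> and '
    '<a href="http://example.org/x">x</a>.</p>'
)
links = extract_links(sample_html, "http://example.com/index.html")
print(links)
```

Each extracted (source URL, target URL) pair is one edge of the web graph that structure mining works on.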

  10. WEB CRAWLERS • Collect “all” web documents by browsing the Web systematically and exhaustively • The region of the web to be crawled can be specified using the URL structure • Used by search engines to provide local access to the most recent versions of possibly all web pages
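
The crawling idea on this slide can be sketched as a breadth-first traversal that only follows links whose URL starts with a given prefix. To keep the example self-contained it crawls a hypothetical in-memory "web" (`TOY_WEB`) rather than fetching real pages; the function name `crawl` and its parameters are assumptions made for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs that page links to.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.org/": ["http://example.com/b"],
}


def crawl(seed, fetch_links, url_prefix):
    """Visit pages breadth-first, following only links inside url_prefix."""
    visited = []          # pages in the order they were crawled
    seen = {seed}         # URLs already queued, to avoid revisiting
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link.startswith(url_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(
    "http://example.com/",
    lambda url: TOY_WEB.get(url, []),   # stand-in for a real page fetcher
    "http://example.com/",
)
print(pages)
```

In a real crawler the `fetch_links` argument would download the page and extract its links; politeness rules, error handling, and re-crawling for freshness are left out of this sketch.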

  11. INDEXING AND KEYWORD SEARCH • There are two types of data: structured and unstructured • Structured data have keys associated with each data item that reflect its content • Keyword search is content-based access to unstructured data that does not consider the meaning of the text
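
One common way to support keyword search over unstructured text is an inverted index that maps each term to the documents containing it. The slides do not name a data structure, so the following is only a sketch under that assumption; `build_index`, `search`, and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {1: "web structure mining", 2: "web content mining", 3: "graph structure"}
index = build_index(docs)
hits = search(index, "web mining")
print(hits)
```

The lookup matches terms literally and ignores their meaning, which is exactly the limitation the slide points out.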

  12. DOCUMENT REPRESENTATION • To facilitate the matching of keywords and documents, some preprocessing steps are taken first: • Documents are tokenized • Characters are converted to upper or lower case • Words are reduced to a canonical form • Stopwords are usually removed
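
The preprocessing steps listed above can be sketched as a small pipeline. The stopword list and the suffix-stripping rule (`crude_stem`) are deliberately simplistic stand-ins chosen for illustration; a real system would typically use a full stopword list and an established stemmer.

```python
import re

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # 1. lowercase, 2. tokenize on letters/digits,
    # 3. drop stopwords, 4. reduce words to a canonical form.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


terms = preprocess("The crawling of Web pages and hyperlinks")
print(terms)
```

The same function would be applied to both documents and queries so that their terms match after normalization.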

  13. ALGORITHMS • Two main algorithms are used in web structure mining: 1. HITS (Hypertext-Induced Topic Search) 2. PageRank

  14. HITS (Hypertext-Induced Topic Search) • Link analysis algorithm • Rates web pages • Developed by Jon Kleinberg • Determines two values for each page • Authority: estimates the value of the content of the page • Hub: estimates the value of its links to other pages

  15. Hubs and Authorities • Hub pages collect links that point to authorities, i.e. to relevant pages • Authorities are the targets of hub pages

  16. Hubs and Authorities (cont.) • Authority and hub values are defined in terms of one another in a mutual recursion • HITS is executed at query time, with the associated hit on performance
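
The mutual recursion between hub and authority values can be sketched as an iterative update: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the fixed iteration count, the Euclidean normalization, and the toy graph are assumptions made for this illustration.

```python
import math


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                new_auth[t] += hub[src]
        # Hub update: sum the authority scores of all pages linked to.
        new_hub = {n: sum(new_auth[t] for t in graph.get(n, [])) for n in nodes}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical graph: h1 and h2 act as hubs, a1 and a2 as authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph `a1` ends up with the highest authority score because both hubs point to it, and `h1` is the stronger hub because it points to both authorities.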

  17. PageRank • Link analysis algorithm • Assigns a numerical weight to each element of a hyperlinked set of documents • The weight of an element E is denoted PR(E) • Relies on the uniquely democratic nature of the web • A link from page A to page B is a vote, by page A, for page B

  18. PageRank (cont.) • A vote cast by a page A that is itself important weighs more heavily and helps to make B important • PageRank is also a probability distribution: it represents the probability that a person randomly clicking on links arrives at any particular page • A PageRank of 0.5 means a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank
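
The voting and probability interpretation can be sketched with a simple power iteration. The slides give no formula, so this follows the commonly cited formulation with a damping factor; the value 0.85, the uniform handling of pages without out-links, the function name `pagerank`, and the toy graph are all assumptions of this sketch.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}   # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in nodes if not graph.get(p))
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in nodes
        }
        # Each page splits its rank evenly among the pages it votes for.
        for src in nodes:
            targets = graph.get(src, [])
            for t in targets:
                new_rank[t] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical three-page web: B receives votes from both A and C.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
pr = pagerank(graph)
print({page: round(score, 3) for page, score in pr.items()})
```

Because the scores form a probability distribution they sum to 1, and B, which collects votes from both other pages, ends up with the largest share.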

  19. APPLICATIONS • Information retrieval in social networks • Finding out the relevance of each web page • Measuring the completeness of websites • Used in search engines to find relevant information

  20. CONCLUSION • Search engines use web structure mining to find information • New knowledge can be created out of the available information • Web content mining can be combined with it to enhance the performance of search engines

  21. Thank You!

  22. Questions ?
