By Uday Kumar

WEB MINING

Agenda
• World Wide Web - a brief history
• Introduction to Data Mining
• Data Mining Process & Techniques
• Web Mining
• Data Mining vs. Web Mining
• Classification of Web Mining
• Benefits & Application Areas of Web Mining
• Web Mining Software
• Summary

World Wide Web - a brief history
Who invented the World Wide Web? (Sir) Tim Berners-Lee invented it in 1989 while working at CERN, including the URL scheme and HTML, and in 1990 wrote the first server (httpd) and the first browser.
• The Web's characteristics:
  • billions of documents authored by millions of diverse people
  • distributed over millions of computers, connected by a variety of media
  • large size, dynamic content, a time dimension, and multilingual content
  • different data types: text, images, hyperlinks, and user usage information

Mining Large Data Sets - Motivation
• There is often information "hidden" in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all

Data Mining

Data Mining - Definition
• It is commonly defined as the process of extracting meaningful information from data sources, e.g., databases, text, images, and the web.
• It is the process of automatically extracting and generating predictive information from large data banks, which lets us understand current market trends and take proactive measures to gain the maximum benefit from them.

Data Mining Process

Data Mining Tasks
• Data mining makes use of various algorithms to perform a variety of tasks. These algorithms examine the sample data of a problem and determine a model that closely fits the problem.
• A predictive model lets you predict the values of data by making use of known results from a different set of sample data. The tasks that form part of the predictive model are:
  • Classification
  • Regression
  • Time Series Analysis

Data Mining Tasks Contd.
• A descriptive model lets you determine the patterns and relationships in sample data. The tasks that form part of the descriptive model are:
  • Clustering
  • Summarization
  • Association rules
  • Sequence discovery

Data Mining Tasks Contd.
• Classification: lets you classify data in a large data bank into a predefined set of classes. Ex: people with age less than 40 and salary > 40k trade online.
• Regression: lets you forecast data values based on present and past values. Ex: helps an organization predict the need for recruiting new employees and making purchases based on the past and current growth rate.
• Time Series Analysis: lets you predict future values when the current set of values is time dependent (monthly, yearly, ...).
• Summarization: lets you summarize a large chunk of data, such as the content of a web page.

Data Mining Tasks Contd.
• Clustering: lets you create new groups (clusters) based on the study of patterns and relationships between values of data in a data bank. It is similar to classification but does not require you to predefine the groups (also called unsupervised learning). Ex: users A and B access similar URLs.
• Association Rules: define rules of association between data items and then use those rules to establish relationships. Ex: find the items that tend to be purchased together and specify their relationship.
• Sequence Discovery: lets you determine the sequential patterns that might exist in a large and unorganized data bank. Ex: crime detection.
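The "items purchased together" example for association rules can be sketched with the two usual measures, support and confidence. This is a minimal illustration restricted to item pairs, not a full algorithm such as Apriori; the function name `pair_rules`, the `min_support` threshold, and the baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for pairwise rules a -> b."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore repeats within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support >= min_support:
            # confidence(a -> b) = count(a and b) / count(a)
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical shopping baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
rules = pair_rules(baskets, min_support=0.5)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Here every basket containing butter also contains bread, so the rule butter -> bread has confidence 1.0, while butter and milk appear together too rarely to pass the support threshold.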

Data Mining Techniques
• Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. Any technique that helps extract more out of your data is useful. Data mining techniques include:
• Statistical techniques: statistics is the branch of mathematics that deals with the collection and analysis of numerical data using various methods and techniques.
• Machine learning: the process of building a computer system that is capable of acquiring data and integrating it to generate useful knowledge.
• Decision trees: a decision tree is a tree-shaped structure in which each branch represents a classification question, while the leaves of the tree represent the partitions of classified information.

Data Mining Techniques Contd.
• Hidden Markov Models: let you predict future actions in a time series. The model provides the probability of a future event, given the present and previous events.
• Neural networks: a large set of historical data is analyzed in order to predict the outcome of a particular future situation or problem.
• Genetic algorithms: given a set of sample data, a genetic algorithm (GA) lets you determine the best possible model out of a set of models to represent that data.

Data Mining vs. Web Mining
Traditional data mining
• data is structured and relational
• well-defined tables, columns, rows, keys, and constraints
Web data
• semi-structured (HTML documents) and unstructured (free text)
• readily available data
• rich in features and patterns

Problems when interacting with the Web
• Finding relevant information
• Creating new knowledge out of the information available on the Web
• Personalization of the information
• Learning about consumers or individual users

Web Mining

Web Mining - Definition
• "Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from the Web data."
• The web mining process is similar to the data mining process; the difference is usually in the data collection.
• In data mining, the data is often already collected and stored in a data warehouse.
• In web mining, data collection can be a substantial task, especially for web structure and content mining, which involve crawling a large number of target web pages.

Web Mining - Subtasks
• Resource finding
  • Retrieving intended documents
• Information selection/pre-processing
  • Select and pre-process specific information from selected documents
• Generalization
  • Discover general patterns at individual web sites as well as across multiple web sites
• Analysis
  • Validation and/or interpretation of mined patterns

Web Mining Contd.
Web Mining is not IR:
• Information retrieval (IR) is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible
Web Mining is not IE:
• Information extraction (IE) aims to extract the relevant facts from given documents
• IE systems for the general Web are not feasible
• Most focus on specific web sites or content

Classification of Web Mining

Web Usage Mining
• Web Usage Mining refers to the discovery of user access patterns from web usage logs, which record every click made by each user.
• The usage data records the user's behavior while browsing or making transactions on the web site. It is mined in order to better understand and serve the needs of users or web-based applications.
• It is an activity that involves the automatic discovery of patterns from one or more web servers.

Web Usage Mining Contd.
• Organizations often generate and collect large volumes of data; most of this information is generated automatically by web servers and collected in server logs.
• Analyzing such data can help these organizations determine:
  • the value of particular customers
  • cross-marketing strategies across products
  • the effectiveness of promotional campaigns, etc.
• Typical sources of data:
  • automatically generated data stored in server access logs, proxy server logs, referrer logs, browser logs, bookmark data, mouse clicks and scrolls, and client-side cookies
  • user profiles
  • metadata: page attributes, content attributes, usage data

Web Usage Mining Contd.
• The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
  • the number of accesses to the server
  • the times or time intervals of visits
  • the domain names and the URLs of users of the web server
• Two main categories:
  • Learning a user profile (personalized): web users would be interested in techniques that learn their needs and preferences automatically
  • Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their web site or steer users towards the goals of the site

Web Usage Mining Contd.
• Web servers, web proxies, and client applications can quite easily capture web usage data.
• Web server log: records every visit to the pages, which files were requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
• By analyzing the web usage data, web mining systems can discover useful knowledge about a system's usage characteristics and the users' interests, which has various applications:
  • Personalization and collaboration in web-based systems
  • Marketing
  • Web site design and evaluation
  • Decision support
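The log fields listed above (IP address, requested file, time, status code, bytes sent, browser) can be pulled out of a log line with a regular expression. The sketch below assumes the widely used Common/Combined Log Format; the pattern name `LOG_PATTERN`, the function `parse_log_line`, and the sample line are hypothetical and not taken from any real server.

```python
import re

# Assumed layout: Common Log Format, optionally followed by the two
# extra quoted fields (referrer, user agent) of the Combined format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Servers write "-" when no body bytes were sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Made-up log line for illustration (documentation IP range).
sample = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /products/index.html HTTP/1.1" 200 2326 '
    '"http://example.com/start.html" "Mozilla/5.0"'
)
entry = parse_log_line(sample)
print(entry["ip"], entry["path"], entry["status"], entry["bytes"], entry["agent"])
```

Once lines are parsed into dictionaries like this, counting accesses per page or per visitor is a matter of grouping on the `path` or `ip` field.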

Web Server Log - A Sample

Web Usage Mining Contd.
• The technique of retrieving visitor information from web server log files and applying it to analyze data is known as Web Log Mining.
• The major types of log files are:
  • Access log: maintains a list of all the web pages that visitors have requested.
  • Agent log: contains information about the browser that was used to explore the various web pages.

Web Content Mining
• Web Content Mining (WCM) extracts or mines useful information or knowledge from web page contents.
• Patterns are extracted from online sources such as:
  • HTML files
  • Text documents
  • Images
  • E-books or email messages
  • Audio or video
• The concept of WCM is far wider than searching for a specific term, keyword extraction, or simple statistics of words and phrases in documents.
• A tool that performs WCM can summarize a web page so that you need not read the complete document, saving you time and energy.

Web Content Mining Contd.
• The two basic approaches or models for implementing WCM are:
  • Local knowledge base model: the abstract characterizations of several web pages are stored locally (i.e., references to several web sites relating to the categories are stored in a database, and based on the selected category the search is performed within those web sites).
  • Agent-based model: this approach applies artificial intelligence systems known as web agents, which can perform a search on behalf of a particular user to discover and organize documents on the web. Some web agents can apply individual user profiles when searching for information on the web, and organize and interpret the discovered information.

Preprocessing Content
• Content preparation:
  • Extract text from HTML.
  • Perform stemming.
  • Remove stop words.
  • Calculate collection-wide word frequencies (DF).
  • Calculate per-document term frequencies (TF).
• Vector creation:
  • Common information retrieval technique.
  • Each document (HTML page) is represented by a sparse vector of term weights.
  • Typically, additional weight is given to terms appearing as keywords or in titles.
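The preparation steps above can be strung together in a few lines. This is a rough sketch under several assumptions: the stop-word list is a tiny hypothetical one, `stem` only strips a trailing "s" as a stand-in for a real stemmer, the weight used is the common TF × log(N/DF) form, and the extra weighting for titles and keywords is left out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # illustrative only


class TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def stem(word):
    # Crude placeholder for a real stemmer: drop a plural "s".
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(html):
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(html_docs):
    """Return one sparse {term: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in html_docs]
    n = len(token_lists)
    df = Counter()  # number of documents containing each term
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # per-document term frequency
        # Terms present in every document get weight 0 and are dropped.
        vectors.append(
            {t: count * math.log(n / df[t]) for t, count in tf.items() if df[t] < n}
        )
    return vectors


# Two made-up pages.
docs = [
    "<html><title>Web mining</title><p>Mining the web logs</p></html>",
    "<p>The web of data</p>",
]
vectors = tfidf_vectors(docs)
print(vectors)
```

Storing each vector as a dictionary keeps it sparse: only terms that actually occur in the page, and that help distinguish it from other pages, take up space.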

Common Mining Techniques
• The more basic and popular data mining techniques include:
  • Classification: classification on server logs, using decision trees or a Naive Bayes classifier, to discover the profiles of users belonging to a particular category.
  • Clustering: can be used to group users exhibiting similar browsing patterns.
  • Associations: can be used to relate pages that are most often referenced together in a single server session.
• Other significant ideas are:
  • Topic identification, tracking, and drift analysis
  • Concept hierarchy creation
  • Relevance of content
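As one possible illustration of grouping users with similar browsing patterns, the sketch below compares the sets of pages each user visited with Jaccard similarity and applies a simple one-pass "leader" clustering. The similarity measure, the threshold, the function names, and the visit data are all assumptions chosen for brevity; production systems typically use more robust clustering methods.

```python
def jaccard(pages_a, pages_b):
    """Jaccard similarity of two collections of visited pages."""
    a, b = set(pages_a), set(pages_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_users(visits, threshold=0.5):
    """One-pass leader clustering.

    Each user joins the first group whose leader (its first member) is at
    least `threshold` similar; otherwise the user starts a new group.
    """
    groups = []  # list of (leader_pages, [user, ...])
    for user, pages in visits.items():
        for leader_pages, members in groups:
            if jaccard(pages, leader_pages) >= threshold:
                members.append(user)
                break
        else:
            groups.append((set(pages), [user]))
    return [members for _, members in groups]


# Hypothetical per-user page visits.
visits = {
    "A": ["/home", "/laptops", "/cart"],
    "B": ["/home", "/laptops", "/checkout"],
    "C": ["/blog", "/about"],
}
clusters = group_users(visits, threshold=0.5)
print(clusters)
```

Users A and B share two of four distinct pages (similarity 0.5) and end up together, while C shares nothing with them and forms its own group.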

Web Structure Mining
• Web Structure Mining discovers useful knowledge from hyperlinks, which represent the structure of the web.
• Web structure mining can be divided into two kinds:
  • Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different location.
  • Mining the document structure. This uses the tree-like structure to analyze and describe the HTML or XML tags within the web page.
• It applies graph theory to analyze the node and connection structure of a web site.
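Extracting hyperlinks is the first step towards building the node-and-connection graph of a site. A minimal sketch using Python's standard `html.parser` and `urllib.parse.urljoin` follows; the class name `LinkExtractor`, the helper `out_links`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href="..."> tags as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def out_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Made-up page content.
page = '<p><a href="/about.html">About</a> <a href="http://example.org/">Ext</a></p>'
links = out_links("http://example.com/index.html", page)
print(links)
```

Running this over every crawled page yields a mapping from each page to the pages it links to, which is the graph that link-analysis methods operate on.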

Web Structure Mining Contd.
• Web structure is a useful source for extracting information such as:
  • Web page classification
    • Classifying web pages according to various topics
  • Quality of a web page
    • The authority of a page on a topic
    • Ranking of web pages
  • Which pages to crawl
    • Deciding which web pages to add to the collection of web pages
  • Finding related pages
    • Given one relevant page, find all related pages

Web Structure Mining Contd.
Hyperlink-Induced Topic Search (HITS) is a common algorithm for knowledge discovery in the Web. The concept of HITS is the identification of:
• Authorities: authoritative, high-quality web pages on broad topics
• Hubs: web pages that link to a collection of authorities
• A good authority is pointed to by many good hubs
• A good hub points to many good authorities
Web structure mining has been largely influenced by research in:
• Social network analysis
• Citation analysis (bibliometrics)
  • in-links: the hyperlinks pointing to a page
  • out-links: the hyperlinks found in a page
  • Usually, the larger the number of in-links, the better a page is.
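The mutual reinforcement between hubs and authorities can be sketched as an iterative computation. The code below follows the usual textbook formulation of HITS (authority score = sum of the hub scores of in-linking pages, hub score = sum of the authority scores of linked pages, followed by normalization); the fixed iteration count and the tiny graph are assumptions for illustration.

```python
import math


def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to.

    Returns (hub, authority) score dictionaries.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # A good hub points to many good authorities.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: three hub-like pages pointing at two targets.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
print(sorted(auth.items(), key=lambda item: -item[1]))
```

In this graph `a1` receives links from all three hubs and therefore ends up with the highest authority score, while `h3`, which points to only one authority, is a weaker hub than `h1` and `h2`.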

Web Structure Mining Contd.
• Each web page is a node of the web graph.
• The out-degree of a node is the number of distinct links originating at that node to other nodes.
• The probability, at any step, that the person will continue is a damping factor d = 0.85.
• N: number of web pages
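The quantities on this slide (out-degree, damping factor d, and N) are the ingredients of PageRank. A commonly cited form, assumed here, is PR(p) = (1 - d)/N + d × Σ PR(q)/outdegree(q), summed over the pages q that link to p. The sketch below iterates that update; spreading the rank of pages with no out-links evenly over all pages is one common convention and is likewise an assumption, as is the three-page graph.

```python
def pagerank(graph, d=0.85, iterations=100):
    """graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1 - d) / n + d * dangling / n for p in pages}
        for p, targets in graph.items():
            distinct = set(targets)  # out-degree counts distinct links
            for q in distinct:
                new_rank[q] += d * rank[p] / len(distinct)
        rank = new_rank
    return rank


# Hypothetical web graph with N = 3 pages.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(graph)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

Page C is linked from both A and B and comes out on top; A follows because it receives all of C's rank, and B trails because it only receives half of A's.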

Application Areas of Web Mining
• E-commerce
• Search engines
• Personalization
• Website design
• Web mining applications:
  • Amazon.com
  • Google
  • DoubleClick
  • AOL
  • eBay
  • MyYahoo
  • CiteSeer
  • I-MODE
  • v-TAG Web Mining Server

Applications Contd.
Amazon
• A host of web mining techniques, e.g., associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a "store visit".
• Knowledge gained from web mining is the key intelligence behind Amazon's features such as "instant recommendations", "purchase circles", "wish-lists", etc.

Applications Contd.
Google
• Earlier search engines concentrated on web content to return the pages relevant to a query. Google was the first to introduce the importance of link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products.
• The PageRank technology, which makes use of the structural information of the web graph, is the key to returning quality results relevant to a query.

Benefits of Web Mining
• Match your available resources to visitor interests
• Increase the value of each visitor
• Improve the visitor's experience at the website
• Perform targeted resource management
• Collect information in new ways
• Test the relevance of content and web site architecture

Web Mining Software
• Web Miner
• Sinope Summarizer
• Teleport Pro
• Click Tracks

Summary
• Major limitations of web mining research:
  • Difficult to collect web usage data across different web sites.
  • Lack of suitable test collections that can be reused by researchers.
• Future research directions:
  • Multimedia data mining: a picture is worth a thousand words.
  • Multilingual knowledge extraction: web page translations.
  • The Hidden Web: forms, dynamically generated web pages.
  • Semantic Web
  • Wireless Web: WML and HDML.

Thank You
