Web Mining Dr. Tao Li Florida International University

Mining the World-Wide Web • The WWW is a huge, widely distributed, global information service center for: • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. • Hyperlink information • Access and usage information • The WWW provides rich sources for data mining • Challenges • Too huge for effective data warehousing and data mining • Too complex and heterogeneous: no standards and no uniform structure

Mining the World-Wide Web • Growing and changing very rapidly • Broad diversity of user communities • Only a small portion of the information on the Web is truly relevant or useful • 99% of the Web information is useless to 99% of Web users • How can we find high-quality Web pages on a specified topic?

Web search engines • Index-based: search the Web, index Web pages, and build and store huge keyword-based indices • Help locate sets of Web pages containing certain keywords • Deficiencies • A topic of any breadth may easily contain hundreds of thousands of documents • Many documents that are highly relevant to a topic may not contain the keywords defining it (the synonymy problem)
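
A minimal sketch of a keyword-based index and its main deficiency, assuming a toy in-memory corpus. The page names and the functions `build_index` and `search` are hypothetical and stand in for no real search engine.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase keyword to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return ids of pages that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages, for illustration only.
pages = {
    "a.example/cars": "used car prices and reviews",
    "b.example/autos": "automobile dealers and automobile reviews",
    "c.example/jaguar": "jaguar the big cat",
}
index = build_index(pages)
hits = search(index, "car reviews")
```

Here `hits` contains only `a.example/cars`: the page about automobiles covers the same topic but never uses the query keyword, so a purely keyword-based index misses it.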

Web Mining: A more challenging task • Searches for • Web access patterns • Web structures • Regularity and dynamics of Web contents • Problems • The “abundance” problem • Limited coverage of the Web: hidden Web sources, majority of data in DBMS • Limited query interface based on keyword-oriented search • Limited customization to individual users

Web Mining Taxonomy • Web Mining • Web Content Mining: Web Page Content Mining; Search Result Mining • Web Structure Mining • Web Usage Mining: General Access Pattern Tracking; Customized Usage Tracking

Mining the World-Wide Web: Web Content Mining • Web Page Content Mining • Web Page Summarization • WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998), …: Web structuring query languages; can identify information within given web pages • Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other web pages • ShopBot (Etzioni et al. 1997): looks for product prices within web pages

Mining the World-Wide Web: Web Content Mining • Search Result Mining • Search Engine Result Summarization • Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles and snippets

Mining the World-Wide Web: Web Structure Mining • Using Links • PageRank (Brin et al., 1998) • CLEVER (Chakrabarti et al., 1998) • Use interconnections between web pages to give weight to pages • Using Generalization • MLDB (1994), VWV (1998) • Uses a multi-level database representation of the Web; counters (popularity) and link lists are used for capturing structure

Mining the World-Wide Web: Web Usage Mining • General Access Pattern Tracking • Web Log Mining (Zaïane, Xin and Han, 1998) • Uses KDD techniques to understand general access patterns and trends • Can shed light on better structure and grouping of resource providers

Mining the World-Wide Web: Web Usage Mining • Customized Usage Tracking • Adaptive Sites (Perkowitz and Etzioni, 1997) • Analyzes the access patterns of one user at a time • The Web site restructures itself automatically by learning from user access patterns

Social Network Analysis • The Web as a hyperlink graph • Evolves organically • No central coordination • Yet shows global and local properties • Social network analysis • Well established long before the Web • Popularity estimation for queries • Measurements on the Web and the reach of search engines • The Web: an example of a social network

Social Network • Properties related to connectivity and distances in graphs • Applications • Epidemiology, espionage: • Identifying a few nodes to be removed to significantly increase average path length between pairs of nodes. • Citation analysis • Identifying influential or central papers.
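
A minimal sketch of the path-length application above, assuming an undirected graph stored as adjacency lists. The graph (a ring of six nodes plus one hub) and the function name `average_path_length` are hypothetical, chosen only to show how removing a single well-connected node raises the average path length.

```python
from collections import deque

def average_path_length(graph, removed=frozenset()):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total, pairs = 0, 0
    for start in graph:
        if start in removed:
            continue
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in removed and nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# Hypothetical network: nodes 0..5 form a ring, and "hub" touches all of them.
graph = {i: [(i - 1) % 6, (i + 1) % 6, "hub"] for i in range(6)}
graph["hub"] = list(range(6))

before = average_path_length(graph)
after = average_path_length(graph, removed={"hub"})
```

With the hub present, every pair of nodes is at most two links apart; once it is removed, opposite nodes on the ring are three links apart, so `after` is larger than `before`.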

Limitations of text-based analysis • Text-based ranking function • E.g., could www.harvard.edu be recognized as one of the most authoritative pages when many other web pages contain “harvard” more often? • Pages are not sufficiently self-descriptive • Usually the term “search engine” doesn't appear on search engine web pages

Bow-tie Theory

Exploiting link structure • Ranking search results • Keyword queries not selective enough • Use graph notions of popularity/prestige • PageRank and HITS • Supervised and unsupervised learning • Hyperlinks and content are strongly correlated • Learn to approximate joint distribution • Learn discriminants given labels

What are the benefits of link building? • Following a link is one of the most popular ways for people to find new sites. • By providing links to other material people don't have to re-invent the wheel. • Inbound links help to build trust. • Link structure and link text provide a lot of information for making relevance judgments and quality filtering • The link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us the most leverage.

Link-based Ranking Strategies • Leverage the “abundance problem” inherent in broad queries • Google’s PageRank [Brin and Page, WWW7] • A measure of prestige associated with every page on the Web • HITS: Hyperlink-Induced Topic Search [Jon Kleinberg ’98] • Use the query to select a sub-graph from the Web • Identify “hubs” and “authorities” in the sub-graph

Google (PageRank): Overview • Pre-computes a rank vector • Provides a priori (offline) importance estimates for all pages on the Web • Independent of the search query • In-degree ≈ prestige • Not all votes are worth the same • The prestige of a page is the sum of the prestige of the pages citing it: $p = E^T p$ • Pre-compute a query-independent prestige score • Query time: prestige scores are used in conjunction with query-specific IR scores

Google (PageRank) • Assumption: the prestige of a page is proportional to the sum of the prestige scores of the pages linking to it • Random surfer on a strongly connected web graph • E is the adjacency matrix of the Web • No parallel edges • Matrix L is derived from E by normalizing all row-sums to one: $L[u,v] = E[u,v] / N_u$, where $N_u = \sum_v E[u,v]$ is the out-degree of page u

The PageRank • After the ith step: $p_{i+1} = L^T p_i$ • Convergence to the stationary distribution of L • $p \to$ principal eigenvector of $L^T$, called the PageRank • Convergence criteria • L is irreducible: there is a directed path from every node to every other node • L is aperiodic: for all u and v, there are paths with all possible numbers of links on them, except for a finite set of path lengths
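
A minimal sketch of this iteration, assuming a tiny hypothetical graph that is both irreducible and aperiodic, so the convergence criteria hold. The function names `row_normalize` and `power_iteration` are illustrative, and the update used is p ← Lᵀp starting from the uniform distribution.

```python
def row_normalize(E):
    """Build L from adjacency matrix E by scaling each row to sum to one."""
    L = []
    for row in E:
        out_degree = sum(row)
        # Assumes every page has at least one out-link (no rank sinks).
        L.append([x / out_degree for x in row])
    return L

def power_iteration(L, steps=100):
    """Repeat p <- L^T p, starting from the uniform distribution."""
    n = len(L)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = [sum(L[u][v] * p[u] for u in range(n)) for v in range(n)]
    return p

# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 2 -> 2.
E = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
]
p = power_iteration(row_normalize(E))
```

For this graph the vector settles at (2/7, 1/7, 4/7): page 2 collects the most prestige because both other pages link to it.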

The surfing model • Correspondence between the “surfer model” and the notion of prestige • Page v has high prestige if its visit rate is high • This happens if there are many neighbors u with high visit rates leading to v • Deficiency • The Web graph is not strongly connected: only about a fourth of the graph is! • The Web graph is not aperiodic • Rank-sinks • Pages without out-links • Directed cyclic paths

Surfing model: simple fix • Two-way choice at each node • With probability d (0.1 < d < 0.2), the surfer jumps to a random page on the Web • With probability 1 - d, the surfer chooses, uniformly at random, an out-neighbor • Modified equation: $p_{i+1} = (1-d)\, L^T p_i + \frac{d}{N} (1, \ldots, 1)^T$, where N is the number of pages • Direct solution of the eigen-system is not feasible • Solution: power iterations
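
A minimal sketch of the fixed surfing model with power iterations, assuming d = 0.15 and a hypothetical four-page graph. The function name `pagerank` and the handling of pages without out-links (a uniform jump) are assumptions made for this illustration; the slides do not say how such pages are treated.

```python
def pagerank(out_links, d=0.15, steps=100):
    """Power iteration where d is the probability of a random jump."""
    n = len(out_links)
    p = [1.0 / n] * n
    for _ in range(steps):
        nxt = [d / n] * n  # share received from random jumps
        for u, targets in enumerate(out_links):
            if targets:
                share = (1 - d) * p[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:
                # Assumption: a page without out-links jumps uniformly.
                for v in range(n):
                    nxt[v] += (1 - d) * p[u] / n
        p = nxt
    return p

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 1 -> 3, 2 -> 0; page 3 is a sink.
out_links = [[1, 2], [2, 3], [0], []]
ranks = pagerank(out_links)
order = sorted(range(len(ranks)), key=lambda i: -ranks[i])
```

The scores always sum to one, and only their relative order in `order` would be used for ranking.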

PageRank (simple structure of the Google search engine) • [Figure: offline, TextIndex() builds an inverted text index from the Web and PageRank() computes page ranks; at query time, the query processor uses both to return ranked results for a query]

PageRank architecture at Google • The ranking of pages is more important than the exact values of $p_i$ • Convergence of page ranks in 52 iterations for a crawl with 322 million links • Pre-compute and store the PageRank of each page • PageRank is independent of any query or textual content • The ranking scheme combines PageRank with textual match • Unpublished • Many empirical parameters, human effort, and regression testing • Criticism: ad hoc coupling and decoupling between relevance and prestige

Authorities and Hubs • A good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities • This is a mutually reinforcing relationship • Authorities are pages that contain the most definitive, central, and useful information in the context of particular topics • Hubs are pages that link to a collection of prominent sites on a common topic

HITS • Relies on query-time processing • To select the base set $V_q$ of links for query q, constructed by • Selecting a sub-graph R from the Web (root set) relevant to the query • Selecting any node u which neighbors any $r \in R$ via an inbound or outbound edge (expanded set) • To deduce the hubs and authorities that exist in a sub-graph of the Web • Every page u has two distinct measures of merit: its hub score h[u] and its authority score a[u] • Recursive quantitative definitions of hub and authority scores

HITS (Hyperlink-Induced Topic Search) • The focused subgraph is created by first taking the highest-ranked pages from a text-based search engine as a root set R. • R is expanded into the base set S by taking all sites pointing to or pointed at by a site in R. • Note that while R may fail to contain some “important” authorities, S will probably contain them. • [Figure: the root set R nested inside the base set S]
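
A minimal sketch of the expansion from root set to base set, assuming hyperlinks are available as (source, target) pairs. The page names and the function `expand_root_set` are hypothetical.

```python
def expand_root_set(root, links):
    """Return the base set: root pages plus pages linking to or from them."""
    base = set(root)
    for source, target in links:
        if source in root or target in root:
            base.add(source)
            base.add(target)
    return base

# Hypothetical hyperlinks as (source, target) pairs.
links = [
    ("h1", "r1"),  # h1 points to a root page
    ("r1", "a1"),  # a root page points to a1
    ("h1", "a2"),  # neither end is in the root set
    ("x", "y"),    # unrelated link
    ("r2", "a1"),
]
root = {"r1", "r2"}
base = expand_root_set(root, links)
```

The page `a2` stays outside the base set because it is two links away from the root set, which matches an expansion of radius one.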

Computing Hubs and Authorities (1) • For each page p, we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$: $a_p = \sum_{q:\, q \to p} h_q$ (1), $h_p = \sum_{q:\, p \to q} a_q$ (2) • Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n × n matrix whose (i, j)th entry is equal to 1 if page i links to page j, and is 0 otherwise • Define $a = (a_1, a_2, \ldots, a_n)$ and $h = (h_1, h_2, \ldots, h_n)$; then $a = A^T h$ (3), $h = A a$ (4)

Computing Hubs and Authorities (2) • Substituting (4) into (3) and (3) into (4), up to a normalizing constant: $a = A^T A\, a$ (5), $h = A A^T h$ (6) • Let $B = A^T A$ (7) • In other words, a is an eigenvector of B • B is the co-citation matrix: B(i, j) is the number of sites that jointly point to both i and j • B is symmetric and has n orthogonal unit eigenvectors
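
A small worked example of the co-citation matrix, assuming B is formed as AᵀA from the adjacency matrix A defined on the previous slide. The five-page graph and the function name `cocitation` are hypothetical.

```python
def cocitation(A):
    """Compute B = A^T A, so B[i][j] counts pages linking to both i and j."""
    n = len(A)
    return [
        [sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical graph: pages 0 and 1 both link to pages 2 and 3;
# page 1 also links to page 4.
A = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
B = cocitation(A)
```

In this example B[2][3] is 2 because pages 0 and 1 both cite pages 2 and 3, B[2][4] is 1 because only page 1 cites both 2 and 4, and each diagonal entry B[i][i] is the in-degree of page i.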

Computing Hubs and Authorities (3) • We initialize a(p) = h(p) = 1 for all p • We iterate the following operations: $a \leftarrow A^T h$, $h \leftarrow A a$ • And renormalize after each iteration

Computing Hubs and Authorities (4) • The eigenvectors of B are precisely the stationary points of this process. • a is the principal eigenvector of $A^T A$, and h is the principal eigenvector of $A A^T$. • The principal eigenvector represents the “densest cluster” within the focused subgraph. • By initializing a(p) = h(p) = 1, a will converge to the principal eigenvector of B. • Initializing differently may lead to convergence to a different eigenvector. • In practice, convergence is achieved after only 10-20 iterations.
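
A minimal sketch of the iteration, assuming the standard HITS updates (a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to) with L1 renormalization after each round. The graph and the function name `hits` are hypothetical.

```python
def hits(A, iterations=20):
    """Iterate a <- A^T h, then h <- A a, renormalizing to unit L1 norm."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        a_norm, h_norm = sum(a), sum(h)
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h

# Hypothetical graph: pages 0 and 1 act as hubs pointing to pages 2 and 3;
# page 1 also points to page 4.
A = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
authority, hub = hits(A)
```

Pages 2 and 3 end up with equal and highest authority weights, page 4 with a lower one, and page 1 is the stronger hub because it points to every authority.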

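PageRank Computing • u: a web page • v: a page that links to u • $B_u$: the set of pages that link to u • $N_v$: the number of out-links of page v • c: a factor for normalization (c < 1) • $R(u) = c \sum_{v \in B_u} R(v) / N_v$ (1) • Let A be a square matrix with rows and columns corresponding to web pages, with $A_{u,v} = 1/N_v$ if v links to u and 0 otherwise • If we treat R as a vector over web pages, then $R = cAR$ (2) • R is an eigenvector of A with eigenvalue 1/c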

HITS: Topic Distillation Process • Send the query to a text-based IR system and obtain the root set • Expand the root set by radius one to obtain an expanded graph • Run power iterations on the hub and authority scores together • Report the top-ranking authorities and hubs

The HITS algorithm • ‖h‖ and ‖a‖ are L1 vector norms

Relation between HITS, PageRank and LSI • HITS algorithm = running SVD on the hyperlink relation (source, target) • LSI algorithm = running SVD on the relation (term, document) • PageRank on the root set R gives the same ranking as the ranking of hubs given by HITS

PageRank vs HITS • PageRank advantages over HITS • Query-time cost is low (HITS computes an eigenvector for every query) • Less susceptible to localized link-spam • HITS advantages over PageRank • HITS ranking is sensitive to the query • HITS has the notion of hubs and authorities • PageRank: offline computing; focuses on authoritative pages; computes over all web pages • HITS: query-time computing; computes over the base-set pages • Topic-sensitive PageRank [Haveliwala, WWW11]: an attempt to make PageRank query-sensitive

Automatic Classification of Web Documents • Assign a class label to each document from a set of predefined topic categories • Based on a set of examples of preclassified documents • Example • Use Yahoo!'s taxonomy and its associated documents as training and test sets • Derive a Web document classification scheme • Use the scheme to classify new Web documents by assigning categories from the same taxonomy • Keyword-based document classification methods • Statistical models
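
A minimal sketch of one possible statistical, keyword-based classifier. The slides do not name a specific model, so this uses multinomial naive Bayes with add-one smoothing as an illustrative choice; the training texts, the category names, and the functions `train` and `classify` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Pick the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                count = word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical preclassified documents standing in for a taxonomy's pages.
examples = [
    ("stock market shares fall", "Business"),
    ("bank raises interest rates", "Business"),
    ("team wins football match", "Sports"),
    ("player scores in final match", "Sports"),
]
model = train(examples)
label = classify("football player injured in match", model)
```

The new document shares several keywords with the Sports examples and none with the Business ones, so `label` is `"Sports"`.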

Web Usage Mining • Mining Web log records to discover user access patterns of Web pages • Applications • Target potential customers for electronic commerce • Enhance the quality and delivery of Internet information services to the end user • Improve Web server system performance • Identify potential prime advertisement locations • Web logs provide rich information about Web dynamics • Typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp
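
A minimal sketch of extracting the three fields mentioned above from one log line, assuming the server writes the Common Log Format. The sample line, its URL, and the function name `parse_entry` are hypothetical.

```python
import re
from datetime import datetime

# Assumption: Common Log Format, e.g.
# host ident user [time] "METHOD url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<url>\S+) [^"]*" \d{3} (?:\d+|-)'
)

def parse_entry(line):
    """Return the IP address, requested URL, and timestamp, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "url": match.group("url"),
        "timestamp": datetime.strptime(
            match.group("time"), "%d/%b/%Y:%H:%M:%S %z"
        ),
    }

# Hypothetical log line, for illustration only.
line = (
    '192.0.2.7 - - [10/Oct/2008:13:55:36 -0700] '
    '"GET /courses/cap4770.html HTTP/1.0" 200 2326'
)
entry = parse_entry(line)
```

Lines that do not match the assumed format yield `None`, which makes it easy to drop malformed records during data cleaning.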

Techniques for Web usage mining • Construct a multidimensional view on the Weblog database • Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc. • Perform data mining on Weblog records • Find association patterns, sequential patterns, and trends of Web accessing • May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer • Conduct studies to • Analyze system performance and improve system design by Web caching, Web page prefetching, and Web page swapping
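
A minimal sketch of two of the analyses above, top-N counting and a very simple sequential pattern (consecutive page pairs per user), assuming the log has already been cleaned into time-ordered (user, page) records. The records and the functions `top_n` and `page_pairs` are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned log records: (user, page), already ordered by time.
records = [
    ("u1", "/home"), ("u1", "/courses"), ("u1", "/syllabus"),
    ("u2", "/home"), ("u2", "/courses"),
    ("u3", "/courses"), ("u3", "/syllabus"),
]

def top_n(records, field, n):
    """Return the n most frequent values of field 0 (user) or 1 (page)."""
    return Counter(r[field] for r in records).most_common(n)

def page_pairs(records):
    """Count consecutive page pairs within each user's browsing sequence."""
    sequences = {}
    for user, page in records:
        sequences.setdefault(user, []).append(page)
    pairs = Counter()
    for pages in sequences.values():
        pairs.update(zip(pages, pages[1:]))
    return pairs

top_page = top_n(records, 1, 1)
top_user = top_n(records, 0, 1)
pairs = page_pairs(records)
```

Frequent pairs such as `/home` followed by `/courses` are the kind of pattern that could motivate prefetching the second page when the first is requested.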

Mining the World-Wide Web • Design of a Web Log Miner • The Web log is filtered to generate a relational database • A data cube is generated from the database • OLAP is used to drill down and roll up in the cube • OLAM is used for mining interesting knowledge • [Figure: Web log → (1) data cleaning → database → (2) data cube creation → data cube → (3) OLAP → sliced and diced cube → (4) data mining → knowledge]
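
A minimal sketch of the data cube and OLAP steps, assuming the cleaned database holds one (page, hour, domain) row per request. The three dimensions, the sample rows, and the functions `build_cube`, `roll_up`, and `slice_cube` are hypothetical choices for illustration.

```python
from collections import Counter

DIMENSIONS = ("page", "hour", "domain")

def build_cube(rows):
    """Count requests for each (page, hour, domain) cell."""
    return Counter(rows)

def roll_up(cube, keep):
    """Aggregate away every dimension not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = Counter()
    for cell, count in cube.items():
        result[tuple(cell[i] for i in positions)] += count
    return result

def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return Counter({cell: c for cell, c in cube.items() if cell[i] == value})

# Hypothetical cleaned rows: (page, hour of day, visitor domain).
rows = [
    ("/home", 9, "edu"), ("/home", 9, "edu"), ("/home", 14, "com"),
    ("/courses", 9, "edu"), ("/courses", 14, "edu"),
]
cube = build_cube(rows)
by_page = roll_up(cube, ["page"])
by_domain = roll_up(cube, ["domain"])
morning = slice_cube(cube, "hour", 9)
```

Rolling up to a single dimension answers questions such as which page is requested most, while slicing on the hour isolates the traffic of one time period for further mining.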
