Web People Search via Connection Analysis

Web People Search via Connection Analysis Authors : Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, and Rabia Nuray-Turan From : IEEE Trans. on Knowledge and Data Engineering 2008 Presenter : 陳仲詠 Citation : 21 (Google Scholar)

Outline • 1. Introduction • 2. Overview of the approach • 3. Generating a graph representation • 4. Disambiguation algorithm • 5. Interpreting clustering results • 6. Related work • 7. Experimental Results • 8. Conclusions and Future work

Introduction (1/7) • Searching for web pages related to a person accounts for more than 5 percent of current Web searches [24]. • A search for a person such as “Andrew McCallum” returns pages relevant to any person with the name Andrew McCallum. [24] R. Guha and A. Garg, Disambiguating People in Search. Stanford Univ., 2004.

Introduction (2/7) • Assume (for now) that for each such web page, the search engine could determine which real entity (i.e., which Andrew McCallum) the page refers to. • The search engine could then provide clustered person search: the returned results are grouped into clusters, each associated with one real person.

Introduction (3/7) • The user can home in on the cluster of interest to her and get all pages in that cluster. • For example, only the pages associated with that Andrew McCallum.

Introduction (4/7) • In reality, it is not obvious that clustered search is indeed a better option than searching for people with keyword-based search. • If each cluster identified by the search engine corresponded to a single person, then the cluster-based approach would be a good choice.

Introduction (5/7) • The key issue is how well the clustering algorithm disambiguates the web pages of different namesakes.

Introduction (6/7) • 1. Develop a novel algorithm for disambiguating among people that have the same name. • 2. Design a cluster-based people search approach based on the disambiguation algorithm.

Introduction (7/7) • The main contributions of this paper are the following : • A new approach for Web People Search that shows high-quality clustering. • A thorough empirical evaluation of the proposed solution (Section 7), and • A new study of the impact on search of the proposed approach (Section 7.3).

Overview of the approach (1/4) • The processing of a user query consists of the following steps: • 1. User input: A user submits a query. • 2. Web page retrieval: The system retrieves a fixed number (top K) of relevant web pages.

Overview of the approach (2/4) • 3. Preprocessing : • TF/IDF: noun phrase identification. • Extraction: named entities (NEs) and Web-related information. • 4. Graph creation: The entity-relationship (ER) graph is generated from the extracted data.

Overview of the approach (3/4) • 5. Clustering: The result is a set of clusters of these pages, with the aim of grouping web pages by the real person they refer to.

Overview of the approach (4/4) • 6. Cluster processing: • Sketches : A set of keywords that represent the web pages within a cluster. • Cluster ranking. • Web page ranking. • 7. Visualization of results

Generating a graph representation (1/6) • Extracted from the web pages : • 1) entities • 2) relationships • 3) hyperlinks • 4) e-mail addresses

Generating a graph representation (2/6) • For example, a person “John Smith” might be extracted from two different web pages, Doc1 and Doc2. • A single “John Smith” node is created and connected to both Doc1 and Doc2, regardless of whether the two pages refer to the same person or to two different people.

Generating a graph representation (3/6)

Generating a graph representation (4/6) • The relationship edges are typed. • Any hyperlinks and e-mail addresses extracted from the web page are handled in an analogous fashion.

Generating a graph representation (5/6) • A hyperlink has the form : $d_m.\cdots.d_2.d_1/p_1/p_2/\cdots/p_n$ • For example, for the URL www.cs.umass.edu/~mccallum/ we have d3 = cs, d2 = umass, d1 = edu, and p1 = ~mccallum.
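The decomposition in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `decompose_url` is hypothetical, and dropping a leading "www" label is an assumption inferred from the slide's example, not a rule stated in the slides.

```python
from urllib.parse import urlparse


def decompose_url(url):
    """Split a URL into domain labels d1..dm (right to left) and path parts p1..pn."""
    # urlparse only fills in netloc when the URL carries a scheme or a leading "//".
    if "//" not in url:
        url = "//" + url
    parsed = urlparse(url)
    labels = parsed.netloc.lower().split(".")
    # Assumption: a leading "www" label is ignored, as the slide's example suggests.
    if labels and labels[0] == "www":
        labels = labels[1:]
    # d1 is the rightmost label (e.g. "edu"), d2 the next one to its left, and so on.
    domains = {f"d{i}": label for i, label in enumerate(reversed(labels), start=1)}
    # Path components are numbered left to right; empty pieces are skipped.
    parts = [part for part in parsed.path.split("/") if part]
    paths = {f"p{i}": part for i, part in enumerate(parts, start=1)}
    return domains, paths


domains, paths = decompose_url("www.cs.umass.edu/~mccallum/")
print(domains, paths)
```

Each `d` and `p` component can then become a node in the graph, so that two pages sharing a domain or path prefix end up connected.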

Generating a graph representation (6/6)

Disambiguation algorithm • 1. Take the entity-relationship graph as input. • 2. Use a Correlation Clustering (CC) algorithm to cluster the pages. • 3. The outcome is a set of clusters, with each cluster corresponding to a person.

Disambiguation algorithm : Correlation Clustering (1/3) • CC has been applied in the past to group documents of the same topic and to other problems. • It assumes that there is a similarity function s(u, v) learned on the past data. • Each (u, v) edge is assigned a “+” (similar) or “-” (different) label, according to the similarity function s(u, v).

Disambiguation algorithm : Correlation Clustering (2/3) • The goal is to find the partition of the graph into clusters that agrees the most with the assigned labels. • CC does not take k (the number of resulting clusters) as an input parameter.

Disambiguation algorithm : Correlation Clustering (3/3) • Formally, the goal of CC is stated in one of two ways : • maximize the agreement, or • minimize the disagreement. • The CC problem is known to be NP-hard.
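To make "agreement" concrete, the sketch below counts how many labeled edges a given partition agrees with: a "+" edge agrees when both endpoints share a cluster, a "-" edge agrees when they do not. The edge labels and the two candidate partitions are made-up toy data, and the function name `agreement` is hypothetical.

```python
def agreement(labels, assignment):
    """Count labeled edges that agree with a partition.

    labels: dict mapping an edge (u, v) to "+" or "-".
    assignment: dict mapping each node to its cluster id.
    """
    agree = 0
    for (u, v), sign in labels.items():
        same_cluster = assignment[u] == assignment[v]
        # "+" wants the endpoints together, "-" wants them apart.
        if (sign == "+") == same_cluster:
            agree += 1
    return agree


# Toy data: four pages with pairwise labels.
labels = {
    ("a", "b"): "+", ("b", "c"): "+", ("a", "c"): "-",
    ("a", "d"): "-", ("b", "d"): "-", ("c", "d"): "-",
}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
split = {"a": 0, "b": 0, "c": 0, "d": 1}

for name, partition in [("one_cluster", one_cluster), ("split", split)]:
    agree = agreement(labels, partition)
    print(name, "agreement:", agree, "disagreement:", len(labels) - agree)
```

Because agreement plus disagreement always equals the number of labeled edges, the partition that maximizes one also minimizes the other; the hard part is searching the space of partitions, which this sketch does not attempt.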

Disambiguation algorithm : Connection Strength (1/6) • Use the notion of the connection strength c(u, v) between two objects u and v to define the similarity function s(u, v). • The disambiguation algorithm is based on analyzing : • object features and • the ER graph for the data set.

Disambiguation algorithm : Connection Strength (2/6) • A path between u and v semantically captures interactions between them via intermediate entities. • If the combined attraction of all these paths is sufficiently large, the objects are likely to be the same.

Disambiguation algorithm : Connection Strength (3/6) • Analyzing paths : • The assumption is that each path between two objects carries a certain degree of attraction.

Disambiguation algorithm : Connection Strength (4/6) • The attraction between two nodes u and v via paths is measured using the connection strength measure c(u, v). • It is defined as the sum of the attractions contributed by each path : $c(u, v) = \sum_{p \in P_{uv}} w_p$

Disambiguation algorithm : Connection Strength (5/6) • Puv denotes the set of all L-short simple paths between u and v. • A path is L-short if its length does not exceed L, and simple if it does not contain duplicate nodes. • wp denotes the weight contributed by path p. • The weight that path p contributes is derived from the type of that path.

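Disambiguation algorithm : Connection Strength (6/6) • Let Puv consist of c1 paths of type 1, c2 paths of type 2, . . . , cn paths of type n. • Since the weight of a path depends only on its type, the sum over paths becomes $c(u, v) = c_1 w_1 + c_2 w_2 + \cdots + c_n w_n$, where wi is the weight of path type i.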
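A minimal sketch of this computation follows: it enumerates the L-short simple paths between two nodes with a depth-first search and sums a per-type weight for each. Two things are assumptions made only for the illustration: a path's type is taken to be the sequence of edge types along it, and the toy graph and weights are invented.

```python
def l_short_simple_paths(graph, u, v, max_len):
    """Yield every simple path from u to v with at most max_len edges.

    graph: dict mapping a node to a list of (edge_type, neighbor) pairs.
    Each path is returned as a list of (edge_type, node) steps.
    """
    def dfs(node, visited, steps):
        if node == v:
            yield list(steps)
            return
        if len(steps) == max_len:
            return
        for edge_type, nxt in graph.get(node, []):
            if nxt not in visited:  # "simple": never revisit a node
                visited.add(nxt)
                steps.append((edge_type, nxt))
                yield from dfs(nxt, visited, steps)
                steps.pop()
                visited.remove(nxt)

    yield from dfs(u, {u}, [])


def connection_strength(graph, u, v, max_len, type_weights):
    """c(u, v): sum of the weights of all L-short simple paths between u and v."""
    total = 0.0
    for path in l_short_simple_paths(graph, u, v, max_len):
        # Assumption: the path type is the tuple of edge types along the path.
        path_type = tuple(edge_type for edge_type, _ in path)
        total += type_weights.get(path_type, 0.0)
    return total


# Toy graph: two pages that share one person node and one organization node.
graph = {
    "doc1": [("person", "X"), ("org", "O")],
    "doc2": [("person", "X"), ("org", "O")],
    "X": [("person", "doc1"), ("person", "doc2")],
    "O": [("org", "doc1"), ("org", "doc2")],
}
type_weights = {("person", "person"): 0.5, ("org", "org"): 0.25}
print(connection_strength(graph, "doc1", "doc2", 2, type_weights))
```

Exhaustive enumeration grows quickly with L, which is one reason the path length is capped.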

Disambiguation algorithm : Similarity Function (1/4) • The goal is to design a powerful similarity function s(u, v) that minimizes mislabeling of the data. • The function s(u, v) should be flexible enough to tune itself automatically to the particular domain being processed.

Disambiguation algorithm : Similarity Function (2/4) • The similarity function s(u, v) labels data by comparing the s(u, v) value against the threshold γ. • With the δ-band (“clear margin”) approach, the edge (u, v) is labeled only when s(u, v) falls clearly outside the δ-band around γ. • This avoids committing to a + or - decision when there is not enough evidence for it.
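The labeling rule can be sketched as below. The slides do not give the exact inequalities, so the symmetric band [γ - δ, γ + δ] and the treatment of the boundaries are assumptions; `label_edge` is a hypothetical name.

```python
def label_edge(s, gamma, delta):
    """Label an edge from its similarity value s using a delta-band around gamma.

    Returns "+" or "-" when s is clearly above or below the threshold,
    and None when s falls inside the band (no commitment either way).
    Assumption: the band is symmetric, from gamma - delta to gamma + delta.
    """
    if s >= gamma + delta:
        return "+"
    if s <= gamma - delta:
        return "-"
    return None


for s in (0.9, 0.55, 0.2):
    print(s, label_edge(s, gamma=0.5, delta=0.1))
```

Edges that come back as `None` simply carry no label into the clustering step, so they neither pull pages together nor push them apart.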

Disambiguation algorithm : Similarity Function (3/4) • Employs the standard TF/IDF scheme to compute its feature-based similarity f(u, v). • Noun phrases • Larger terms • The entire document corpus consists of K documents • N distinct terms T = {t1, t2, . . . , tN}.

Disambiguation algorithm : Similarity Function (4/4) • Each document u is represented as a vector u = (wu1, wu2, . . . , wuN) : • wui is the weight of term ti in document u.
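As a rough illustration of a TF/IDF feature similarity, the sketch below weights each term by its raw count times log(K / document frequency) and compares two documents by the cosine of their weight vectors. The slides only say "standard TF/IDF", so this particular weighting variant and the use of cosine similarity are assumptions, and the token lists are toy data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    k = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: weight = raw term count * log(K / df).
        vectors.append({t: tf[t] * math.log(k / df[t]) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    ["machine", "learning", "umass"],
    ["machine", "learning", "crf"],
    ["hockey", "coach", "team"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the first two documents share weighted terms and get a positive similarity, while the third shares none with the first.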

Disambiguation algorithm : Training the Similarity Function (1/2) • For each (u, v) edge, require that : • In practice, s(u, v) is unlikely to be perfect, which shows up as violations of the inequalities in (5) for some of the (u, v) edges. • This can be resolved by adding slack to each inequality in (5).

Disambiguation algorithm : Training the Similarity Function (2/2) • The task then becomes solving the linear programming problem (6) to determine the optimal values for the path type weights w1, w2, …, wn and the threshold γ.

Disambiguation algorithm : Choosing Negative Weight (1/7) • A CC algorithm will assign an entity u to a cluster if the number of positive edges between u and the other entities in the cluster outnumbers that of the negative edges. • That is, more than half (i.e., 50 percent) of those edges must be positive.

Disambiguation algorithm : Choosing Negative Weight (2/7) • Suppose instead that 25 percent of positive edges should be sufficient to keep an entity in a cluster. • Using the weight w+ = +1 for all positive edges and w- = -1/3 for all negative edges achieves the desired effect.
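The 25 percent figure follows from a short calculation: if a fraction p of an entity's edges into the cluster are positive, the total evidence is proportional to p·w+ + (1 - p)·w-, which is zero exactly when p = -w- / (w+ - w-). The sketch below checks this with exact fractions; the function name is hypothetical.

```python
from fractions import Fraction


def break_even_positive_fraction(w_pos, w_neg):
    """Fraction p of positive edges at which p * w_pos + (1 - p) * w_neg == 0."""
    return -w_neg / (w_pos - w_neg)


# Unweighted case: w+ = +1, w- = -1 gives the 50 percent threshold.
print(break_even_positive_fraction(Fraction(1), Fraction(-1)))
# Softer negative weight: w+ = +1, w- = -1/3 gives the 25 percent threshold.
print(break_even_positive_fraction(Fraction(1), Fraction(-1, 3)))
```

Shrinking the magnitude of w- therefore lowers the share of positive edges needed to keep an entity in a cluster, and w- = 0 removes the requirement entirely.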

Disambiguation algorithm : Choosing Negative Weight (3/7) • One solution for choosing a good value for the weight of negative edges w- is to learn it on past data. • Let n be the number of namesakes in the top k web pages. • If n = 1, it is best to choose w- = 0. • All pairs connected via positive edges will then be merged.

Disambiguation algorithm : Choosing Negative Weight (4/7) • If n = k, it is best to choose w- = -1. • This would produce maximum negative evidence for pairs not to be merged. • Hence the weight should depend on n : w- = w-(n)

Disambiguation algorithm : Choosing Negative Weight (5/7) • This observation raises two issues : • 1) n is not known to the algorithm beforehand. • 2) How to choose the w-(n) function.

Disambiguation algorithm : Choosing Negative Weight (6/7) • Since n is not known, compute its estimated value ^n by running the disambiguation algorithm with a fixed value of w-. • The algorithm outputs a certain number of clusters ^n, which can be used as an estimate of n.

Disambiguation algorithm : Choosing Negative Weight (7/7) • The value of w-(^n) : • when ^n < threshold, w-(^n) = 0. • when ^n > threshold, w-(^n) = -1. • This threshold is learned from the data.
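The two-pass procedure from the last two slides can be sketched as follows. The clustering routine is passed in as `cluster_fn`, a hypothetical stand-in for the real disambiguation algorithm, and `toy_cluster` is a stub that exists only so the example runs. The slides do not say what happens when ^n equals the threshold; treating that case as w- = -1 is an assumption.

```python
def choose_negative_weight(n_hat, threshold):
    """Step function w-(n_hat) from the slide.

    Assumption: n_hat == threshold is treated like n_hat > threshold.
    """
    return 0.0 if n_hat < threshold else -1.0


def two_pass_disambiguation(pages, cluster_fn, initial_w_neg, threshold):
    """Estimate the number of namesakes, then re-cluster with w-(n_hat)."""
    # Pass 1: cluster with a fixed negative weight to estimate n.
    first_pass = cluster_fn(pages, initial_w_neg)
    n_hat = len(first_pass)
    # Pass 2: pick the negative weight from the estimate and cluster again.
    w_neg = choose_negative_weight(n_hat, threshold)
    return cluster_fn(pages, w_neg), w_neg


def toy_cluster(pages, w_neg):
    """Stub clusterer (not the real algorithm): splits everything apart
    when the negative weight is strong, otherwise merges everything."""
    if w_neg <= -0.5:
        return [[page] for page in pages]
    return [list(pages)]


pages = ["p1", "p2", "p3", "p4"]
clusters, w_neg = two_pass_disambiguation(pages, toy_cluster, -1 / 3, threshold=2)
print(w_neg, clusters)
```

The fixed first-pass weight and the threshold are the two values that would have to be chosen or learned from data in a real system.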

A brief Summary

Interpreting Clustering Results (1/4) • This section describes how these clusters are used to build people search. • The goal is to provide the user with a set of clusters, each associated with a real person. • 1. Rank the clusters. • 2. Provide a summary description with each cluster.

Interpreting Clustering Results (2/4) • Cluster rank : • Select the highest-ranked page in each cluster and use it to rank the cluster. • Cluster sketch : • The set of terms above a certain threshold is selected and used as a summary for the cluster.

Interpreting Clustering Results (3/4) • Web page rank : • These pages are displayed according to their original search engine order.
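A small sketch of the two ranking rules: pages inside a cluster keep their original search engine order, and clusters are ordered by their best-placed page. The data layout (lists of `(search_rank, page)` pairs with rank 1 as the top hit) and the page names are assumptions made for the illustration.

```python
def order_results(clusters):
    """Order pages within each cluster, then order the clusters.

    clusters: list of non-empty lists of (search_rank, page) pairs,
    where search_rank 1 is the search engine's top result.
    """
    # Within a cluster: keep the original search engine order.
    ordered_pages = [sorted(cluster) for cluster in clusters]
    # Across clusters: the cluster whose best page ranks highest comes first.
    return sorted(ordered_pages, key=lambda cluster: cluster[0][0])


clusters = [
    [(4, "page_d"), (2, "page_b")],
    [(3, "page_c"), (1, "page_a")],
    [(5, "page_e")],
]
for cluster in order_results(clusters):
    print(cluster)
```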

Interpreting Clustering Results (4/4) • Affinity to cluster : • Defined as the sum of the similarity values between the page p and each page v in the cluster C : $\mathrm{affinity}(p, C) = \sum_{v \in C} s(p, v)$ • The remaining pages are also displayed, so the user has the option to get to these web pages too.
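The affinity definition translates directly into code. In the sketch below the similarity function is supplied by the caller; the lookup table of similarity values and the page names are toy data, and using affinity to order the remaining pages is an assumption about how the value is used.

```python
def affinity(page, cluster, sim):
    """Sum of the similarity values between `page` and each page in `cluster`."""
    return sum(sim(page, v) for v in cluster)


# Toy similarity values between two remaining pages and the pages of one cluster.
sim_table = {
    ("p", "a"): 0.5, ("p", "b"): 0.25,
    ("q", "a"): 0.125, ("q", "b"): 0.125,
}


def sim(u, v):
    return sim_table.get((u, v), 0.0)


cluster = ["a", "b"]
remaining = ["q", "p"]
# Assumed use: show the remaining pages with the highest affinity first.
ranked = sorted(remaining, key=lambda page: affinity(page, cluster, sim), reverse=True)
print(ranked)
```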
