NAYOSE: A System for Reference Disambiguation of Proper Nouns Appearing on Web Pages

1. NAYOSE: A System for Reference Disambiguation of Proper Nouns Appearing on Web Pages
AIRS2006, Oct. 16-18, 2006, Singapore. Shingo Ono, Minoru Yoshida, and Hiroshi Nakagawa, The University of Tokyo, Japan. Hi, I'm Shingo Ono from the University of Tokyo. I will be talking about "NAYOSE: a system for reference disambiguation of proper nouns appearing on Web pages."
2. Table of Contents
Motivation, NAYOSE System, Results, Conclusion. This is the outline of my talk. First, I am going to talk about the motivation. Then I will talk about the baseline system. Next, I will show you our NAYOSE system, together with two algorithms that improve its performance. After that, I will show you the results of our evaluation. Finally, I will talk about the conclusion and future work.
3. Motivation
The motivation of this study is as follows. Have you ever had trouble when you used a common person or place name as a query in a search engine?
[Screenshot: search results for "Shingo Ono", with entries labeled Doctor, Baseball Player, Foreign student, and Me.]
4. This screenshot shows, as an example, the search results for my name. The results include several people named Shingo Ono, and it is hard to guess how many different people exist in the real world. In this example, each person is a different entity, and there are four people named Shingo Ono.
5. Problem and our solution
When different real-world entities have the same name, the reference from the name to the entity can be ambiguous. We considered the NAYOSE System on the Web. The system gives clusters of Web pages as its result. To solve this problem, we developed the NAYOSE System. "Nayose" means reference disambiguation in Japanese. The system provides clusters of Web pages, which makes a target page easy to access.
6. Our System
[Figure: query, search engine, search result, a set of page clusters.] Every page in a cluster refers to the same entity. This figure shows an image of the NAYOSE System. The input of the system is the result of a search engine, as I showed on a previous slide, and the output is a set of page clusters. Each cluster contains Web pages that all refer to the same entity.
7. Related Works
Use information from documents:
[Bagga and Baldwin, 1998]: naive VSM
[Mann and Yarowsky, 2003]: biographic data
[Niu et al., 2004]: personal information
[Wan et al., 2005]: middle-name oriented
Use information from Web structure:
[Bekkerman and McCallum, 2005]: link structure and double clustering
Related work is as follows. Several studies have tried to solve the reference disambiguation problem. Bagga and Baldwin applied the vector space model to calculate similarity between names using only co-occurring words. Building on this, Mann and Yarowsky presented a clustering method using extracted biographic data, and Niu et al. presented one using extracted personal information. However, these methods had only been tested on small artificial test data. Wan et al. proposed a system called WebHawk that rebuilds search results for person names, and, like ours, it aimed at practical use. However, its algorithm was specialized for middle names, so it would not be suitable for other types of person names or for place names. As another approach, Bekkerman and McCallum proposed two methods of finding Web pages that refer to a particular person, based on two distinct mechanisms: link structure analysis and agglomerative/conglomerative double clustering. However, they focused on disambiguating an existing social network of people, which is not the case when searching for people in reality. In addition, in our experience, the number of links between Web pages was smaller than we expected. Therefore, information on link structures would be difficult to use for our task.
8. Baseline System
We first implemented a simple system as our baseline. All we need to do is cluster Web pages, so we adopted the Bag of Words (BoW) model for calculating similarity between Web pages and agglomerative hierarchical clustering for clustering. We evaluated this trial system and found that the F-measure was lower than 0.5. We considered that this is because the Bag of Words model has shortcomings and is not sufficient for our task.
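The baseline can be illustrated with a minimal sketch. The talk does not say how pages were tokenized, which linkage was used, or how the clustering was stopped, so whitespace tokenization, cosine similarity, average linkage, and the `threshold` stop rule below are all assumptions made only for illustration.

```python
import math
from collections import Counter

def bow_vector(text):
    """Bag of Words: word -> frequency (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def agglomerative(vectors, threshold):
    """Average-link agglomerative clustering over page indices.

    Merging stops when the best pair of clusters has average similarity
    below `threshold` (linkage and stop rule are assumptions).
    """
    clusters = [[i] for i in range(len(vectors))]

    def link(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy pages (invented for illustration only).
pages = [
    "shingo ono university of tokyo natural language processing",
    "ono shingo tokyo university language processing lab",
    "shingo ono baseball player pitcher season",
]
vectors = [bow_vector(p) for p in pages]
clusters = agglomerative(vectors, threshold=0.5)
```

In this toy data the first two pages merge and the third stays alone. The query words themselves appear in every page and still count toward the similarity, which hints at why word frequency alone can be a weak signal for this task.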
9. Proposed Methods
The shortcoming of the Bag of Words model is as follows. It focuses only on word frequency and ignores other useful information in the document, such as word positions and word meanings. So we proposed two new methods to overcome this shortcoming. One is Local Context Matching, which uses the positions of words. The other is Named Entities Matching, which uses the meanings of words.
10. Local Context Matching (LC)
Use of word positions. Supposition: related words occur near the target name. Focus on the words near the query string. Local Context Matching deals with the shortcoming of the Bag of Words model by giving nearby words higher scores than others. We supposed that words occurring near the name are more relevant than other words. For example, in the case of a person's name, personal data such as age, affiliation, and position will appear near the name. So Local Context Matching focuses on the words near the query string.
11. Algorithm for Local Context Matching
(Extraction) For each document, do the following. Find all positions at which the query string q appears. For each position, put the words that fall within a range around it into the document's nearby word set. Remove stop words from the nearby word set. [Figure: the words surrounding each occurrence of the query are extracted and put into the nearby word set.] This is the algorithm for Local Context Matching. The algorithm consists of two steps, extraction and calculation. The first step, extraction, is as follows. First, find all positions at which the query string appears. Next, extract the words located near the query string and put them into the nearby word set. Then remove meaningless words such as "name", "Mr.", and "Ms." from the nearby word set.
12. Algorithm for Local Context Matching
[Figure: the words surrounding each occurrence of the query are extracted and put into the nearby word set.] (Calculation) For all document pairs, calculate the LC similarity [formula not preserved in this transcript]. If the similarity passes the threshold, regard the query string appearing on the two pages as referring to the same entity. The second step is calculation. For all document pairs, calculate the Local Context similarity. Finally, determine by the threshold whether or not the two pages refer to the same entity.
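The two steps can be sketched as follows. The similarity formula and the window size are not preserved in this transcript, so the size of the overlap between nearby word sets, the `window` parameter, and the single-token query are assumptions chosen only to make the extraction and calculation steps concrete.

```python
def nearby_words(tokens, query, window, stop_words):
    """Extraction: collect words within `window` tokens of each query occurrence.

    `window` and treating the query as one token are assumptions.
    """
    near = set()
    for pos, tok in enumerate(tokens):
        if tok == query:
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            near.update(tokens[lo:hi])
    near.discard(query)           # the query itself carries no evidence
    return near - stop_words      # drop stop words such as "mr"

def lc_similarity(near_x, near_y):
    """Calculation: overlap size of two nearby word sets (assumed stand-in measure)."""
    return len(near_x & near_y)

# Toy example (invented for illustration only).
stop_words = {"mr", "is", "a", "at", "the", "of"}
page_a = "mr ono is a professor at tokyo university and studies nlp".split()
page_b = "the professor ono of tokyo gave a lecture".split()
near_a = nearby_words(page_a, "ono", window=5, stop_words=stop_words)
near_b = nearby_words(page_b, "ono", window=5, stop_words=stop_words)
same_entity = lc_similarity(near_a, near_b) >= 2   # threshold value is an assumption
```

Here both pages place "professor" and "tokyo" near the name, so the pair passes the assumed threshold of 2, while words far from the name, such as "nlp", are ignored.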
13. Clustering Algorithm
An edge exists if and only if the two pages are to be in the same cluster. Clustering for Local Context Matching is done as follows. Consider an undirected graph in which each vertex corresponds to a Web page. The edge set is given by Local Context Matching. Each edge means that the two pages are in the same cluster.
14. Clustering Algorithm
Each connected component is one cluster of Web pages. The graph then contains some connected components, and each connected component corresponds to one cluster of Web pages. This completes clustering by Local Context Matching.
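Reading clusters off the graph amounts to finding its connected components. The sketch below uses a union-find structure for this; the function name and the representation of pages as integer indices are illustrative choices, not taken from the talk.

```python
def connected_components(n_pages, edges):
    """Group page indices 0..n_pages-1 into clusters, one per connected component."""
    parent = list(range(n_pages))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # merge the two components

    groups = {}
    for page in range(n_pages):
        groups.setdefault(find(page), []).append(page)
    return sorted(groups.values())

# Pages 0-1 and 1-2 were matched, as were pages 3-4 (toy edge set).
clusters = connected_components(5, [(0, 1), (1, 2), (3, 4)])
```

Pages 0 and 2 end up in the same cluster even though they were never matched directly, because both are linked to page 1. This transitivity is a consequence of defining clusters as connected components.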
15. Named Entities Matching (NE)
Use of word meanings. Named Entities (NEs) are generally more discriminating than general words. Focus on the NEs that co-occur with the query string: co-occurring NEs must be related to the query string. The other proposed method is Named Entities Matching. We attempted to identify the same entity by using Named Entities. For example, if the target person name co-occurs with another person name on many Web pages, we can determine that all of these occurrences of the target name refer to the same entity.
16. Algorithm for Named Entities Matching
[Figure: the Named Entities that co-occur with the query are extracted from each page.] For all document pairs, calculate the NE similarity from the number of person names appearing in both pages and the number of place names appearing in both pages [formula not preserved in this transcript]. If the similarity passes the threshold, the query string appearing on the two pages refers to the same entity. Clustering is done in the same way as for LC. The algorithm for Named Entities Matching is as follows. It is similar to Local Context Matching. First, extract person names and place names with a Named Entity tagger. Next, calculate the Named Entities similarity from the number of person names and place names that appear in both Web pages. Finally, determine by the threshold whether or not the two pages refer to the same entity, and do clustering in the same way as for Local Context Matching.
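A small sketch of the pairwise step follows. The talk says only that the similarity is computed from the counts of shared person names and shared place names; combining them as a weighted sum, the weights themselves, and the threshold value are assumptions. The tagger output is represented here by hand-written sets.

```python
def ne_similarity(ne_x, ne_y, person_weight=1.0, place_weight=1.0):
    """Weighted count of person and place names shared by two pages.

    The weighted-sum form and both weights are assumptions.
    """
    shared_persons = len(ne_x["person"] & ne_y["person"])
    shared_places = len(ne_x["place"] & ne_y["place"])
    return person_weight * shared_persons + place_weight * shared_places

def ne_edges(pages, threshold):
    """Return an edge (x, y) for every page pair whose similarity reaches `threshold`."""
    edges = []
    for x in range(len(pages)):
        for y in range(x + 1, len(pages)):
            if ne_similarity(pages[x], pages[y]) >= threshold:
                edges.append((x, y))
    return edges

# Stand-in for NE tagger output, excluding the query name itself (toy data).
pages = [
    {"person": {"Minoru Yoshida", "Hiroshi Nakagawa"}, "place": {"Tokyo"}},
    {"person": {"Hiroshi Nakagawa"}, "place": {"Tokyo", "Singapore"}},
    {"person": {"Ichiro Suzuki"}, "place": {"Osaka"}},
]
edges = ne_edges(pages, threshold=2)
```

The resulting edge list has the same shape as the one produced by Local Context Matching, so the same connected-component step can turn it into clusters.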
17. Filtering Junk Pages
There are meaningless pages on the Web, and they cause errors in reference disambiguation. I have talked about two methods for reference disambiguation. Now I would like to talk about filtering Web pages. We found that meaningless pages on the Web cause mistakes in reference disambiguation.
18. For example, we found Web pages that list sports results. Such result pages contain little information and are difficult to use.
19. As another example, the same name on this page refers to multiple entities. Many people named Michael Jackson appear on this page, and each of them is a different entity. Such pages were beyond the scope of our task.
20. Filtering Junk Pages
There are meaningless pages on the Web, and they cause errors in reference disambiguation. We remove these junk pages with filtering rules.
21. Overview of NAYOSE System
[Figure: overview of the NAYOSE System. Components: user, interface, search engine, accessing Web pages that contain the query string, text preprocessing, Web page filtering, similarity calculation and clustering. Data: query, search result, URLs, available pages, result (clusters of pages).] I have talked about all the elements of our NAYOSE System. Now I will show you an overview of the system. It works as follows. After receiving a query from a user, the system first retrieves Web page URLs with a search engine and obtains the top k search results. Next, the system downloads all top k pages and executes preprocessing, junk page filtering, morphological analysis, and Named Entity tagging. After that, the system calculates the similarity between Web pages and does clustering. Finally, the system outputs the result, which is a set of clusters of Web pages.
22. Screenshots of NAYOSE System
Sorry, our system works only in Japanese at this time. Result of clustering for "Shingo Ono". Execution time: about 5 seconds.
23. Screenshots of NAYOSE System
Sorry, our system works only in Japanese at this time. A cluster in which all pages refer to me: the Web page of our lab, a co-author's Web page, and Web pages about the department to which I belong.
24. Evaluation
Data set: each data set is composed of the top 100 to 200 results from search engines. We collected 3859 pages for 37 queries (28 person names and 9 place names). We annotated the data set by hand. We did not use an artificial data set, but a real-world data set. Now I will talk about our evaluation. First, I will explain our data set. As far as we know, no gold standard for the task has yet been proposed, so we developed our own test set. We first input Japanese person-name queries and Japanese place-name queries into a search engine and collected roughly the top one hundred to two hundred Web pages for each. As a result, we collected about three thousand eight hundred Web pages for thirty-seven queries. These pages were manually annotated, which gave us a real-world data set.
25. Evaluation
Purpose: which method (BoW, LC, NE, or a combination of them) is the best? Metrics: Precision (P), Recall (R), and F-measure (F), calculated as in [Larsen and Aone, 1999]. The purpose of the evaluation was to investigate which method was the best among Bag of Words, Local Context Matching, Named Entities Matching, and their combinations. We used Precision, Recall, and F-measure as the metrics in our evaluation. The evaluation method follows that of Larsen and Aone.
26. Results
These are the results of our evaluation. NE, LC, and their combination all showed higher performance than the baseline. The combination of Named Entities Matching and Local Context Matching outperformed the baseline by a significant 0.22 in overall F-measure.
27. Results of 26 person-name queries
28. Results of 9 place-name queries
29. Thank you!
That's all. Thank you.
30. (Appendix) How to do clustering when two or three methods were applied
In the case of combining BoW with NE/LC: the NE/LC methods were applied first, and BoW was then applied to the NE/LC result. In the case of combining NE and LC: clustering was done at the same time. Details are described on the next slide.
31. (Appendix) How to do clustering In the case of combination of NE and LC.
Edge set given by NE. Edge set given by LC. The result of the combination of NE and LC is given by the edge set [expression not preserved in this transcript].
32. Motivation
Have you ever had trouble when you used a common name as a query in a search engine? We can reach the target Web page, but this often forces us to do hard and time-consuming work. When different real-world entities (persons, places, organizations) have the same name, the reference from the name to the entity can be ambiguous. Of course, we can disambiguate the reference manually, but that is exactly the hard and time-consuming part.
33. How to calculate metrics
Correct grouping. Set of clustering results. For each ..., calculate ... as follows. [The symbols and formulas on this slide were not preserved in this transcript.]
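The evaluation slide cites Larsen and Aone (1999) for the metrics. As a rough illustration, the sketch below computes the clustering F-measure in the form usually attributed to that paper: for each correct group, take the best F over all output clusters, then average weighted by group size. Whether the slide's lost formulas match this form exactly is an assumption, and the function and variable names are illustrative.

```python
def clustering_f_measure(gold, predicted):
    """Overall F: sum over gold groups of (group size / n) * best F against any cluster.

    `gold` and `predicted` are lists of sets of page ids.
    """
    n = sum(len(g) for g in gold)
    total = 0.0
    for g in gold:
        best = 0.0
        for c in predicted:
            overlap = len(g & c)
            if overlap == 0:
                continue
            precision = overlap / len(c)   # share of the cluster that belongs to g
            recall = overlap / len(g)      # share of g recovered by the cluster
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        total += (len(g) / n) * best
    return total

# Toy example: page 2 was put in the wrong cluster.
gold = [{0, 1, 2}, {3, 4}]
predicted = [{0, 1}, {2, 3, 4}]
f = clustering_f_measure(gold, predicted)
```

In the toy example the first group is best matched by a cluster with precision 1 and recall 2/3, the second by a cluster with precision 2/3 and recall 1, so both contribute an F of 0.8.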
34. Definition of Junk Pages
We defined junk pages as:
J1. The page has disappeared from the Web.
J2. The page does not contain the query string.
J3. Most of the page is occupied by enumerations of names or numbers.
J4. The same name on the page refers to multiple entities.
J1 and J2 pages have no information about the query string. J3 pages are hard to use. J4 pages are beyond the scope of our task. We remove junk pages with filtering rules.
As I showed, there are meaningless pages on the Web, and we aimed to remove these junk pages with filtering rules. First, we defined four types of junk pages. J1 is a page that has disappeared from the Web, and J2 is a page that does not contain the query string. These two types of pages have no information about the query string. Next, J3 is a page most of which is occupied by enumerations of names or numbers. This type corresponds to list pages such as sports results and is hard to use. Finally, J4 is a page on which the same name refers to multiple entities, as in the example of the Wikipedia page for Michael Jackson. These pages are beyond the scope of our task.
35. Filtering Rules
F1. The URL of the Web page contains Japanese characters.
F2. The title contains the string "search result".
F3. Named entities appear too frequently.
F4. There is no string corresponding to the query.
F1 and F2 remove pages that are beyond the scope of our task. F3 labels a page as "J3", and information about Named Entities is not used for it. F4 removes pages with no information about the query.
To deal with these junk pages, we defined four filtering rules. Rule F1 is that the URL contains Japanese characters, and Rule F2 is that the title contains the string "search result". These rules aim to remove pages with multiple entities of the same name. Rule F3 is that Named Entities appear too frequently. Pages satisfying this rule are labeled "J3" and treated as special cases. Rule F4 is that there is no string corresponding to the query. This rule aims to remove pages with no information about the query.
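The four rules can be sketched as a single function. Several details are assumptions made for illustration: the Unicode ranges used to detect Japanese characters, the English title string (the real system worked on Japanese pages), the whitespace word count, and the `ne_ratio_limit` cutoff standing in for "too frequently", which the talk does not quantify.

```python
import re

# Hiragana, katakana, and common CJK ideographs (range choice is an assumption).
JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def apply_filters(url, title, text, query, ne_count, ne_ratio_limit=0.5):
    """Return "remove", "J3", or "keep" for one page.

    `ne_count` is the number of named entities a tagger found in `text`;
    `ne_ratio_limit` is a hypothetical cutoff for rule F3.
    """
    if query not in text:                          # F4: no string matching the query
        return "remove"
    if JAPANESE.search(url):                       # F1: Japanese characters in the URL
        return "remove"
    if "search result" in title.lower():           # F2: title looks like a result list
        return "remove"
    words = text.split()
    if words and ne_count / len(words) > ne_ratio_limit:
        return "J3"                                # F3: keep the page, ignore its NEs
    return "keep"

text = "Shingo Ono is a student"
kept = apply_filters("http://example.com/a", "Lab members", text, "Shingo Ono", 1)
by_url = apply_filters("http://example.com/検索", "Lab members", text, "Shingo Ono", 1)
by_title = apply_filters("http://example.com/a", "Search result for Ono", text, "Shingo Ono", 1)
listing = apply_filters("http://example.com/a", "Results", text, "Shingo Ono", 4)
```

Only F3 returns a label rather than a removal, which mirrors the slide: such pages stay in the data but their Named Entity information is not used.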
36. Task Definition
Given a query, collect Web pages that contain the query string. Output a set of page clusters in which every page of a cluster refers to the same entity. Note: we assumed that all occurrences of the query string on the same Web page refer to the same entity. Accept person names and place names. Do not require any knowledge about the query. This slide shows the definition of our task. First, the system retrieves a set of Web pages for the given query. After that, the system outputs clusters of Web pages, and each page in a cluster refers to the same entity. This assumption makes the problem simpler and keeps the system from becoming complicated. Pages that violate this assumption were treated as junk pages, and I will talk about the details later. The system accepts not only person names but also place names, and it does not require any knowledge about the query.
37. Algorithm for Named Entities Matching
Extract person names and place names with an NE tagger. For all document pairs, calculate the NE similarity from the number of person names appearing in both pages and the number of place names appearing in both pages [formula not preserved in this transcript]. If the similarity passes the threshold, regard the query string appearing on the two pages as referring to the same entity. Clustering is done in the same way as for LC.
