
Finding Authoritative People from the Web



  1. Finding Authoritative People from the Web Masanori Harada, Shin-ya Sato, Kazuhiro Kazama {harada,sato,kazama}@ingrid.org NTT Network Innovation Labs.

  2. Contents • Motivation • Why study finding people? • Examples • Approach • Extract personal names on the web • Find relevant people using a search engine • Results • Performance evaluation • Summary and future plans

  3. Background As the web is connected to the real world, we can: • Find real-world things by searching the web. • Understand the real world by investigating the web (and vice versa). [Diagram: the real world and the web, linked by connections; the web is explored by searching]

  4. Objective Find authoritative people for all sorts of topics by extending a web search engine • Why find people? • Once people have been found, many other things (e.g. books) can be retrieved using digital libraries • What is authoritative? • People mentioned in many web pages with regard to a queried topic

  5. Screenshot [Screenshot of the system, showing relevant personal names, their relationships, and relevant web pages]

  6. Example (1): subject → people • “digital libraries” (1007 pages) • Possible application: book finder • Using library catalogs, it could suggest relevant books written by these authoritative people • Shigeo Sugimoto Univ. Library Information Science • Koichi Tabata Univ. Library Information Science • Jun Adachi National Institute of Informatics • Takeo Yamamoto National Institute of Informatics • Hiroyuki Taya National Diet Library

  7. Example (2): thing → people • “Spirited Away” (35,936 pages) • Possible application: movie recommender • Using movie databases, it could suggest movies that share key people with any queried topic • Hayao Miyazaki director • Bunta Sugawara voice actor • Mari Natsuki voice actress • Yumi Kimura singer of theme song • Joe Hisaishi composer

  8. Example (3): person → people • “Masanori Harada” (205 pages) • Possible application: social networking • Unlike social networking services such as orkut, there is no need to enter relationships manually • Masanori Harada me • Shin-ya Sato co-author of this paper • Kazuhiro Kazama co-author of this paper • Kent Tamura web search researcher • Isao Asai web search researcher

  9. NEXAS Named entity extraction and association search • Associate an entity with a web page by extracting names that identify the entity. • Find entities associated with the top-ranked web pages retrieved for a query. [Diagram: entities A, B, C linked to web-search results ranked from more relevant (authoritative) through less relevant to irrelevant]

  10. Extracting personal names • Web data • 52 million Japanese web pages collected in July 2003 • Japanese personal name extractor • Extracts only full names • Assumes a full name can identify a person • Large name dictionaries enable accurate extraction • Precision: 93.5%, recall: 85.3%

  11. Personal names on the web • Personal names appear frequently on the web • 6.6M unique names extracted from 52M web pages • 1/4 of web pages contain full names • Celebrities appear >10,000 times • singers, actors, sports stars, novelists, politicians, etc. • Most names appear only a few times • Name frequency indicates popularity • But the number of pages is easily inflated by automatically generated text and spam • The number of servers is less affected • Authoritative people are mentioned on many servers

  12. Procedure for finding people 1. Find web pages using a full-text search engine 2. List personal names extracted from top T relevant and authoritative web pages 3. Calculate relevance scores and output top k relevant people, who • Appear frequently on top-ranked web pages • Do not appear frequently on irrelevant web pages
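A minimal sketch of this three-step procedure, with the search engine, the name extractor, and the scorer injected as callables; all three interfaces are illustrative assumptions, not the authors' code:

```python
def find_people(query, search, names_of, score, T=1000, k=10):
    """Steps 1-3 above with the dependencies injected.

    search(query, T) -> iterable of (url, server) for the top T pages
    names_of(url)    -> personal names pre-extracted from that page
    score(pages)     -> dict mapping name -> relevance score (slide 13)
    """
    # Step 1: full-text search; Step 2: collect names from the top T pages
    pages = [(server, names_of(url)) for url, server in search(query, T)]
    # Step 3: score names and output the top k relevant people
    scores = score(pages)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```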

  13. Calculating relevance score • Scoring functions: df, sf, dfidf, and sfidf • df = document frequency within the top T pages • sf = server frequency within the top T pages + 0.01·df • idf = log(N / fp), where N = number of pages in the collection and fp = the name's document frequency in the collection • dfidf = df·idf and sfidf = sf·idf • sf alleviates the effects of generated text and spam • idf down-weights generally famous people
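A small sketch of the four scoring functions, following the definitions above. The input format (a list of (server, names) pairs for the top T pages) and the collection-wide frequency table are assumptions made for illustration:

```python
import math
from collections import defaultdict

def score_names(top_pages, N, collection_df):
    """top_pages: [(server, names_on_page)] for the top T pages;
    N: pages in the whole collection; collection_df: name -> fp,
    assumed to cover every extracted name.
    Returns name -> {df, sf, dfidf, sfidf} per the slide's definitions."""
    df = defaultdict(int)        # pages in the top T mentioning the name
    servers = defaultdict(set)   # distinct servers mentioning the name
    for server, names in top_pages:
        for name in set(names):
            df[name] += 1
            servers[name].add(server)
    scores = {}
    for name, d in df.items():
        sf = len(servers[name]) + 0.01 * d      # sf = server freq + 0.01 df
        idf = math.log(N / collection_df[name])
        scores[name] = {"df": d, "sf": sf,
                        "dfidf": d * idf, "sfidf": sf * idf}
    return scores
```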

  14. Performance evaluation • Compared the 4 scoring functions for varying T • Precision: average of the relevance scores of the top k people • Judged whether each person was relevant (score 1) or not (score 0) by searching for the personal name with Google • 45 simple topics were used • 15 musical instruments: is the person a player of that instrument? • 15 sports: is the person a player of that sport? • 15 information technologies: is the person an expert in that technology?
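For concreteness, the precision measure described above (the mean of binary relevance judgments over the top k people) could be computed as below; is_relevant is a hypothetical oracle standing in for the manual Google-based judgment:

```python
def precision_at_k(ranked_names, is_relevant, k=10):
    """Mean of binary relevance scores (1 = relevant, 0 = not)
    over the top k returned people, as defined on this slide."""
    top = ranked_names[:k]
    if not top:
        return 0.0
    return sum(1 for name in top if is_relevant(name)) / len(top)
```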

  15. Precision of scoring functions • sf is very effective • idf is fairly effective [Plot: precision vs. number of people evaluated, comparing sfidf, sf, dfidf, and df]

  16. Precision with varying T • More web pages do not necessarily yield better results [Plot: precision vs. number of people evaluated for sfidf with T = 100, 200, 500, 1000, 2000, and 5000]

  17. Future work • Apply to other languages, especially English • Can we distinguish different people named “John Smith”? • Find books, companies, shops, etc. • By extracting ISBNs, domain names, place names, etc. • Analyze co-occurrences as a social network • Online demo is available at http://valhalla.ingrid.org:28080/ throughout the conference • * Japanese fonts and the Java 2 plug-in are required

  18. Precision vs. result size • Too-specific or too-general topics are difficult. [Plot: precision of the top 10 people (sfidf, T = 1000) vs. the number of pages retrieved for a topic, with “compiler theory” and “databases” marked as example topics]

  19. Popular Japanese names • singers, actors, sports stars, novelists, politicians, etc. [Table: top 10 most frequent names]

  20. Related work • ReferralWeb [Kautz 1997] • Finds experts around the user by searching the web • Tested only with computer-science topics • Web Question Answering [Kwok 2001] [Brill 2001] • Retrieves one exact answer to a long, complete natural-language question • Our contributions • Observed the distribution of personal names on the web • Extended a web search engine so that it accurately finds relevant people for all sorts of queries.

  21. Common failures • Too specific topics • Too general topics • Name extraction errors • Falsely extracted non-names • Missed (not extracted) names • Historical/Fictional characters • Celebrities • Popular names often appear without regard to a specific topic

  22. Numbers • Number of web pages 52,302,805 • Number of web servers 664,139 • Number of pages w/ names 13,922,012 • Number of name occurrences 117,091,977 • Number of unique names 6,161,805 • Total size of web pages 450GB • Size of inverted index 113GB • Size of dictionaries • Family names 21,141 • Personal names 12,130 • Full names 19,675

  23. Topics for the experiment • Musical instruments • violin, cello, trumpet, clarinet, harp, percussion, synthesizer, ocarina, accordion, contrabass, pipe organ, and marimba • Sports • soccer, baseball, marathon, swimming, rugby, volleyball, basketball, boxing, badminton, ice hockey, speed skating, fencing, lacrosse, pole vault, and discus throw • Information technology terms • databases, java, information retrieval, XML, IPv6, speech recognition, P2P, data mining, machine translation, complexity theory, web search engines, probabilistic reasoning, simulated annealing, compiler theory, and randomized algorithms

  24. Name extraction method • Procedure 1. Remove HTML tags. 2. Using a morphological analyzer, split each sentence into morphemes and assign part-of-speech tags. 3. Extract <family name><personal name> sequences. • Performance improved by enriching the dictionaries • 17k family + 12k personal + 2k popular full names: precision 78.4%, recall 75.0% • 21k family + 40k personal + 19k popular full names: precision 93.5%, recall 85.3%
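A rough sketch of the three steps, with the morphological analyzer and the name dictionaries injected. The tokenizer interface and the POS tag value are illustrative assumptions, not the authors' implementation:

```python
import re

def extract_full_names(html, tokenize, family_names, given_names):
    """tokenize(text) -> list of (surface, pos_tag) pairs from a
    Japanese morphological analyzer (hypothetical interface)."""
    text = re.sub(r"<[^>]+>", " ", html)   # 1. remove HTML tags
    tokens = tokenize(text)                # 2. morphemes + POS tags
    names = []
    for (w1, p1), (w2, p2) in zip(tokens, tokens[1:]):
        # 3. take an adjacent <family name><personal name> pair as a full name
        if p1 == p2 == "proper-noun" and w1 in family_names and w2 in given_names:
            names.append(w1 + w2)
    return names
```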

  25. The same-name problem • Not very serious when we query a subject • Different people with the same name rarely work in the same specific area • Japanese family/personal names are diverse • Still, quite a few people share the same name • Solutions (under consideration) • Classify web pages by topic • Analyze the social network around the name (different people have different friends)

  26. Popularity and frequency • Personal name frequencies indicate web users' interest in celebrities. • The number of pages is prone to inflation by automatically generated pages and spam. • The number of servers is a better indicator for finding popular people.
