
Recognition and Classification of Noun Phrases in Queries for Effective Retrieval



Presentation Transcript


  1. Recognition and Classification of Noun Phrases in Queries for Effective Retrieval • Wei Zhang¹ (wzhang@cs.uic.edu), Shuang Liu² (shuang.liu@ask.com), Clement Yu¹ (yu@cs.uic.edu), Chaojing Sun³ (chaojing@gmail.com), Fang Liu⁴ (fangliu@microsoft.com), Weiyi Meng⁵ (meng@cs.binghamton.edu) • ¹ Department of Computer Science, University of Illinois at Chicago • ² Ask.com • ³ Broadcom Corporation • ⁴ Microsoft • ⁵ Department of Computer Science, Binghamton University • CIKM 2007

  2. Outline • Motivation • Our definitions of the phrases • Proper noun and dictionary phrase recognition • Simple and complex phrase recognition • Experimental results

  3. Motivation • Terms in a query are related semantically • E.g. “John Smith” • Recognize this relationship • Partition the query terms into groups (phrases) • Document retrieval using phrases • Use the phrases in both searching and ranking

  4. Types of Noun Phrases • Phrases that have fixed writing formats • Names of locations, people, companies, … • Well-defined concepts, e.g. “computer science” • Freely written phrases • Not formally defined, but used in real language

  5. Four Types of Noun Phrases • Proper Noun (PN) • A noun phrase that names a specific person, place or thing • First letters of the content words are capitalized • E.g. “John Smith”, “Atlantic Ocean” • Dictionary Phrase (DP) • A phrase that has a definition in a dictionary, excluding PN • These two types may overlap • E.g. “Atlantic Ocean” • Neither type can replace the other • E.g. “Lina’s Pizza” (a PN with no dictionary entry), “public transportation” (a DP that is not a PN)

  6. Four Types of Noun Phrases • Simple Noun Phrase (SNP) • A grammatically valid noun phrase other than PN and DP • 2 words • E.g. “white car”, “good hotel” • Complex Noun Phrase (CNP) • A grammatically valid noun phrase other than PN and DP • 3 or more words • May contain PN/DP/SNP • E.g. “small white car”, “city public transportation”

  7. Noun Phrase Recognition • General procedure • Recognize PN and dictionary phrases first • Then simple and complex noun phrases • For an n-word query • Check the original query • Check the two (n-1)-term sub-sequences • … • Check the (n-1) two-term sub-sequences • n*(n-1)/2 candidates in total • E.g. “World Trade Organization” • Then “World Trade” and “Trade Organization”
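
The candidate enumeration on this slide can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `candidate_phrases` is hypothetical, and the query is assumed to be split on whitespace.

```python
def candidate_phrases(query: str) -> list[str]:
    """Return every contiguous sub-sequence of at least two words, longest first."""
    words = query.split()
    n = len(words)
    candidates = []
    # Lengths n, n-1, ..., 2: the whole query first, the two-word pairs last.
    for length in range(n, 1, -1):
        for start in range(n - length + 1):
            candidates.append(" ".join(words[start:start + length]))
    return candidates


print(candidate_phrases("World Trade Organization"))
# ['World Trade Organization', 'World Trade', 'Trade Organization']
```

For a query of n words there are 1 + 2 + … + (n-1) = n*(n-1)/2 such candidates, which matches the count on the slide.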

  8. Noun Phrase Recognition • Tools for phrase recognition • Dictionaries (Wikipedia, WordNet) • Large text corpus (Google for experiments) • Parsers (Minipar, Collins parser) and POS tagger

  9. PN and DP Recognition • Wikipedia • For proper nouns and dictionary phrases • DP: existence of the entry page • PN: content words in the first instance of the phrase in the main text should be capitalized

  10. PN and DP Recognition • WordNet • For PN and DP recognition • DP: defined in a dictionary • PN: has a hypernym of city, province, country, organization, geographic area, person, syndrome, region, building, or nation.

  11. PN and DP Recognition • Minipar • For PN recognition only • “PN” label in the parse tree • Semantic label of person, country, corpname, location, corpdesig, fname, gname, or date

  12. PN and DP Recognition • List of first names, last names and rules • first_initial last_name • first_initial middle_initial last_name • first_name middle_initial last_name • first_name last_name
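
A minimal sketch of the four name rules, assuming small hypothetical name lists (`FIRST_NAMES`, `LAST_NAMES`) and a hypothetical helper `looks_like_person_name`; the slides do not specify the actual lists or how initials are written.

```python
# Hypothetical sample lists; the real lists are not given in the slides.
FIRST_NAMES = {"john", "mary"}
LAST_NAMES = {"smith", "jones"}


def is_initial(token: str) -> bool:
    """Assumed format: a single letter, optionally followed by periods ("J" or "J.")."""
    stripped = token.rstrip(".")
    return len(stripped) == 1 and stripped.isalpha()


def is_first_name(token: str) -> bool:
    return token.lower() in FIRST_NAMES


def is_last_name(token: str) -> bool:
    return token.lower() in LAST_NAMES


def looks_like_person_name(phrase: str) -> bool:
    """Apply the four rules: (initial|first) [middle initial] last."""
    tokens = phrase.split()
    if len(tokens) == 2:
        first, last = tokens
        return (is_initial(first) or is_first_name(first)) and is_last_name(last)
    if len(tokens) == 3:
        first, middle, last = tokens
        return ((is_initial(first) or is_first_name(first))
                and is_initial(middle)
                and is_last_name(last))
    return False


print(looks_like_person_name("J. Q. Smith"))
print(looks_like_person_name("white car"))
```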

  13. PN and DP Recognition • Text corpus • For less well-known PNs • At least three instances with the first letters of the content words capitalized • Each instance must not be a sub-phrase of a longer PN • E.g. for “Vista Window Company”: “if you choose windows by Vista Window Company, …” counts • but “if you choose windows by Super Vista Window Company, …” does not
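
The corpus test can be approximated as below. This is a sketch under stated assumptions, not the authors' procedure: the function-word list is made up, and "sub-phrase of a longer PN" is approximated by checking whether the word immediately before or after the match is capitalized, which will misjudge sentence-initial neighbours.

```python
import re

# Assumed list of non-content words that need not be capitalized.
FUNCTION_WORDS = {"of", "the", "and", "for", "in", "on", "at", "by"}


def is_capitalized(word: str) -> bool:
    return word[:1].isupper()


def count_pn_instances(phrase: str, texts: list[str]) -> int:
    """Count capitalized occurrences of `phrase` that are not part of a longer capitalized run."""
    target = [w.lower() for w in phrase.split()]
    n = len(target)
    count = 0
    for text in texts:
        tokens = re.findall(r"[A-Za-z0-9']+", text)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if [t.lower() for t in window] != target:
                continue
            # Every content word of the instance must start with a capital letter.
            if not all(is_capitalized(t) for t in window
                       if t.lower() not in FUNCTION_WORDS):
                continue
            before = tokens[i - 1] if i > 0 else ""
            after = tokens[i + n] if i + n < len(tokens) else ""
            # A capitalized neighbour suggests a longer proper noun.
            if is_capitalized(before) or is_capitalized(after):
                continue
            count += 1
    return count


def is_corpus_proper_noun(phrase: str, texts: list[str], min_instances: int = 3) -> bool:
    return count_pn_instances(phrase, texts) >= min_instances


texts = [
    "if you choose windows by Vista Window Company, you save",
    "we called Vista Window Company yesterday",
    "the Vista Window Company catalog is large",
    "if you choose windows by Super Vista Window Company, you save",
]
print(count_pn_instances("Vista Window Company", texts))
```

In this toy corpus the last text is rejected because “Super” precedes the match, so only the first three instances are counted.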

  14. PN and DP Recognition • Overlapped phrases • Search all words together • Count the instances of each phrase in the returned documents • e.g. “Native American Casino” • “Native American” and “American Casino” • Compare ( Count(“Native American”), Count(“American Casino”) )
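
A minimal sketch of the overlap comparison, assuming the documents returned for the all-words search are available as plain strings. The names `count_instances` and `resolve_overlap` are hypothetical, counting is done by simple case-insensitive substring matching, and breaking ties in favour of the first phrase is an assumption (the slides do not say how ties are handled).

```python
def count_instances(phrase: str, documents: list[str]) -> int:
    """Case-insensitive substring count of the phrase across all documents."""
    needle = phrase.lower()
    return sum(doc.lower().count(needle) for doc in documents)


def resolve_overlap(phrase_a: str, phrase_b: str, documents: list[str]) -> str:
    """Keep whichever overlapping phrase occurs more often (ties go to phrase_a)."""
    count_a = count_instances(phrase_a, documents)
    count_b = count_instances(phrase_b, documents)
    return phrase_a if count_a >= count_b else phrase_b


# Toy documents standing in for the search results of "Native American Casino".
returned_docs = [
    "A Native American tribe opened a casino; Native American leaders attended.",
    "The american casino industry and native american gaming law.",
]
print(resolve_overlap("Native American", "American Casino", returned_docs))
# Native American
```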

  15. SNP and CNP Recognition • Only check the phrase candidates that • are not sub-phrases of a recognized PN/DP • do not overlap with a recognized PN/DP

  16. SNP and CNP Recognition • Implicit phrases • Joined by “and” / “or” • “main and contributing factor” → “main factor” and “contributing factor”
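
One way to expand such a conjunction is sketched here. It assumes the simplest pattern only: a single-word modifier on each side of the conjunction, followed by a shared head. The slides do not describe how the general case is handled, and `expand_implicit_phrases` is a hypothetical name.

```python
CONJUNCTIONS = {"and", "or"}


def expand_implicit_phrases(query: str) -> list[str]:
    """Expand "<mod1> and|or <mod2> <head...>" into "<mod1> <head>" and "<mod2> <head>"."""
    tokens = query.split()
    for i, token in enumerate(tokens):
        # Need one word before the conjunction and at least two after it.
        if token.lower() in CONJUNCTIONS and 0 < i < len(tokens) - 2:
            left, right = tokens[i - 1], tokens[i + 1]
            head = tokens[i + 2:]
            return [" ".join([left] + head), " ".join([right] + head)]
    return []


print(expand_implicit_phrases("main and contributing factor"))
# ['main factor', 'contributing factor']
```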

  17. SNP and CNP Recognition • Head word replacement • Replace the whole phrase by its head word • Collins parser • Labels the noun phrases • E.g. NP/sedan (head word): Best/JJS Compact/JJ Sedan/NN

  18. SNP and CNP Recognition • Phrase verification • To verify that a phrase is actually used in real-world text • For CNP: this also means finding all the words within a text window • E.g. “Colin Farrell wallpaper” and “wallpaper of Colin Farrell”

  19. SNP and CNP Recognition • Overlapped phrases • Two potential SNP/CNP: search all words together and compare the numbers of instances • “sony dvd handycam” → “sony dvd” and “dvd handycam”

  20. Document Retrieval Using Phrases • Search a phrase in a document • Exact match: PN/DP • Search all words in a text window: SNP/CNP
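
The two matching modes can be contrasted with a short sketch. The window size of 5 tokens is an assumption (the slides give no value), and `exact_match` / `window_match` are hypothetical names.

```python
import re


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def exact_match(phrase: str, document: str) -> bool:
    """PN/DP style: the words must appear adjacent and in order."""
    p, d = tokenize(phrase), tokenize(document)
    n = len(p)
    return any(d[i:i + n] == p for i in range(len(d) - n + 1))


def window_match(phrase: str, document: str, window: int = 5) -> bool:
    """SNP/CNP style: all words appear, in any order, within `window` consecutive tokens."""
    words = set(tokenize(phrase))
    d = tokenize(document)
    return any(words <= set(d[i:i + window]) for i in range(len(d)))


doc = "download a wallpaper of Colin Farrell here"
print(exact_match("Colin Farrell wallpaper", doc))   # False
print(window_match("Colin Farrell wallpaper", doc))  # True
```

The window test accepts “wallpaper of Colin Farrell” as an instance of “Colin Farrell wallpaper”, which the exact test rejects.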

  21. Document Retrieval Using Phrases • Sim(Query, Doc) = <Sim_P, Sim_T> • Phrase similarity • Sim_P(P_i) = idf(P_i) for each query phrase P_i found in the document • Sim_P = sum over i of Sim_P(P_i) • Term similarity • Sim_T: Okapi BM25 similarity • Document ranking • D1 is ranked higher than D2 if • (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
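
The ranking rule is a lexicographic comparison of the pair (phrase similarity, term similarity), which Python's tuple ordering expresses directly. The sketch below uses made-up idf values and term scores purely for illustration; `ScoredDoc`, `phrase_similarity` and `rank` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    sim_p: float  # sum of idf over the query phrases found in the document
    sim_t: float  # term similarity, e.g. an Okapi BM25 score computed elsewhere


def phrase_similarity(matched_phrases: list[str], idf: dict[str, float]) -> float:
    return sum(idf[p] for p in matched_phrases)


def rank(docs: list[ScoredDoc]) -> list[ScoredDoc]:
    # Tuples compare element by element: sim_p decides, sim_t breaks ties.
    return sorted(docs, key=lambda d: (d.sim_p, d.sim_t), reverse=True)


# Illustrative idf values, not taken from the paper.
idf = {"public transportation": 2.0, "white car": 1.0}
docs = [
    ScoredDoc("D1", phrase_similarity(["public transportation"], idf), 3.5),
    ScoredDoc("D2", phrase_similarity(["public transportation", "white car"], idf), 1.0),
    ScoredDoc("D3", phrase_similarity(["public transportation"], idf), 4.0),
]
print([d.doc_id for d in rank(docs)])
# ['D2', 'D3', 'D1']
```

D2 wins despite its low term score because it matches more phrase weight; D3 beats D1 only on the term-similarity tie-break.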

  22. Experimental Results • Phrase recognition experiments • Tuned by using TREC queries

  23. Experimental Results • Phrase recognition experiments • Tested by using Web queries

  24. Experimental Results • Performance of individual tools • Wikipedia is better than WordNet and Minipar • Need for a complete dictionary • Collins parser alone is not enough for SNP/CNP recognition • Lack of real world usage information

  25. Experimental Results • Document retrieval experiments • Ad-hoc TREC 6, 7 and 8, robust TREC 12, 13 and 14 • Three configurations • (1) Retrieval without using phrases • (2) Using Wikipedia for PN/DP and just the Collins parser for SNP/CNP • (3) Using phrases from the full recognition algorithm • 33% MAP increase and 44.27% GMAP increase from (1) to (2) • 5.8% MAP increase and 12.58% GMAP increase from (2) to (3)
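
For readers unfamiliar with the two measures: MAP is the arithmetic mean of per-query average precision (AP), and GMAP is the geometric mean, which rewards gains on poorly performing queries more strongly. The sketch below uses toy AP values, not the paper's data; the small floor `eps` that keeps the logarithm finite is an assumption of this sketch.

```python
import math


def mean_ap(aps: list[float]) -> float:
    """MAP: arithmetic mean of per-query average precision."""
    return sum(aps) / len(aps)


def geometric_mean_ap(aps: list[float], eps: float = 1e-5) -> float:
    """GMAP: geometric mean of per-query AP, with a small floor to avoid log(0)."""
    return math.exp(sum(math.log(max(ap, eps)) for ap in aps) / len(aps))


def relative_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0


# Toy per-query AP values: only the weakest query improves.
baseline = [0.50, 0.40, 0.01]
improved = [0.50, 0.40, 0.04]
print(relative_increase(mean_ap(baseline), mean_ap(improved)))
print(relative_increase(geometric_mean_ap(baseline), geometric_mean_ap(improved)))
```

With these toy values the GMAP gain is much larger than the MAP gain, which illustrates why the two figures can differ as on the slide.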

  26. Conclusions • Our algorithm can effectively recognize the four types of phrases in short Web queries • The recognized phrases help improve retrieval effectiveness

  27. Questions? • wzhang@cs.uic.edu • http://www.cs.uic.edu/~wzhang/
