
Named Entity Recognition in Query




Presentation Transcript


  1. Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li Named Entity Recognition in Query SIGIR 2009 Presentation by Gonçalo Simões Course: Recuperação de Informação (Information Retrieval)

  2. Outline • Basic Concepts • Named Entity Recognition in Query • Conclusions

  3. Outline • Basic Concepts • Information Extraction • Named Entity Recognition • Named Entity Recognition in Query • Conclusions

  4. Information Extraction • Information Extraction (IE) proposes techniques to extract relevant information from non-structured or semi-structured texts • Extracted information is transformed so that it can be represented in a fixed format

  5. Named Entity Recognition • Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (e.g., Person, Location, Time expression)

  6. Named Entity Recognition CENTER FOR INNOVATION IN LEARNING (CIL) EDUCATION SEMINAR SERIES Joe Mertz & Brian McKenzie Center for Innovation in Learning, CMU ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15h. We hope to see you in these sessions. Please direct questions to Pamela Yocca at 268-7675.

  7. Named Entity Recognition Classes/entities: Person, Location, Temporal Expression CENTER FOR INNOVATION IN LEARNING (CIL) EDUCATION SEMINAR SERIES Joe Mertz & Brian McKenzie Center for Innovation in Learning, CMU ANNOUNCEMENT: We are proud to announce that this Friday, February 17, we will have two sessions in our Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15h. We hope to see you in these sessions. Please direct questions to Pamela Yocca at 268-7675.
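The kind of annotation shown in the example above can be mimicked with a toy pattern-based tagger. This is only an illustrative sketch using the names and times from the example text; real NER systems learn statistical sequence models rather than hand-written patterns.

```python
import re

# Hand-written patterns for the seminar announcement above.
# Illustration only: real NER uses learned sequence models.
PATTERNS = {
    "Person": r"Joe Mertz|Brian McKenzie|Pamela Yocca",
    "Temporal Expression": r"\d{1,2}:\d{2}\s?[ap]m|February \d{1,2}",
}

def tag(text):
    """Return (span, class) pairs for every pattern match in text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            spans.append((match.group(0), label))
    return spans

print(tag("At 12:30pm, Joe Mertz will present his session."))
```

Running the tagger on one sentence of the announcement yields the person and time spans, mirroring the manual annotation on the slide.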

  8. NER in IR • NER has been used for some IR tasks • Example: NER + Coreference resolution When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six ...

  9. NER in IR • NER has been used for some IR tasks • Example: NER + Coreference resolution • Instead of using a bag of words, exploit the fact that the highlighted entities correspond to the same real-world entity When Mozart first arrived in Vienna, he’d get up at 6am, settle into composing at his desk by 7, working until 9 or 10 after which he’d make the round of his pupils, taking a break for lunch at 1pm. If there’s no concert, he might get back to work by 5 or 6pm, working until 9pm. He might go out and socialize for a few hours and then come back to work another hour or two before going to bed around 1am. Amadeus preferred getting seven hours of sleep but often made do with five or six ...

  10. Outline • Basic Concepts • Named Entity Recognition in Query • Introduction • NERQ Problem • Notation • Probabilistic Approach • Probability Estimation • WS-LDA Algorithm • Training Process • Experimental Results • Conclusions

  11. Introduction • 71% of the queries in search engines contain named entities • These named entities can be useful for processing the query

  12. Introduction • Motivating Examples • Consider the query “harry potter walkthrough” • The context of the query strongly indicates that the named entity “harry potter” is a “Game” • Consider the query “harry potter cast” • The context of the query strongly indicates that the named entity “harry potter” is a “Movie”

  13. Introduction • Identifying named entities can be very useful. Consider the following examples related to the query “harry potter walkthrough”: • Ranking: Documents about videogames should be pushed up in the rankings (AltaVista search) • Suggestion: Relevant suggestions can be generated, like “harry potter cheats” or “lord of the rings walkthrough”

  14. NERQ Problem • Named Entity Recognition in Query (NERQ) is the task of detecting the named entities within a query and categorizing them into predefined classes • Previous work in this area focused on query log mining, not on query processing

  15. NERQ Problem • NER vs NERQ • The techniques used in NER are designed for natural language texts • They do not perform well on queries because: • queries only have 2-3 words on average • queries are not well formed (e.g., all letters are typically lowercase)

  16. Notation • A single-named-entity query q can be represented as a triple (e,t,c) • e denotes a named entity • t denotes the context • A context is expressed as α#β, where α and β denote the left and right context, respectively, and # denotes a placeholder for the named entity • c denotes the class of e • Example • “harry potter walkthrough” is associated with the triple (“harry potter”, “# walkthrough”, “Game”)
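Splitting a query into an entity and its α#β context can be sketched in code. The function below (an illustrative sketch, not the authors' implementation) enumerates every contiguous span of a query as a candidate entity, with the remaining words forming the context:

```python
def candidate_pairs(query):
    """Enumerate all (entity, context) pairs for a query.

    Every contiguous span of terms is a candidate named entity e;
    the remaining terms form the context alpha#beta, where '#'
    marks the entity's position. Sketch only.
    """
    words = query.split()
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            entity = " ".join(words[i:j])
            alpha = " ".join(words[:i])
            beta = " ".join(words[j:])
            context = f"{alpha} # {beta}".strip()
            pairs.append((entity, context))
    return pairs

print(candidate_pairs("harry potter walkthrough"))
```

For “harry potter walkthrough” this produces, among others, the pair (“harry potter”, “# walkthrough”) used in the slide’s example.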

  17. Probabilistic Approach • The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e • Goal: Find (e,t,c)* such that: (e,t,c)* = argmax(e,t,c) P(q,e,t,c)

  18. Probabilistic Approach • The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e • Goal: Find (e,t,c)* such that: (e,t,c)* = argmax(e,t,c) P(q | e,t,c) P(e,t,c)

  19. Probabilistic Approach • The goal of NERQ is to detect the named entity e in query q and assign the most likely class c to e • Goal: Find (e,t,c)* such that: (e,t,c)* = argmax(e,t,c) ∈ G(q) P(e,t,c)

  20. Probabilistic Approach • For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c) P(e,t,c) = P(t,c | e) P(e)

  21. Probabilistic Approach • For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c) P(e,t,c) = P(t | c,e) P(c | e) P(e)

  22. Probabilistic Approach • For each triple (e,t,c) ∈ G(q), we only need to compute P(e,t,c) P(e,t,c) = P(t | c) P(c | e) P(e) (assuming the context t depends only on the class c, not on the entity e itself) • How to estimate these probabilities?
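The final factorization turns into a simple scoring routine over candidate triples. The probability tables below are made-up illustrative numbers; in the paper they would come from WS-LDA training and query-log statistics.

```python
# Hypothetical probability tables; in the paper these come from
# WS-LDA training and query-log frequencies, not hand-set values.
P_T_GIVEN_C = {("# walkthrough", "Game"): 0.05, ("# cast", "Movie"): 0.04}
P_C_GIVEN_E = {("harry potter", "Game"): 0.3, ("harry potter", "Movie"): 0.5}
P_E = {"harry potter": 1e-4}

def best_triple(candidates, classes):
    """Pick the argmax over (e, t, c) of P(t|c) * P(c|e) * P(e)."""
    best, best_score = None, 0.0
    for entity, context in candidates:
        for cls in classes:
            score = (P_T_GIVEN_C.get((context, cls), 0.0)
                     * P_C_GIVEN_E.get((entity, cls), 0.0)
                     * P_E.get(entity, 0.0))
            if score > best_score:
                best, best_score = (entity, context, cls), score
    return best

print(best_triple([("harry potter", "# walkthrough")], ["Game", "Movie"]))
```

With these toy numbers, the “# walkthrough” context pulls “harry potter” toward the “Game” class, matching the motivating example in the slides.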

  23. Probability Estimation • P(t | c), P(c | e) and P(e) can be estimated through training • The input for the training process is: • Set of seed named entities with the respective classes • Query log

  24. Probability Estimation • Consider the existence of a training data set with N triples from labeled queries T = {(ei,ti,ci) | i=1,…,N} • With this training data set, the learning problem can be formalized as maximizing the log-likelihood of the training data: Σi log P(ei,ti,ci)

  25. Probability Estimation • Building the training corpus for full queries would be difficult and time-consuming when each named entity can belong to several classes • A solution is to collect training data as T = {(ei,ti) | i=1,…,N} together with the list of possible classes for each named entity in training • With this training data set, the learning problem can be formalized as maximizing the marginal log-likelihood: Σi log Σc P(ti | c) P(c | ei) P(ei)


  27. Probability Estimation • P(t | c) and P(c | e) can be predicted using a Topic Model • There is a relationship between Topic Model and NERQ notions • Without loss of generality, the authors decided to use a variation of LDA called WS-LDA

  28. WS-LDA Algorithm • Purely unsupervised topic model learning would not work in NERQ • WS-LDA introduces weak supervision into training by using a set of named entity seeds • It is assumed that a named entity has high probabilities on its labeled classes and very low probabilities on unlabeled classes

  29. WS-LDA Algorithm • Objective function for each named entity: O(e | y, Θ) = log P(w | Θ) + λ C(y, Θ) • y, binary vector that assigns an entity to its respective classes • Θ = {α, β}, parameters of the Dirichlet and Multinomial distributions used in the process • λ, coefficient given by the user that indicates the weight of the supervision constraints • C(y, Θ), constraint function
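The shape of the objective can be sketched numerically. The constraint form below is a simplified stand-in that rewards probability mass on the entity's labeled classes; see the paper for the exact definition of C(y, Θ).

```python
def ws_lda_objective(log_likelihood, class_probs, labeled_classes, lam):
    """Sketch of the weakly supervised objective:
        O = log P(w | Theta) + lambda * C(y, Theta)
    Here the constraint simply sums the probability mass that the
    entity's class distribution places on its labeled classes
    (a simplified stand-in for the paper's constraint function).
    """
    constraint = sum(p for c, p in class_probs.items() if c in labeled_classes)
    return log_likelihood + lam * constraint

# An entity labeled only as "Game": mass on "Game" raises the objective.
print(ws_lda_objective(-10.0, {"Game": 0.7, "Movie": 0.2, "Book": 0.1},
                       {"Game"}, 2.0))
```

Larger λ pushes the optimizer harder toward distributions that agree with the seed labels, which is the trade-off examined in the "Supervision in WS-LDA" experiment later in the deck.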

  30. Training Process • The training process is divided into two steps: • Find queries of the query log containing the named entity seeds • Generate the contexts associated with the named entity seeds in the queries • Generate the query training data (ei,ti) to train the WS-LDA topic model • Use the topic model to learn P(t | c) • Scan the query log with the previously generated contexts to extract new named entities • Use the topic model to learn P(c | e) for each new entity • Estimate P(e) from the frequency of e in the query log
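The first and last of these steps (collecting seed contexts from the log, and estimating P(e) from frequencies) can be sketched as follows; the function and variable names are illustrative, not from the paper.

```python
from collections import Counter

def extract_contexts(query_log, seeds):
    """For every query containing a seed entity, replace the entity
    with '#' to obtain its context alpha#beta. Sketch only."""
    pairs = []
    for query in query_log:
        for entity in seeds:
            if entity in query:
                alpha, _, beta = query.partition(entity)
                context = f"{alpha.strip()} # {beta.strip()}".strip()
                pairs.append((entity, context))
    return pairs

def estimate_p_e(query_log, entities):
    """Estimate P(e) from the relative frequency of each entity."""
    counts = Counter(e for q in query_log for e in entities if e in q)
    total = sum(counts.values())
    return {e: counts[e] / total for e in entities}

log = ["harry potter walkthrough", "harry potter cast", "mario cheats"]
print(extract_contexts(log, ["harry potter", "mario"]))
print(estimate_p_e(log, ["harry potter", "mario"]))
```

On this three-query toy log, “harry potter” yields the contexts “# walkthrough” and “# cast”, and P(“harry potter”) comes out as 2/3 of the seed occurrences.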

  31. Outline • Basic Concepts • Named Entity Recognition in Query • Experimental Results • Data Set • NERQ by WS-LDA • WS-LDA vs Baselines • Supervision in WS-LDA • Conclusions

  32. Data Set • 6 billion queries • Four semantic classes: “Movie”, “Game”, “Book” and “Music” • 180 seed named entities from Amazon, Gamespot and Lyrics, annotated by four human annotators • 120 named entities for training • 60 named entities for testing

  33. Data Set • After training a WS-LDA model with the 120 seed named entities: • 432,304 contexts • About 1.5 million named entities

  34. NERQ by WS-LDA • NERQ conducted on queries from a separate query log with about 12 million queries • 140,000 recognition results • Evaluation with 400 randomly sampled queries

  35. NERQ by WS-LDA • Three types of errors: • Inaccurate estimation of P(e) • Uncommon contexts that were not learned • Queries containing named entities outside the predefined classes

  36. WS-LDA vs baselines • Comparison between WS-LDA and two other approaches: • A deterministic approach that learns the contexts of a class by aggregating all the contexts of named entities of the class • Latent Dirichlet Allocation

  37. WS-LDA vs baselines • Modeling Contexts of classes

  38. WS-LDA vs baselines • Modeling Contexts of classes

  39. WS-LDA vs baselines • Class prediction

  40. WS-LDA vs baselines • Convergence speed

  41. Supervision in WS-LDA • How can λ affect the performance of WS-LDA?

  42. Outline • Basic Concepts • Named Entity Recognition in Query • Experimental Results • Conclusions

  43. Conclusions • NERQ is potentially useful in many search applications • This paper is a first approach to NERQ and proposes a probabilistic approach to perform this task • WS-LDA is presented as an alternative to LDA • Experimental results indicate that the proposed approach can accurately perform NERQ
