
Learning Effective Patterns for Information Extraction



  1. Learning Effective Patterns for Information Extraction Gijs Geleijnse (gijs.geleijnse@philips.com)

  2. Overview • my view on Ontology Population / Information Extraction • short discussion of the global approach with respect to Ontology Population • a subproblem: learning relation patterns • experiments with learned patterns • conclusions

  3. What’s the problem? Information is freely accessible on the web ... but the information on the `traditional’ web is not interpretable by machines. Goal of my research: find, extract and combine information on the web into a machine-interpretable format

  4. What’s the problem? (2) • 1. Come up with a model for the information • 2. Come up with algorithms to populate this model → Ontologies

  5. What’s an ontology?

  6. Populating an ontology • 1. Formulate queries with a known instance: `U2’s album’. • 2. Collect Google search results: `U2’s album Pop ..’, `U2’s album on a flash card’, `U2’s album How to Dismantle..’ • 3. Identify album instances in the results: (Boy), (HtDaAB), ... and relate them to the artist: (U2, Boy), (U2, HtDaAB)
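The query-and-scan loop on this slide can be sketched as below. The `search` argument is a hypothetical stand-in for a Google API call, and the capitalized-phrase heuristic is a toy substitute for real instance recognition; the canned snippets mirror the slide's examples (with the album title fully capitalized to keep the toy regex simple):

```python
import re

def find_albums(artist, search):
    """Toy sketch of the slide's loop: query a pattern with a known
    instance and read candidate albums out of the result snippets.
    `search` is a hypothetical function returning text snippets."""
    query = f"{artist}'s album"
    albums = set()
    for snippet in search(f'"{query}"'):
        # Toy heuristic: take the run of capitalized words that
        # directly follows the query text, if there is one.
        m = re.search(re.escape(query) + r"\s+((?:[A-Z][\w.]*\s?)+)", snippet)
        if m:
            albums.add(m.group(1).strip())
    return {(artist, a) for a in albums}

# Canned snippets instead of live search results.
snippets = ["U2's album Pop was released in 1997",
            "U2's album on a flash card",
            "U2's album How To Dismantle An Atomic Bomb"]
pairs = find_albums("U2", lambda q: snippets)
```

Note that the noisy snippet (`on a flash card`) yields no candidate, which is exactly the instance-identification problem the next slide raises.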

  7. Subproblems of OP How to • identify patterns expressing relations? Amsterdam – Netherlands: `is the capital of’ • identify instances in the Googled texts? `buy i still know what you did last summer on dvd’ • define acceptance functions for instances and relations? `they think Amsterdam is the capital of Germany hahahaha’

  8. Identifying effective relation patterns We want patterns that give many useful results. Three criteria for effectiveness: • 1. A pattern must frequently occur on the web, i.e. it must return many results. • 2. A pattern must be precise, i.e. it must return many useful results. • 3. When relation R is one-to-many, a pattern must be wide-spread, i.e. it must return diverse results.

  9. Identifying effective relation patterns • Approach: • 1. Compose a training set with related items. • 2. Google them to get a set of patterns. • 3. Compute scores for the patterns. • Constraint: don’t Google too often!

  10. Retrieving relation patterns We formulate queries with the elements in the training set: “Michael Jackson * Thriller”, “Thriller * Michael Jackson”. We retrieve all inner-sentence fragments between the instances and normalize them (removing punctuation marks and capitals).
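A minimal sketch of this fragment extraction, assuming plain-text snippets are already available (the real system would obtain them from Google excerpts for the wildcard queries):

```python
import re

def inner_fragments(snippets, artist, album):
    """Sketch of slide 10: for each snippet, take the fragment between
    the two instances (tried in both orders), require it to stay inside
    one sentence, and normalize it (lowercase, punctuation dropped)."""
    patterns = []
    for left, right, l_tag, r_tag in [(artist, album, "[artist]", "[album]"),
                                      (album, artist, "[album]", "[artist]")]:
        for s in snippets:
            # [^.!?]*? forbids sentence boundaries inside the fragment.
            m = re.search(re.escape(left) + r"([^.!?]*?)" + re.escape(right), s)
            if m:
                frag = re.sub(r"[^\w\s]", " ", m.group(1).lower())
                frag = " ".join(frag.split())
                patterns.append(f"{l_tag} {frag} {r_tag}")
    return patterns

snips = ["The hit album Thriller, by Michael Jackson, sold millions.",
         "Michael Jackson recorded Thriller in 1982."]
pats = inner_fragments(snips, "Michael Jackson", "Thriller")
```

The first snippet yields the pattern `[album] by [artist]`, the second `[artist] recorded [album]`.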

  11. Evaluate relation patterns We now have a (long) list of patterns: [album] by [artist] ; [artist]’s [album] ; [album] album cover by [artist] ; [album] di [artist] ; ... Now to compute scores: frequency, precision, wide-spreadness

  12. Evaluate relation patterns • Frequency: we take the frequency of the pattern in the list obtained. • Precision: we google the pattern in combination with an instance and observe the fraction of useful results. E.g. if we google “ABBA’s new album”, we divide the number of excerpts containing an album title by the total number of excerpts found.

  13. Evaluate relation patterns Wide-spreadness: we count the number of different instances found with the query. Score = freq * prec * spr. We only compute the scores of the N most frequent patterns. Number of queries: 2 * |training set| + N * |instance set|
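The scoring of slides 12-13 can be sketched as below; the `precision` and `spread` lookup tables are hypothetical stand-ins for the values the real system estimates with extra Google queries, which is why only the top N most frequent patterns get scored:

```python
from collections import Counter

def score_patterns(fragment_list, precision, spread, top_n):
    """Sketch of slides 12-13: frequency comes straight from the list of
    extracted fragments; precision and spread (both in [0, 1] here) would
    each cost extra queries, so only the top_n most frequent patterns are
    scored.  Score = freq * prec * spr, as on the slide."""
    freq = Counter(fragment_list)
    scores = {}
    for pattern, f in freq.most_common(top_n):
        scores[pattern] = f * precision.get(pattern, 0.0) * spread.get(pattern, 0.0)
    return scores

# Toy data: 5 occurrences of one pattern, 2 of another.
frags = ["[album] by [artist]"] * 5 + ["[album] di [artist]"] * 2
prec = {"[album] by [artist]": 0.8, "[album] di [artist]": 0.4}
spr  = {"[album] by [artist]": 0.9, "[album] di [artist]": 0.5}
scores = score_patterns(frags, prec, spr, top_n=2)
```

Multiplying the three criteria means a pattern that fails any one of them (rare, imprecise, or narrow) scores near zero.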

  14. Case-study: Hearst Patterns Are the Hearst Patterns indeed the most effective patterns for the is-a relation? O = ((country, hyponym), ({all countries}, {‘country’, ‘countries’}), is_a, {(Afghanistan, country), (Afghanistan, countries), (Akrotiri, country), (Akrotiri, countries), ...})

  15. Case-study: Hearst Patterns Both the common Hearst Patterns and relations typical for this setting (countries) perform well.

  16. Case-study: Burger King TREC QA question: In which countries can Burger King be found? O = ((country, restaurant), ({all countries}, {McDonald’s, KFC}), located_in, {(McDonald’s, USA), (KFC, China), ...})

  17. Case-study: Burger King We first find patterns using the method described:

  18. Case-study: Burger King • ... and simultaneously find names of restaurants: • capitalized words X • for which the number of hits for “restaurants like X and” is at least 50
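This instance filter can be sketched as follows; `hit_count` is a hypothetical stand-in for querying Google's result count, and the regex is a toy version of the capitalized-words check:

```python
import re

def restaurant_candidates(snippets, hit_count, threshold=50):
    """Sketch of slide 18: treat runs of capitalized words in the
    snippets as candidate restaurant names, and keep a candidate X only
    if the check query "restaurants like X and" has enough hits.
    `hit_count` is a hypothetical function returning a hit count."""
    candidates = set()
    for s in snippets:
        for m in re.finditer(r"(?:[A-Z][\w'&]*\s?)+", s):
            candidates.add(m.group(0).strip())
    return {c for c in candidates
            if hit_count(f'"restaurants like {c} and"') >= threshold}

# Canned hit-count table instead of live queries.
hits = {'"restaurants like Burger King and"': 1200,
        '"restaurants like The and"': 3}
snips = ["The fries at Burger King are salty."]
names = restaurant_candidates(snips, lambda q: hits.get(q, 0))
```

The hit threshold is what discards spurious capitalized words such as sentence-initial `The`.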

  19. Case-study: Burger King • Finally, we use the patterns found in combination with `Burger King’ to find relations. • Precision: 80% • Recall: 85% • Most errors are due to countries in which Burger King plans to open restaurants.

  20. Conclusions • Automatic pattern selection is successful • Simple methods again lead to good results • Recognition of instances and the filtering of erroneous patterns is still a big challenge • Ontology Population is fun
