
Papers for today

Papers for today: "Collaboratively built semi-structured content and Artificial Intelligence: The story so far" (Hovy, Navigli, Ponzetto) and "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia" (Hoffart, Suchanek, Berberich, Weikum).

gloria


Presentation Transcript


  1. Papers for today • Collaboratively built semi-structured content and Artificial Intelligence: The story so far • Hovy, Navigli, Ponzetto • YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia • Hoffart, Suchanek, Berberich, Weikum

  2. Collaboratively built semi-structured content • Main characteristics of collaborative resources that make them attractive for AI and NLP research • Semi-structured resources enable a renaissance of knowledge-rich AI techniques

  3. Unstructured, structured and semi-structured resources • Unstructured • Strengths: easy to harvest at very large scale, many domains, many styles, many languages… • Limitations: knowledge acquisition bottleneck (for complex inference chains), degree and quality of ontologization • Structured (e.g. ontologies…) • Strengths: high quality, beneficial for all kinds of intelligent applications • Limitations: creation and maintenance effort, low coverage, keeping information up to date, the language barrier • Semi-structured • Strengths: high quality and coverage, up-to-date and multilingual

  4. Semi-structured resources • Wikipedia, Wiktionary, Twitter, Yahoo! Answers • Wikipedia • relies on large amounts of manually-input knowledge • provided via massive online collaboration • on the basis of semi-structured (i.e., free-form markup) content • Structure given by redirection pages, internal hyperlinks, interlanguage links, category pages, infoboxes • Markup annotations indirectly encode semantic content and, thus, world and linguistic knowledge manually input by human editors

  5. Filling the knowledge gap • Transforming semi-structured content into machine-readable knowledge • Generating semantics by exploiting the shallow structure found in Wikipedia • Acquiring related terms: thesaurus extraction • Is-a relation: taxonomy induction • Relation extraction • Sentence processing combined with hyperlink information, use of infoboxes

  6. Filling the knowledge gap • Ontologization: building and enriching ontologies (YAGO2) • More relations (meronymy, domain-specific…) • Exploiting structure: some of the methods quantify semantic distances using a relatedness measure computed on the Wikipedia hyperlink graph • A heuristic renaissance: high-quality, semi-structured content enables the acquisition of machine-readable knowledge on a large scale by means of heuristic methods which essentially leverage regularities found within its shallow structure. • Lightweight and scalable rule-based approaches can be devised to exploit the conventions governing the editorial base of collaboratively-generated resources, and capture large amounts of semantic information hidden within them.

  7. Filling the knowledge gap • Named Entity Recognition • Named Entity Disambiguation (associate a name with the appropriate referent) • Word Sense Disambiguation • Wikification: bringing Entity and Word Sense Disambiguation together • Keyword extraction combined with lexical disambiguation: given an input document, a wikification system identifies the most important terms in the document and links (i.e., disambiguates) them to their appropriate entries within an external encyclopedic resource, typically Wikipedia.

  8. Filling the knowledge gap • Computing semantic relatedness: quantifying the strength of association between words. • And beyond the sentence level: • Document clustering and text categorization • Question Answering • YAGO2 includes an extrinsic evaluation of the quality of Wikipedia on the task of answering spatio-temporal questions

  9. Filling the knowledge gap • Information Retrieval • The repository of disambiguated concepts found in Wikipedia (i.e., its articles) provides a semantic space into which documents and queries can be projected in order to perform semantic retrieval beyond the simple bag-of-words model

  10. Exploiting updated content from revision history • Language generation • Leveraging Wikipedia’s revision history as a source of data in order to automatically acquire sentence rewriting models.

  11. Exploiting updated content from revision history • Rewriting tasks: sentence compression, text simplification and targeted paraphrasing • Summarization • ??

  12. The tower of Babel: multilingual resources and applications • Wikipedia’s multilinguality – namely, the availability of interlinked wikipedias in different languages – enables the acquisition of very large, wide-coverage repositories of multilingual knowledge. • Multilingual taxonomies and ontologies • Parallel corpora and thesauri

  13. Some Questions • Tease out the collaborative vs. semi-structured aspects • Collaborative • Over the past decade, a variety of proposals, such as MindPixel and Open Mind, have tried to make manual knowledge acquisition feasible by collecting input from volunteers. See also von Ahn's work, which aims at acquiring knowledge from users by means of online games. However, none of these efforts, to date, has succeeded in producing truly wide-coverage resources able to compete with standard manual resources. • Why? Why do people like to collaborate on Wikipedia and not, as much, on other projects? • What makes Wikipedia so attractive and how can one try to “copy” from it to encourage other collaborative efforts?

  14. Some Questions • Semi-structured • Wikipedia, Wiktionary, Twitter, Yahoo! Answers • What aspects of the structures are most important? • Other resources that have a similar structure, if not the collaborative aspects? • Newspapers? • Forums? • Use revision history to discover something about the contributors?

  15. Papers for today • Collaboratively built semi-structured content and Artificial Intelligence: The story so far • Hovy, Navigli, Ponzetto • YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia • Hoffart, Suchanek, Berberich, Weikum

  16. YAGO2 • Knowledge base in which entities, facts, and events are anchored in both time and space. • YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. • It contains 447 million facts about 9.8 million entities. • The paper describes the extraction methodology, the integration of the spatio-temporal dimension, and the knowledge representation SPOTL, which extends SPO triples to include time and space

  17. Time and space • To know not only that a fact is true, but also when and where it was true. • Presidents of countries or CEOs of companies change. Even capitals of countries or spouses are not necessarily forever…. • The geographical location is a crucial property not just of physical entities such as countries, mountains, or rivers, but also of organization headquarters, or events such as battles, fairs, or people’s births.

  18. Contributions • Integrate entity-relationship-oriented facts with the spatial and temporal dimensions. • Extensible framework for fact extraction (from Wikipedia and other sources) that can tap into infoboxes, lists, tables, categories, and regular patterns in free text, and allows fast and easy specification of new extraction rules • Knowledge representation model tailored to capture time and space, as well as rules for propagating time and location information to all relevant facts • New representation model, SPOTL tuples (SPO + Time + Location), with expressive and easy-to-use querying • SPO triples: subject-property-object triples

  19. YAGO • The YAGO knowledge base is automatically constructed from Wikipedia. • Each article in Wikipedia becomes an entity in the knowledge base (e.g., since Leonard Cohen has an article in Wikipedia, LeonardCohen becomes an entity in YAGO) • 100 manually defined relations (wasBornOnDate, locatedIn…) • 2 million entities and 20 million facts. • Facts: triples of an entity, a relation, and another entity (wasBornIn(LeonardCohen, Montreal)) • SPO triples of subject (S), predicate (P), and object (O), in compatibility with the RDF data model (Resource Description Framework)
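
As a minimal sketch of the triple representation described on this slide (not YAGO's actual storage format), facts can be modelled as (subject, predicate, object) tuples. The `Triple` type and the `objects_of` helper are hypothetical names chosen for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One SPO fact: subject, predicate (relation), object."""
    subject: str
    predicate: str
    obj: str


# Toy knowledge base; the first fact is the example from the slide.
facts = [
    Triple("LeonardCohen", "wasBornIn", "Montreal"),
    Triple("Montreal", "locatedIn", "Canada"),
]


def objects_of(kb, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in kb."""
    return [t.obj for t in kb if t.subject == subject and t.predicate == predicate]


print(objects_of(facts, "LeonardCohen", "wasBornIn"))
```

Keeping facts as flat triples is what makes the representation line up with RDF: every statement has the same three-slot shape, whatever the relation.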

  20. YAGO2 Extraction Architecture • The YAGO2 architecture is based on declarative rules that are stored in text files. • The rules take the form of subject-predicate-object triples, so that they are basically additional YAGO2 facts. • Extraction rules say that if a part of the source text matches a specified regular expression, a sequence of facts shall be generated. • Rules apply not only to Wikipedia infoboxes, but also to Wikipedia categories, article titles, headings, links, or references. • The extraction rules cover some 200 infobox patterns, some 90 category patterns, and around a dozen patterns for dealing with disambiguation pages.
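
The idea of "regular expression matches, so facts are generated" can be sketched roughly as follows. This is an illustration only: the rule table, the `$0`/`$1` placeholders, the `extract_facts` function and the simplified infobox snippet are all assumptions, not YAGO2's real rule syntax or data.

```python
import re

# Hypothetical rule table: (regular expression, fact template).
# "$0" stands for the article's entity, "$1" for the first capture group.
RULES = [
    (re.compile(r"\|\s*birth_place\s*=\s*\[\[([^\]|]+)"), ("$0", "wasBornIn", "$1")),
    (re.compile(r"\|\s*birth_date\s*=\s*(\d{4}-\d{2}-\d{2})"), ("$0", "wasBornOnDate", "$1")),
]


def extract_facts(entity, source_text):
    """Apply every rule to source_text and emit one triple per match."""
    facts = []
    for pattern, template in RULES:
        for match in pattern.finditer(source_text):
            values = {"$0": entity, "$1": match.group(1).strip()}
            # Replace placeholders; literal slots (the relation name) pass through.
            facts.append(tuple(values.get(slot, slot) for slot in template))
    return facts


# Simplified, made-up infobox snippet (real Wikipedia markup is messier).
infobox = """
| birth_date  = 1934-09-21
| birth_place = [[Montreal]], Quebec
"""

for fact in extract_facts("LeonardCohen", infobox):
    print(fact)
```

Because the rules are plain data rather than code, adding a new pattern means adding one more entry to the table, which is the property the slide highlights.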

  21. Time in YAGO2 • YAGO2 contains a data type yagoDate that denotes time points, typically with a resolution of days but sometimes with cruder resolution like years. • YAGO2 assigns begin and/or end of time spans to all entities, to all facts, and to all events, if they have a known start point or a known end point.

  22. Entities and Time • Entities are assigned a time span to denote their existence in time. Four major entity types: • People • relations wasBornOnDate and diedOnDate demarcate their existence times • Elvis Presley is associated with 1935-01-08 as his birthdate and 1977-08-16 as his time of death. Bob Dylan is associated only with the time of birth, 1941-05-24 • Groups such as music bands, football clubs, universities, or companies • the relations wasCreatedOnDate and wasDestroyedOnDate demarcate their existence times • Artifacts such as buildings, paintings, books, music songs or albums • wasCreatedOnDate and wasDestroyedOnDate (e.g., for buildings or sculptures) • Events such as wars, sports competitions like Olympics or world championship tournaments, or named epochs like the “German autumn” • startedOnDate and endedOnDate demarcate their existence times
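
A rough sketch of how an existence time span could be derived from the relation pairs listed above. The `TimeSpan` type and `existence_span` function are hypothetical; only the relation names and the two example people come from the slide.

```python
from datetime import date
from typing import NamedTuple, Optional


class TimeSpan(NamedTuple):
    begin: Optional[date]  # None means no known start point
    end: Optional[date]    # None means no known end point


# Relation pairs that demarcate existence time, as listed on the slide.
EXISTENCE_RELATIONS = [
    ("wasBornOnDate", "diedOnDate"),             # people
    ("wasCreatedOnDate", "wasDestroyedOnDate"),  # groups and artifacts
    ("startedOnDate", "endedOnDate"),            # events
]


def existence_span(entity_facts):
    """Derive an entity's time span from a {relation: date} mapping."""
    for begin_rel, end_rel in EXISTENCE_RELATIONS:
        if begin_rel in entity_facts or end_rel in entity_facts:
            return TimeSpan(entity_facts.get(begin_rel), entity_facts.get(end_rel))
    return None  # no temporal information known for this entity


elvis = {"wasBornOnDate": date(1935, 1, 8), "diedOnDate": date(1977, 8, 16)}
dylan = {"wasBornOnDate": date(1941, 5, 24)}

print(existence_span(elvis))
print(existence_span(dylan))
```

The open-ended span for Bob Dylan mirrors the slide: only the begin point is known, so the end stays empty rather than being guessed.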

  23. Facts and Time • The YAGO2 extractors can find occurrence times of facts from the Wikipedia infoboxes. • Example: BobDylan wasBornIn Duluth is an event that happened in 1941 • Two new relations, occursSince and occursUntil • If the same fact occurs more than once, then YAGO2 will contain it multiple times with different ids. For example, since Bob Dylan has won two Grammy awards, we would have #1: BobDylan hasWonPrize GrammyAward with #1 occursOnDate 1973, and a second #2: BobDylan hasWonPrize GrammyAward (with a different id) and the associated fact #2 occursOnDate 1979.
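
The fact-id mechanism can be illustrated with a small sketch in which every stored fact receives an identifier that other facts may use as their subject. The dictionary layout and the `add_fact` and `occurrence_dates` helpers are assumptions made for illustration, not YAGO2's implementation.

```python
import itertools

_next_id = itertools.count(1)
kb = {}  # fact id -> (subject, predicate, object)


def add_fact(subject, predicate, obj):
    """Store a fact under a fresh id so that other facts can refer to it."""
    fact_id = f"#{next(_next_id)}"
    kb[fact_id] = (subject, predicate, obj)
    return fact_id


# The same SPO fact is stored twice, once per occurrence ...
first = add_fact("BobDylan", "hasWonPrize", "GrammyAward")
second = add_fact("BobDylan", "hasWonPrize", "GrammyAward")
# ... and each occurrence gets its own temporal meta-fact.
add_fact(first, "occursOnDate", "1973")
add_fact(second, "occursOnDate", "1979")


def occurrence_dates(subject, predicate, obj):
    """Collect the dates attached to every occurrence of an SPO fact."""
    ids = {fid for fid, spo in kb.items() if spo == (subject, predicate, obj)}
    return sorted(o for s, p, o in kb.values() if s in ids and p == "occursOnDate")


print(occurrence_dates("BobDylan", "hasWonPrize", "GrammyAward"))
```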

  24. Space • All physical objects have a location in space. • YAGO2 is concerned with entities that have a permanent spatial extent on Earth, for example countries, cities, mountains, and rivers. • New class yagoGeoEntity, which groups together all geo-entities • Subclasses of yagoGeoEntity are: location, body of water, geological formation, real property, facility, excavation, structure, track … • The position of a geo-entity can be described by geographical coordinates, latitude and longitude • YAGO2 harvests geo-entities from two sources: Wikipedia and GeoNames • (GeoNames has information on location hierarchies (partOf), e.g. Berlin is located in Germany, which is located in Europe; it also provides alternate names for each location, as well as neighboring countries)

  25. Entities and Location • Events • Can take place at a specific location, such as battles or sports competitions, where the relation happenedIn holds the place where the event happened. • Groups or organizations • Can have a venue, such as the headquarters of a company or the campus of a university. The location for such entities is given by the isLocatedIn relation. • Artifacts that are physically located somewhere • E.g., the Mona Lisa in the Louvre, where the location is again given by isLocatedIn.

  26. SPOTL(X)-View Model • SPOTLX 6-tuples • SPO triples augmented by Time and Location and keywords or key phrases from the conteXt of sources where the original SPO fact occurs
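
To make the 6-tuple concrete, here is a minimal sketch of a SPOTLX record and a filter over it. The field layout (years for time, a place name for location, a keyword set for context), the `query` function and the context keywords are assumptions for illustration; they are not the paper's query language.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple


class SPOTLX(NamedTuple):
    subject: str
    predicate: str
    obj: str
    time: Optional[Tuple[int, int]]  # (begin year, end year), if known
    location: Optional[str]          # geo-entity, if known
    context: FrozenSet[str]          # keywords from the source text


def query(view, predicate=None, during=None, location=None, keyword=None):
    """Return tuples satisfying every condition that is not None."""
    results = []
    for t in view:
        if predicate is not None and t.predicate != predicate:
            continue
        if during is not None and (t.time is None or not t.time[0] <= during <= t.time[1]):
            continue
        if location is not None and t.location != location:
            continue
        if keyword is not None and keyword not in t.context:
            continue
        results.append(t)
    return results


# Toy view; the context keywords are invented for the example.
view = [
    SPOTLX("BobDylan", "wasBornIn", "Duluth", (1941, 1941), "Duluth",
           frozenset({"singer", "Minnesota"})),
    SPOTLX("BobDylan", "hasWonPrize", "GrammyAward", (1973, 1973), None,
           frozenset({"music", "award"})),
]

for row in query(view, during=1941, location="Duluth"):
    print(row.subject, row.predicate, row.obj)
```

Relaxing a structural condition (`location=`) into a keyword condition (`keyword=`) trades precision for recall, which is the kind of relaxation the task-based evaluation later relies on.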

  27. Size of YAGO2: entities

  28. Size of YAGO2: facts

  29. Evaluation • Of extraction of facts from Wikipedia

  30. Task-Based Evaluation • Answering Spatio-Temporal Questions • 15 questions of the GeoCLEF 2008 GiKiP Pilot • The original intent of the GeoCLEF GiKiP Pilot is: “Find Wikipedia entries / articles that answer a particular information need which requires geographical reasoning of some sort.” • 4 questions working perfectly; • 3 questions working when relaxing a geographical condition from structural to keyword conditions – resulting in a less precise but still useful result set; • 6 questions that could be well formulated as SPOTLX queries but did not return any good result because of the limited coverage of the knowledge base; • 2 questions that could not be properly formulated at all. • A sample of temporal and spatial question blocks from Jeopardy!

  31. Evaluation on Jeopardy

  32. Improving Named Entity Disambiguation by Spatio-Temporal Knowledge • “Dylan performed Hurricane about the black fighter Carter, from his album Desire. That album also contains a duet with Harris in the song Joey.” • Here, the tokens “song”, “album”, and “performed” are strong cues for Joey (Bob Dylan song) instead of the TV series

  33. Spatial Coherence • Two entities that are geographically close to each other are a coherent pair, based on the intuition that texts or text passages (news, blog postings, etc.) usually talk about a single geographic region. • Spatial Coherence is defined between two entities e1,e2 ∈ E with geo-coordinates, where E is the set of all candidates for mapping mentions in a text to canonical entities
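
The formula itself did not survive in this transcript. The sketch below therefore assumes a form that mirrors the temporal coherence described on the next slide: one minus the distance between two entities, normalized by the largest distance between any two candidates. The great-circle (haversine) distance, the function names and the approximate city coordinates are all assumptions for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def spatial_coherence(e1, e2, candidates):
    """ASSUMED form: 1 - dist(e1, e2) / max distance over all candidate pairs."""
    max_dist = max(
        (haversine_km(candidates[a], candidates[b]) for a, b in combinations(candidates, 2)),
        default=0.0,
    )
    if max_dist == 0:
        return 1.0  # single candidate or all candidates at the same place
    return 1.0 - haversine_km(candidates[e1], candidates[e2]) / max_dist


# Approximate coordinates, for illustration only.
candidates = {
    "Berlin": (52.52, 13.405),
    "Paris": (48.8566, 2.3522),
    "Montreal": (45.5017, -73.5673),
}

print(spatial_coherence("Berlin", "Paris", candidates))
print(spatial_coherence("Berlin", "Montreal", candidates))
```

Under this assumed form, nearby candidates score close to 1 and the farthest pair in the candidate set scores 0.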

  34. Temporal Coherence • Defined between two entities e1,e2 ∈ E with existence times, where cet() is the center of an entity’s existence time interval, and the denominator normalizes the distance by the maximum distance of any two entities in the current set of entity candidates, ei,ej ∈ E. • The intuition is that a text usually mentions entities that are clustered around a single or a few points in time
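
The formula image is missing from this transcript. Based only on the description above (distance between interval centers, divided by the maximum such distance among the candidates), one plausible reading is 1 - |cet(e1) - cet(e2)| / max |cet(ei) - cet(ej)|. The sketch below implements that assumed form; the function names and the use of whole years are illustrative choices.

```python
from itertools import combinations


def cet(span):
    """Center of an existence time interval given as (begin_year, end_year)."""
    begin, end = span
    return (begin + end) / 2


def temporal_coherence(e1, e2, spans):
    """ASSUMED form: 1 - |cet(e1) - cet(e2)| / max center distance over candidates."""
    max_dist = max(
        (abs(cet(spans[a]) - cet(spans[b])) for a, b in combinations(spans, 2)),
        default=0.0,
    )
    if max_dist == 0:
        return 1.0  # single candidate or identical centers
    return 1.0 - abs(cet(spans[e1]) - cet(spans[e2])) / max_dist


# Existence intervals in years (birth year, death year).
spans = {
    "ElvisPresley": (1935, 1977),
    "JohnLennon": (1940, 1980),
    "Napoleon": (1769, 1821),
}

print(temporal_coherence("ElvisPresley", "JohnLennon", spans))
print(temporal_coherence("ElvisPresley", "Napoleon", spans))
```

With these candidates, the two twentieth-century musicians form a far more coherent pair than either does with Napoleon, matching the intuition that a text clusters around a few points in time.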

  35. Named Entity Disambiguation • Calculate spatial and temporal coherence between the mention in the input text and all candidate entities in the knowledge base • Use these scores in the weighted formula for entity relatedness
