
Extracting stories from heterogeneous information sources




  1. Extracting stories from heterogeneous information sources V.S. Subrahmanian, M. Fayzullin University of Maryland M. Albanese, C. Cesarano, A. Picariello Univ. of Napoli, Italy

  2. Talk Outline • Motivating examples • Story Architecture • The Model • Conclusions KF Workshop

  3. STORY Participants • Joint research project • University of Maryland, College Park, USA • V.S. Subrahmanian • M. Fayzullin • Amelia Sagoff • Università di Napoli, Federico II • Antonio Picariello • Massimiliano Albanese • Carmine Cesarano

  4. Motivating example: Pakistani Nuclear Scientists • Nuclear proliferation is the issue of the day • Complex web of • Nuclear scientists • Personnel at weapons locations • Arms dealers • Customs officials • Shipping companies • Front companies • Manufacturers • … • Nuclear monitors may want the “story” on any person or place or event to decide if further investigation is warranted. Only the relevant data should be presented to the analyst.

  5. Motivating example: soldier in Baghdad • Soldier in Baghdad sees a car pulling up towards a checkpoint. Wants the quick story on: • Owner of the car • Associates of the car’s owner • Estimated threat • Soldier is driving a truck. Wants the quick story on his route: • Are certain intersections dangerous? • Are the residents sympathetic to US troops? • Are there nearby friendly units? • Any recent reports of gunfire? • Any suspicious change in activity levels? • Only the relevant data should be presented to the soldier.

  6. Motivating example: US Immigration • Customs official sees a traveller • Wants the quick story on him • Where does he work? • Who does he work for? • What is his area of expertise? • Any warrants? • Is he on a watch list? • Who are his associates – anyone suspicious? • Just the right data should be presented to him.

  7. A motivating example: Pompeii • Pompeii is a spectacular archaeological site. • Visitor experience can be greatly improved by: • Automatically notifying visitors of interesting phenomena without posting extra signs • Allowing visitors to explore the stories of various monuments, paintings, sculptures, etc. in Pompeii. • Allowing visitors to explore the stories of the characters, events and places depicted in these monuments, paintings, sculptures, etc. • Visitors' interests vary, so information about exhibits must adapt in real time to those interests to enhance the visitor's experience.

  8. 3 Applications • [75% done] Pompeii • [Preliminary demo available, about 50% done] Pakistani Nuclear scientists • [Just initiated – demo expected in Jan 2004] Tribes and tribal leaders in the Pakistan/Afghanistan Borderlands

  9. Pompeii Visitors Visitor arrives at ticket counter and buys ticket.

  10. Pompeii Visitors ANALOG: Soldier in Baghdad sets out on a mission. Visitor arrives at ticket counter and buys ticket.

  11. Pompeii Visitors Ticket agent asks if they would like to use the story facility and if they would like to use their cell phone and/or PDA to get stories of interest to them.

  12. Pompeii Visitors ANALOG: Soldier in Baghdad chooses to receive stories on his radio or PDA. Ticket agent asks if they would like to use the story facility and if they would like to use their cell phone and/or PDA to get stories of interest to them.

  13. Pompeii Visitors As visitor walks through Pompeii, STORY identifies where he is and predicts where he might go in the future (probabilistically). Ex. if he is at location L, it might predict that he will go to the House of the Vetti.

  14. Pompeii Visitors ANALOG: As soldier drives through Baghdad, STORY identifies where he is and correlates where he will go with his route plan. As visitor walks through Pompeii, STORY identifies where he is and predicts where he might go in the future (probabilistically). Ex. if he is at location L, it might predict that he will go to the House of the Vetti.

  15. Pompeii Visitors Based on this prediction of where he might go in future, it identifies potential stories he might be interested in and downloads parts of these stories to his PDA/cell. E.g. it might download stories about Pentheus. [Figure labels: “See items”; “You are here (Triclinium in the House of the Vetti)”]

  16. Pompeii Visitors ANALOG: STORY finds stories satisfying the soldier’s conditions of interest and downloads them to his PDA or to the nearest radio broadcast location. Based on this prediction of where he might go in future, it identifies potential stories he might be interested in and downloads parts of these stories to his PDA/cell. E.g. it might download stories about Pentheus. [Figure labels: “See items”; “You are here (Triclinium in the House of the Vetti)”]

  17. Pompeii Visitors The visitor chooses which story he is interested in. STORY dynamically generates the story and delivers it to the user’s PDA/cell phone, e.g. user might choose story of Pentheus.

  18. Pompeii Visitors ANALOG: STORY delivers the story to the soldier. He can then further interact with the story if needed using voice and cursor prompts. The visitor chooses which story he is interested in. STORY dynamically generates the story and delivers it to the user’s PDA/cell phone, e.g. user might choose story of Pentheus.

  19. Pompeii Visitors The user can choose to explore the story in greater detail (e.g. if he is seeing the story of Pentheus, he can also explore the story of Agave).

  20. The system • STORY: Spatio-Temporal Object RepositorY

  21. Story • A story is a narrative, true or presumed to be true, relating to important events and celebrated persons of a more or less remote past; a historical relation or anecdote. (Oxford English Dictionary). • We adopt the view that narratives in the context of computing are really interactive multimedia presentations. • Such a view allows a straight piece of text to be a special case of a narrative, or a straight piece of speech to be a narrative.

  22. Considerations about stories • The concept of story is dramatically different for the examples mentioned earlier. • A visitor to Pompeii cares about mythological, historical, artistic facts. • A soldier in Baghdad cares about security and mission-related facts: who the people around him are, not who is depicted on the walls. • A nuclear analyst cares about the nuclear networks – who is selling what to whom? Who is moving the money? What front companies are involved? • What goes into a story depends not only on basic facts about the entity of interest but also on the application domain and specific items of interest to the user.

  23. STORY System • STORY is a system for • extracting story content from multiple distributed data sources (databases, web pages, digitized historical documents, maps, etc.) • creating a succinct story based on the above content that adapts to user preferences and interests in real time and • delivering these stories to users across wireless, wired, and cellular networks and multiple output devices.

  24. Story Architecture

  25. Main Components • STORY application developer component • Specifies what data sources should be accessed in order to produce stories, and what criteria define a good story. • It includes specifications • of context • of when stories should be generated. • STORY end user component • Specifies what hardware she would like her stories to be rendered on (e.g. PDA, laptop, cell phone), what constitutes a “good” story, and methods to analyze collections of stories and render judgements about them.

  26. The “Death of Pentheus” painting • Who was Pentheus? • Who punished him? • Why was he punished? • What do we know about his family? • Was this event depicted by other artists at the same period or in earlier periods or in later periods in the same or different geographical region? • What is the story behind the Vetti?

  27. Entities • Entity: Describes an “object” of interest. • All the known people depicted via images and sculptures • People related in some way • Places • In the case of the soldiers in Baghdad, terror groups, front companies, etc. • There is no need to enumerate this set of entities. They are dynamically created in STORY. Inside Story

  28. Attributes • We assume the existence of some set $\mathcal{A}$ whose elements are called attributes. • An attribute $A \in \mathcal{A}$ has a domain dom(A). • A set $\mathcal{A}$ of ordinary attributes is associated with a set E of entities iff $E \subseteq \bigcup_{A \in \mathcal{A}} \mathrm{dom}(A)$ • …. Each entity can be characterized by the values of an ordinary attribute! • Example: • Attribute: mother, Value: Agave • Attribute: cartag, Value: AMD 124 • Attribute: employers, Value: {ibm, hp}

  29. Temporal attributes • Time Varying Attribute (TVA) = (A, dom(A)) • Timevalue for a TVA = a set of triples (v_i, L_i, U_i) • v_i are values; L_i, U_i are integers or UNKNOWN • Must satisfy the requirement that an attribute does not have two distinct values at the same time. • Example: • Attribute: job • Value = {(cardinal, 1500, 1509), (pope, 1510, 1545)} • Example: • Attribute: worked-for • Value = {(ibm, 1990, 1998), (hp, 1999, 2004)}
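The requirement that a time-varying attribute never holds two distinct values at the same time can be checked mechanically. The following is a minimal sketch, not part of STORY: it assumes closed intervals, uses `None` as a stand-in for UNKNOWN, and treats triples with an unknown bound as never conflicting. The function names are hypothetical.

```python
from typing import Optional

# A timevalue is a set of (value, lower, upper) triples.
# None stands in for UNKNOWN (an assumption of this sketch).
Triple = tuple[str, Optional[int], Optional[int]]


def overlaps(a: Triple, b: Triple) -> bool:
    """True if both triples have known bounds and share at least one time point."""
    _, la, ua = a
    _, lb, ub = b
    if None in (la, ua, lb, ub):
        return False  # assumption: unknown bounds are not treated as conflicts
    return la <= ub and lb <= ua  # closed-interval overlap test


def is_consistent(timevalue: set[Triple]) -> bool:
    """True if no two triples with distinct values overlap in time."""
    triples = list(timevalue)
    for i, a in enumerate(triples):
        for b in triples[i + 1:]:
            if a[0] != b[0] and overlaps(a, b):
                return False
    return True


job = {("cardinal", 1500, 1509), ("pope", 1510, 1545)}
worked_for = {("ibm", 1990, 1998), ("hp", 1999, 2004)}
```

Both slide examples pass the check; changing `("ibm", 1990, 1998)` to `("ibm", 1990, 1999)` would make `worked_for` inconsistent under the closed-interval assumption.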

  30. Story Schema • A story schema is a pair (E,A), where E is a set of entities and A is a set of attributes of interest • Examples • Set of entities in Pompeii: • Set of all objects in Pompeii • Set of all objects and events depicted • Any entities related to the previous categories. • Set of all people/organizations associated with Iraqi cars • Set of all car ids • Set of owners of such cars • Set of people associated with such owners via one or many links.

  31. Story Instance • An instance w.r.t. story schema (E,A) is a partial mapping • Input: • an entity of E and an attribute of A • Output: • a value v in dom(A) if A is an ordinary attribute, or • a timevalue if A is a TVA
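One way to picture a story instance is as a dictionary keyed by (entity, attribute) pairs, where missing keys model the "partial" part of the mapping. This is only an illustrative sketch: the schema contents, the `lookup` and `conforms` helpers, and the use of `None` for unknown time bounds are assumptions, not STORY's actual data structures.

```python
# Hypothetical schema (E, A) for the Pentheus example.
ENTITIES = {"Pentheus", "Agave", "Bacchus"}
ORDINARY = {"mother"}
TIME_VARYING = {"occupation", "enemy"}

# Partial mapping: (entity, attribute) pairs that are absent are undefined.
# None inside a triple stands in for an UNKNOWN time bound.
instance = {
    ("Pentheus", "mother"): "Agave",
    ("Pentheus", "occupation"): {("king", None, None)},
}


def lookup(entity, attribute):
    """Return the value or timevalue, or None where the mapping is undefined."""
    return instance.get((entity, attribute))


def conforms(inst) -> bool:
    """Check that every key is in the schema and every TVA holds a set of triples."""
    for (entity, attribute), value in inst.items():
        if entity not in ENTITIES:
            return False
        if attribute in TIME_VARYING:
            if not isinstance(value, (set, frozenset)):
                return False
            if not all(isinstance(t, tuple) and len(t) == 3 for t in value):
                return False
        elif attribute not in ORDINARY:
            return False
    return True
```

Here `lookup("Pentheus", "mother")` yields an ordinary value, `lookup("Pentheus", "occupation")` yields a timevalue, and any pair not in the dictionary is simply undefined.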

  32. Example • Pentheus was a Greek king who was an enemy of the god Bacchus. Angered by this, the Maenads (who were priestesses worshipping Bacchus) transformed Pentheus into an animal and had his mother, Agave, kill him. • A story schema (together with associated values) for this could be the following: • Occupation: is a time-varying attribute specifying Pentheus' occupation. • The value of this attribute could be king which says that he was king at an unknown time. • Enemy: is a time-varying attribute specifying who were enemies of Pentheus. • The value of this attribute could be Bacchus, Maenads. Notice that Bacchus and the Maenads are other entities. • Punishment: is a time-varying attribute specifying the punishments of Pentheus. • The value of this attribute could be “transformed into an animal”, “killed” • Mother: is an ordinary attribute having the value “Agave”.

  33. Example: US Immigration • Entity: a visitor to the US • Attributes: • Name • Citizenship • Passport-number • Photo • Biometric attributes • Purpose of visit • Countries travelled to (TVA) • Area of technical interests • Known suspicious affiliations

  34. Pentheus Story [Figure; callout label: “Irrelevant time value”]

  35. How the system works • The story application developer first specifies a set of data sources that are to be accessed. • the WWW • a relational database • an object oriented database • a database of web documents • flat files • a set of URLs • some combination of the above.

  36. KF Workshop

  37. How the system works (2) • The story application developer then specifies a set of properties (not their values) of a place or a person or an artifact or an event that an end-user might be interested in. • The properties of interest may be things like father, mother, occupation, collaborators and so on. • The developer also associates priorities with the properties – these depend on his application needs.

  38. Attribute Extractor • Uses the mediator as well as WordNet to pose queries to appropriate data sources. • It extracts information about the values of the attributes involved. • For example, in our Pentheus application, the attribute extractor accesses HTML pages and extracts from those pages the names of all entities involved, and for each such entity, it tries to check whether a given attribute has a value. • We have also defined algorithms to extract information from relational, flat-file, and XML sources.

  39. Attribute Extractor (2) • Results returned by the attribute extractor • a set of (entity, attribute, value) triples • a set of such triples with an associated time stamp • These results can be stored in an RDF database, a relational DBMS, or an XML DBMS. • We have also implemented a web spider that can crawl over a set of data sources and populate the attribute database.

  40. Source Access Table (SAT) • We assume that our data sources have an associated application program interface (API) • The SAT describes how to extract an attribute's value using a source's API • A SAT-tuple is (A, s, f_{A,s}) • f_{A,s} is a partial function (body of software code) that maps objects to values or time values • A SAT is a finite set of SAT-tuples • Basically the SAT specifies what code (f_{A,s}) to use to extract values of attribute A w.r.t. source s. • Size of the SAT is at most O(m*n) where m is the number of sources and n is the number of attributes. • Methods to process such f’s have been previously developed in many systems, e.g. • TSIMMIS from Stanford • HERMES, IMPACT from UMD • Etc.
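The SAT can be pictured as a lookup table from (attribute, source) pairs to extraction functions. The sketch below is illustrative only: the two in-memory "sources" (`myth_db`, `hr_db`), the lambdas standing in for source API calls, and the `extract` helper are all hypothetical, and `None` models the partial function being undefined.

```python
from typing import Callable, Optional

# Two hypothetical in-memory sources standing in for real APIs.
myth_db = {"Pentheus": {"mother": "Agave"}}
hr_db = {"John Doe": [("ibm", 1990, 1998), ("hp", 1999, 2004)]}

# Each SAT-tuple (A, s, f_{A,s}) becomes one dictionary entry:
# key (A, s), value f_{A,s}. Returning None means "undefined here".
Extractor = Callable[[str], Optional[object]]
sat: dict[tuple[str, str], Extractor] = {
    ("mother", "myth_db"): lambda e: myth_db.get(e, {}).get("mother"),
    ("worked-for", "hr_db"): lambda e: set(hr_db[e]) if e in hr_db else None,
}


def extract(table, attribute, entity):
    """Collect (source, value) for every source whose function is defined on entity."""
    results = []
    for (attr, source), f in table.items():
        if attr == attribute:
            value = f(entity)
            if value is not None:
                results.append((source, value))
    return results
```

With m sources and n attributes the dictionary has at most m*n entries, matching the size bound on the slide.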

  41. Valid and Full instance • Intuitively an instance is • valid w.r.t. some source access table if every fact (i.e. every assignment of value to an attribute for an entity) is supported by at least one source. • full when it accumulates all the facts reported by various sources. • NOT ENOUGH. • Generalization needed • Conflict management needed
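The intuition behind valid and full instances can be sketched by treating an instance as a set of (entity, attribute, value) facts. This is a simplification under stated assumptions: the two sources, the table layout, and the helper names are hypothetical, and values are plain strings so that facts can live in a set.

```python
# Hypothetical sources; each maps an entity to the value it reports.
source_a = {"Pentheus": "Agave"}
source_b = {"Pentheus": "king"}

# A tiny source access table: (attribute, source name) -> extraction function.
sat = {
    ("mother", "source_a"): source_a.get,
    ("occupation", "source_b"): source_b.get,
}
entities = {"Pentheus", "Agave"}


def reported_facts(table, ents):
    """All (entity, attribute, value) facts that at least one source reports."""
    facts = set()
    for (attribute, _source), f in table.items():
        for entity in ents:
            value = f(entity)
            if value is not None:
                facts.add((entity, attribute, value))
    return facts


def is_valid(instance_facts, table, ents) -> bool:
    """Valid: every fact in the instance is supported by some source."""
    return instance_facts <= reported_facts(table, ents)


def full_instance(table, ents):
    """Full: the instance accumulates every fact reported by the sources."""
    return reported_facts(table, ents)
```

Because the full instance simply accumulates whatever the sources report, two sources that disagree would both contribute facts, which is one way to see why conflict management is listed as still needed.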

  42. Extraction of attribute values • Web sources • The web is searched for pages related to the entity of interest (a person, a place, or an event) in a specific domain (Greek Mythology, Roman History, …) using a metasearch engine such as Google. • An HTML parser analyzes the pages returned by the search engine and extracts significant pieces of text, taking into account the structure of the page. • A lexical analysis is performed using WordNet. • The result of this step is a tagged version of the original text, in which each word is labeled with its corresponding part of speech.

  43. Extraction of attribute values • An entity detection algorithm recognizes, based on some heuristics we have developed, the names of people, organizations, places, etc. occurring in the text. • This algorithm can be trained on large data corpora to acquire a knowledge base that improves its performance. • The algorithm is also capable of recognizing different representations of the same name (e.g. Dr. H.J. Smith, H.J. Smith, Hanan J. Smith) and classifying the names (e.g. Dr. H.J. Smith is a person while Glass Inc. is a company).

  44. Extraction of attribute values • Some minor tasks • Pronoun resolution • the issue of mapping a pronoun into an entity named somewhere • Word sense disambiguation • Each word may represent different parts of speech and may have several meanings depending on the context • The result of executing these algorithms is a rewritten and unambiguous version of the original text.

  45. Extraction of attribute values • A semantic parser applies a set of rules that, based on the structure of sentences, permit us to deduce the entity-attribute-value triples. • Semantic rules are of the form Tail → Head • Tail is a condition to be evaluated on a sentence of words from the text. • If this condition is satisfied, the head says how to extract one or more entity-attribute-value triples from the sentence. • Our system contains over 300 rules. We plan to increase this to around 1000 in the next 3 months.
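The tail/head structure of a semantic rule can be illustrated with a toy rule engine. This sketch swaps in regular expressions for the tail condition, which is an assumption made for brevity; the slides describe rules that operate on tagged, disambiguated sentences rather than raw strings. The `Rule` class, the two example rules, and `extract_triples` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable

Triple = tuple[str, str, str]  # (entity, attribute, value)


@dataclass
class Rule:
    tail: re.Pattern                          # condition evaluated on a sentence
    head: Callable[[re.Match], list[Triple]]  # how to build triples from a match


RULES = [
    # "<V> was the mother/father of <E>"  ->  (E, mother/father, V)
    Rule(
        tail=re.compile(
            r"^(?P<v>[A-Z]\w+) was the (?P<a>mother|father) of (?P<e>[A-Z]\w+)\.?$"
        ),
        head=lambda m: [(m["e"], m["a"], m["v"])],
    ),
    # "<E> was king/queen of <place>"  ->  (E, occupation, king/queen)
    Rule(
        tail=re.compile(r"^(?P<e>[A-Z]\w+) was (?P<v>king|queen) of [A-Z]\w+\.?$"),
        head=lambda m: [(m["e"], "occupation", m["v"])],
    ),
]


def extract_triples(sentence: str) -> list[Triple]:
    """Apply every rule whose tail matches and collect the triples its head yields."""
    triples: list[Triple] = []
    for rule in RULES:
        match = rule.tail.match(sentence)
        if match:
            triples.extend(rule.head(match))
    return triples
```

For instance, "Agave was the mother of Pentheus." satisfies only the first tail, and its head emits the triple with Pentheus as the entity, mother as the attribute, and Agave as the value.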

  46. User can cut and paste a sentence and specify the entity, attribute, value in it. STORY learns a more general rule from it. [Figure label: “Learned rule”]

  47. XML sources • Consider an XML node N = <name, value, {c1,…,cn}> where {c1,…,cn} are children nodes • Assuming that N is a root node in an XML document, and nodes may act both as entities and the attributes…. • e is an entity • A is an attribute • Example: <person> <name> John Doe </name> <height> 170 </height> <eyes> black </eyes> … </person>

  48. GetXMLAttr(N,e,A) • begin • Result := ∅ • if N.value = e or N.name = e then • for each child c of N such that c.name = A do • Result := Result ∪ {c.value} • end for • else • for each child c of N do • Result := Result ∪ GetXMLAttr(c,e,A) • end for • end if • return Result • end
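The GetXMLAttr pseudocode translates almost line for line into Python. The sketch below uses the standard library's `xml.etree.ElementTree` and assumes that a node's name is its tag and its value is its stripped text; the function name `get_xml_attr` and the sample document (the slide's example without the elided part) are illustrative choices.

```python
import xml.etree.ElementTree as ET


def get_xml_attr(node, entity, attribute):
    """Return the set of values of `attribute` found under nodes that match `entity`."""
    value = (node.text or "").strip()
    result = set()
    if value == entity or node.tag == entity:
        # The node denotes the entity: collect matching children.
        for child in node:
            if child.tag == attribute:
                result.add((child.text or "").strip())
    else:
        # Otherwise keep searching deeper in the tree.
        for child in node:
            result |= get_xml_attr(child, entity, attribute)
    return result


doc = ET.fromstring(
    "<person><name> John Doe </name><height> 170 </height>"
    "<eyes> black </eyes></person>"
)
heights = get_xml_attr(doc, "person", "height")
```

Read literally, the procedure finds attributes only when the entity is given by the name or value of the node that owns the attribute children. Calling it with `"John Doe"` on this document returns an empty set, because the `name` node matches but has no children of its own.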

  49. CPR • There are good stories and bad stories • The STORY architecture supports the goals of succinctness and exploration and creates stories with respect to three important parameters: • the priority of the story content, • the continuity of the story, • the non-repetition of facts covered by the story • We want to deliver the most important facts to the intended audience. • So far, we have focused primarily on priority and non-repetition, worrying less about continuity.

  50. CPR examples • In the story of Pentheus, it makes more sense to first say that his parents were Cadmus and Agave, then say he reigned as King of Thebes, and then explain why he was killed. • This rendering of the story is in chronological order, ensuring a kind of temporal continuity. • Other measures of continuity are also possible within the STORY framework. • A repetition function evaluates how much repetition there is in a given story. • For example, in the case of Pentheus, we may extract the fact that Agave is a parent of Pentheus, and that Agave is the mother of Pentheus. Including both these facts in a story is repetitive as the latter fact subsumes the former.
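A repetition function of the kind described above could, for example, count the facts in a story that are already implied by another fact in the same story. The sketch below is one possible reading, not STORY's actual measure: the `IMPLIES` table (mother and father each imply parent), the `subsumes` test, and the counting scheme are all assumptions.

```python
# Hypothetical table: a specific attribute -> the more general attributes it implies.
IMPLIES = {"mother": {"parent"}, "father": {"parent"}}

Fact = tuple[str, str, str]  # (entity, attribute, value)


def subsumes(specific: Fact, general: Fact) -> bool:
    """True if `specific` makes `general` redundant (same entity and value)."""
    e1, a1, v1 = specific
    e2, a2, v2 = general
    return e1 == e2 and v1 == v2 and a2 in IMPLIES.get(a1, set())


def repetition(story: list[Fact]) -> int:
    """Number of facts in the story that some other fact in the story subsumes."""
    return sum(1 for f in story if any(subsumes(g, f) for g in story))


def without_repetition(story: list[Fact]) -> list[Fact]:
    """Drop subsumed facts, keeping the original order of the rest."""
    return [f for f in story if not any(subsumes(g, f) for g in story)]


story = [
    ("Pentheus", "parent", "Agave"),
    ("Pentheus", "mother", "Agave"),
    ("Pentheus", "occupation", "king"),
]
```

In this story the parent fact is the only one counted as repetitive, since the mother fact about the same entity and value subsumes it.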
