
Extracting and Delivering Stories from Heterogeneous Information Sources

V.S. Subrahmanian, M. Fayzullin (University of Maryland); M. Albanese, C. Cesarano, A. Picariello (Univ. of Napoli, Italy).



  1. Extracting and Delivering Stories from Heterogeneous Information Sources V.S. Subrahmanian, M. Fayzullin University of Maryland M. Albanese, C. Cesarano, A. Picariello Univ. of Napoli, Italy

  2. Talk Outline • Motivating examples • STORY Architecture • Theoretical Model • Algorithms • OptStory • DynStory • GenStory • Experimental results JIKD

  3. Motivating example: Pakistani Nuclear Scientists • Nuclear proliferation is the issue of the day. • It involves a complex web of: • Nuclear scientists • Personnel at weapons locations • Arms dealers • Customs officials • Shipping companies • Front companies • Manufacturers • … • Nuclear monitors may want the “story” on any person, place, or event to decide whether further investigation is warranted. • Huge amounts of data need to be processed and filtered so that only the relevant data is shown to the analyst.

  4. Motivating example: US Immigration • A customs official sees a traveller and wants the quick story on him: • Where does he work? • Who does he work for? • What is his area of expertise? • Any warrants? • Is he on a watch list? • Who are his associates, and is anyone among them suspicious? • Just the right data should be presented to the official.

  5. Motivating example: Pompeii • Pompeii is a spectacular archaeological site. • Visitor experience can be greatly improved by: • Automatically notifying visitors of interesting phenomena without posting extra signs. • Allowing visitors to explore the stories of various monuments, paintings, sculptures, etc. in Pompeii. • Allowing visitors to explore the stories of the characters, events and places depicted in these monuments, paintings, sculptures, etc. • Visitors' interests vary, so information about exhibits must adapt in real time to each visitor's interests to enhance the experience.

  6. Pompeii Visitors • The visitor arrives at the ticket counter and buys a ticket.

  7. Pompeii Visitors • ANALOG: Soldier in Baghdad sets out on a mission.

  8. Pompeii Visitors • The ticket agent asks whether they would like to use the story facility, and whether they would like to use their cell phone and/or PDA to get stories of interest to them.

  9. Pompeii Visitors • ANALOG: Soldier in Baghdad chooses to receive stories on his radio or PDA.

  10. Pompeii Visitors • As the visitor walks through Pompeii, STORY identifies where he is and predicts (probabilistically) where he might go next. • E.g. if he is at location L, it might predict that he will go to the House of the Vettii.

  11. Pompeii Visitors • ANALOG: As the soldier drives through Baghdad, STORY identifies where he is and correlates where he will go with his route plan.

  12. Pompeii Visitors • [Figure: map labeled “See items” and “You are here (Triclinium in the House of the Vettii)”] • Based on this prediction of where he might go next, STORY identifies stories he might be interested in and downloads parts of these stories to his PDA/cell phone. • E.g. it might download stories about Pentheus.

  13. Pompeii Visitors • ANALOG: STORY finds stories satisfying the soldier’s conditions of interest and downloads them to his PDA or to the nearest radio broadcast location.

  14. Pompeii Visitors • The visitor chooses which story he is interested in. • STORY dynamically generates the story and delivers it to the user’s PDA/cell phone, e.g. the user might choose the story of Pentheus.

  15. Pompeii Visitors • ANALOG: STORY delivers the story to the soldier. He can then interact further with the story, if needed, using voice and cursor prompts.

  16. Pompeii Visitors • The user can choose to explore the story in greater detail (e.g. if he is viewing the story of Pentheus, he can also explore the story of Agave).

  17. Stories depend upon context • The concept of a story is dramatically different for the examples mentioned earlier. • The Pompeii visitor cares about mythological, historical, and artistic facts. • The soldier in Baghdad cares about security and mission-related facts: who the people around him are, not who is depicted on the walls. • The nuclear analyst cares about the nuclear networks: who is selling what to whom? Who is moving the money? What front companies are involved? • What goes into a story depends not only on basic facts about the entity of interest but also on the application domain and the specific items of interest to the user.

  18. STORY Architecture

  19. RDF Triples • Consist of 3 parts: • An entity • An attribute • A value • STORY also allows: • time-stamped values • attributes to have set-valued types • Examples: • Attribute: mother, Value: Agave • Attribute: cartag, Value: AMD 124 • Attribute: employers, Value = {ibm, hp}

  20. RDF Triples: Time Varying Attribute (TVA) • Example: Attribute: job, Value = {(cardinal, 1500, 1509), (pope, 1510, 1545)} • Example: Attribute: worked-for, Value = {(ibm, 1990, 1998), (hp, 1999, 2004)}
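
A minimal Python sketch of how such triples and time-varying values might be represented. The class names, the entity name `person_1`, and the choice to treat both year bounds as inclusive are assumptions made for illustration; the slides do not specify them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimedValue:
    """One entry of a time-varying attribute: (value, start, end)."""
    value: str
    start: int  # assumed inclusive
    end: int    # assumed inclusive


@dataclass(frozen=True)
class Triple:
    """An (entity, attribute, value) triple; value may be a string or a frozenset."""
    entity: str
    attribute: str
    value: object


def value_at(triple: Triple, year: int) -> Optional[str]:
    """Return the value of a time-varying attribute that holds in `year`, if any."""
    if not isinstance(triple.value, frozenset):
        return None
    for tv in triple.value:
        if isinstance(tv, TimedValue) and tv.start <= year <= tv.end:
            return tv.value
    return None


# "person_1" is a hypothetical entity; the slide gives only attribute and value.
job = Triple(
    "person_1",
    "job",
    frozenset({TimedValue("cardinal", 1500, 1509), TimedValue("pope", 1510, 1545)}),
)
print(value_at(job, 1505))
```

Storing the timed values in a frozenset keeps the triple hashable, so triples can themselves be collected in sets.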

  21. Story Schema • Unlike DBs, there is no need to declare a schema in advance. • A story schema is a pair (E,A), where E is a set of entities and A is a set of attributes. • Examples: • Set of entities in Pompeii: • Set of all objects in Pompeii • Set of all objects and events depicted • Any entities related to the previous categories • Set of all people/organizations associated with Iraqi cars: • Set of all car ids • Set of owners of such cars • Set of people associated with such owners via one or more links

  22. Story Instance • Not all attribute values are needed for all entities. • An instance w.r.t. story schema (E,A) is a partial mapping. • Input: • an entity e of E and an attribute a of A • Output: • a value v in dom(a) if a is an ordinary attribute, or • a timed value (a set of (value, start, end) triples) if a is a TVA
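
The idea of an instance as a partial mapping can be sketched with a dictionary keyed by (entity, attribute) pairs. The names `E`, `A`, `instance`, and `lookup` below are illustrative assumptions, not part of STORY itself.

```python
from typing import Dict, Optional, Set, Tuple

# Story schema (E, A): a set of entities and a set of attributes.
E: Set[str] = {"Pentheus", "Agave"}
A: Set[str] = {"mother", "occupation"}

# The instance is partial: only some (entity, attribute) pairs have a value.
instance: Dict[Tuple[str, str], object] = {
    ("Pentheus", "mother"): "Agave",
}


def lookup(entity: str, attribute: str) -> Optional[object]:
    """Return the value for (entity, attribute), or None where the mapping is undefined."""
    if entity not in E or attribute not in A:
        raise KeyError(f"({entity}, {attribute}) is outside the schema")
    return instance.get((entity, attribute))


print(lookup("Pentheus", "mother"))
print(lookup("Agave", "mother"))
```

Returning `None` for undefined pairs models partiality, while pairs outside the schema are rejected outright.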

  23. Extracting RDF from text • Text needs to be parsed in order to understand its structure before extracting RDF triples. • Context-free grammars are used to parse the text. • A set of template-based rules extracts triples from the parsed text. • Rules can be derived from examples.

  24. Generating rules from examples • Start from an example sentence: “Rome is the capital of Italy”. • Syntactic parsing. • Manually mark nodes corresponding to entities, attributes and values. • Add alternatives for constant tokens (e.g. of | in). • Validate and define extraction patterns (see next slide).

  25. Generating rules from examples • Each extraction pattern defines which marked node acts as the entity, which one as the attribute, and which one as the value.

  26. Generating rules from examples • The same node may act as the entity w.r.t. one extraction pattern and as the value w.r.t. another extraction pattern.

  27. Triples extraction • Each sentence is parsed, generating one or more parse trees. • Each parse tree is matched against the parse tree that represents an extraction rule, using a tree matching algorithm. • If the match succeeds, the pieces of information corresponding to the marked template nodes are extracted, and triples are built according to the extraction patterns. • Probabilistic tree matching algorithms are in progress.
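
The matching step can be illustrated with a small exact (non-probabilistic) tree matcher in Python. This is only a sketch: the `Node` and `TNode` classes, the parse tree chosen for “Rome is the capital of Italy”, and the role assignment are assumptions for illustration and are not taken from the STORY implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """Parse-tree node: a syntactic category, or a token when it has no children."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class TNode:
    """Template node (hypothetical). `label` may list alternatives, e.g. "of|in"."""
    label: str
    role: Optional[str] = None                # "E", "A" or "V" for marked nodes
    children: Optional[List["TNode"]] = None  # None means: accept any subtree


def text(n: Node) -> str:
    """Concatenate the tokens under a node."""
    return n.label if not n.children else " ".join(text(c) for c in n.children)


def match(t: TNode, n: Node, out: Dict[str, str]) -> bool:
    """Exact top-down match; records the text of marked nodes in `out`."""
    if n.label not in t.label.split("|"):
        return False
    if t.children is not None:
        if len(t.children) != len(n.children):
            return False
        if not all(match(tc, nc, out) for tc, nc in zip(t.children, n.children)):
            return False
    if t.role is not None:
        out[t.role] = text(n)
    return True


def extract(t: TNode, n: Node) -> Optional[Tuple[str, str, str]]:
    """Return (entity, attribute, value) if the template matches, else None."""
    out: Dict[str, str] = {}
    if match(t, n, out) and all(r in out for r in ("E", "A", "V")):
        return out["E"], out["A"], out["V"]
    return None


# One assumed parse of "Rome is the capital of Italy".
tree = Node("S", [
    Node("NP", [Node("Rome")]),
    Node("VP", [
        Node("V", [Node("is")]),
        Node("NP", [
            Node("NP", [Node("the"), Node("capital")]),
            Node("PP", [Node("P", [Node("of")]), Node("NP", [Node("Italy")])]),
        ]),
    ]),
])

# Template with marked nodes and an alternative for the constant token "of".
template = TNode("S", children=[
    TNode("NP", role="V"),
    TNode("VP", children=[
        TNode("V", children=[TNode("is")]),
        TNode("NP", children=[
            TNode("NP", role="A"),
            TNode("PP", children=[
                TNode("P", children=[TNode("of|in")]),
                TNode("NP", role="E"),
            ]),
        ]),
    ]),
])

print(extract(template, tree))
```

A sentence with several parse trees would simply be run through `extract` once per tree, keeping the triples from the trees that match.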

  28. Example “Iran is one of the most dangerous enemies of the United States”

  29. Example “Iran is one of the most dangerous enemies of the United States” • Allows 4 different interpretations, corresponding to different parse trees. • All 4 parse trees match the template. • 2 of them allow us to extract the triple: • E=“the most dangerous enemies of the United States” • A=“one” • V=“Iran” • 2 of them allow us to extract the triple: • E=“the United States” • A=“one of the most dangerous enemies” • V=“Iran”

  30. Example “Hu Jintao is the most popular leader in China”

  31. Example “Hu Jintao is the most popular leader in China” • Allows 2 different interpretations, corresponding to different parse trees. • The first parse tree doesn’t match the template. • The second parse tree matches the template and allows us to extract the triple: • E=“China” • A=“the most popular leader” • V=“Hu Jintao”

  32. How the system works • The story application developer first specifies a set of data sources that are to be accessed, e.g. • the WWW • a relational database • an object-oriented database • a database of web documents • a set of URLs • some combination of the above. • The STORY crawler extracts a full instance: • the set of triples obtained from all sources specified by the user. • Full instances don’t resolve inconsistencies, generalize data, etc. • Stories are then created on demand from the full instance, using appropriate conflict resolution, generalization, and other modules.

  33. XML sources • Consider an XML node N = <name, value, {c1, …, cn}>, where {c1, …, cn} are its children nodes. • Assume that N is the root node of an XML document, and that nodes may act both as entities and as attributes. • e is an entity. • A is an attribute.

  34. GetXMLAttr(N,e,A)
      begin
        Result := ∅
        if N.value = e or N.name = e then
          for each child c of N such that c.name = A do
            Result := Result ∪ {c.value}
          end for
        else
          for each child c of N do
            Result := Result ∪ GetXMLAttr(c,e,A)
          end for
        end if
        return Result
      end
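
The GetXMLAttr pseudocode translates almost line for line into Python. The `XMLNode` class and the sample document below are assumptions for illustration; the recursion itself follows the slide.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class XMLNode:
    """An XML node <name, value, {c1, ..., cn}> as described in the slides."""
    name: str
    value: str = ""
    children: List["XMLNode"] = field(default_factory=list)


def get_xml_attr(n: XMLNode, e: str, a: str) -> Set[str]:
    """Collect the values of attribute `a` for entity `e` in the subtree rooted at `n`."""
    result: Set[str] = set()
    if n.value == e or n.name == e:
        # n represents the entity: its children named `a` carry the values.
        for c in n.children:
            if c.name == a:
                result.add(c.value)
    else:
        # Otherwise keep searching for the entity further down the tree.
        for c in n.children:
            result |= get_xml_attr(c, e, a)
    return result


# Hypothetical sample document.
root = XMLNode("people", "", [
    XMLNode("person", "Pentheus", [
        XMLNode("mother", "Agave"),
        XMLNode("father", "Cadmus"),
    ]),
])
print(get_xml_attr(root, "Pentheus", "mother"))
```

As in the pseudocode, the search stops descending once a node matching the entity is found, so nested occurrences of the same entity below that node are not visited.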

  35. CPR • There are good stories and bad stories. • The STORY architecture supports the goals of succinctness and exploration, and creates stories with respect to three important parameters: • the continuity of the story, • the priority of the story content, • the non-repetition of facts covered by the story. • We want to deliver the most important facts to the intended audience. • So far, we have focused primarily on priority and non-repetition, worrying less about continuity.

  36. CPR examples • In the story of Pentheus, it makes more sense to first say that his parents were Cadmus and Agave, then say he reigned as King of Thebes, and then explain why he was killed. • This rendering of the story is in chronological order, ensuring a kind of temporal continuity. • Other measures of continuity are also possible within the STORY framework. • A repetition function evaluates how much repetition there is in a given story. • For example, in the case of Pentheus, we may extract the fact that Agave is a parent of Pentheus, and that Agave is the mother of Pentheus. Including both these facts in a story is repetitive, as the latter fact subsumes the former.

  37. Story evaluation function • $\mathrm{eval}(S) = w_p \cdot \mathrm{priority}(S) + w_c \cdot \mathrm{continuity}(S) - w_r \cdot \mathrm{repetition}(S)$, where $w_p$, $w_c$, $w_r$ are weights. • priority, continuity, and repetition are arbitrary functions from the set of all possible stories about some entities to [0,1]. • priority describes whether high-priority facts are included in the story. • For example, the fact that Pentheus' mother was Agave is more important than the length of Pentheus' big toe. • continuity describes how continuous the story is. • This means that a story should not jump wildly from one fact to another. • repetition describes repetition. • Clearly, stories that repeat the same or similar facts over and over again leave much to be desired.
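
As a rough illustration, the evaluation function can be written as a weighted combination of three pluggable functions. Everything concrete below is an assumption: the weight parameters, the toy priority table, the constant continuity function, and the duplicate-based repetition measure are placeholders, since the slides leave these functions open.

```python
from typing import Callable, Dict, Sequence, Tuple

Triple = Tuple[str, str, str]
Story = Sequence[Triple]
StoryFn = Callable[[Story], float]


def make_eval(priority: StoryFn, continuity: StoryFn, repetition: StoryFn,
              w_p: float = 1.0, w_c: float = 1.0, w_r: float = 1.0) -> StoryFn:
    """Reward priority and continuity, penalize repetition."""
    def eval_story(story: Story) -> float:
        return w_p * priority(story) + w_c * continuity(story) - w_r * repetition(story)
    return eval_story


# Toy attribute priorities (hypothetical values in [0, 1]).
ATTR_PRIORITY: Dict[str, float] = {"mother": 1.0, "toe_length": 0.0}


def avg_priority(story: Story) -> float:
    """Mean priority of the attributes mentioned in the story."""
    if not story:
        return 0.0
    return sum(ATTR_PRIORITY.get(a, 0.0) for _, a, _ in story) / len(story)


def no_continuity(story: Story) -> float:
    """Placeholder: continuity is ignored in this sketch."""
    return 0.0


def duplicate_fraction(story: Story) -> float:
    """Fraction of triples that are exact repeats of another triple in the story."""
    if not story:
        return 0.0
    return 1.0 - len(set(story)) / len(story)


eval_story = make_eval(avg_priority, no_continuity, duplicate_fraction)
fact = ("Pentheus", "mother", "Agave")
print(eval_story([fact]), eval_story([fact, fact]))
```

Passing the three functions in as arguments mirrors the claim that the story creation algorithms work with any continuity, priority, and repetition functions.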

  38. CPR functions • There are many ways of defining how continuous a story is, how repetitive a story is, etc. • Our story creation algorithms can work with any continuity, priority, and repetition functions.

  39. Attribute Hierarchy • The attributes of interest are arranged in an attribute hierarchy, where attributes can be labeled with priorities. • The story application developer can browse and edit this hierarchy (for example, to add new attributes). • He can add priorities to selected items in the hierarchy; all sub-elements of a given element inherit the priority value of the parent unless otherwise stated.
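
Priority inheritance in such a hierarchy can be sketched by walking from an attribute up to its ancestors until an explicitly assigned priority is found. The hierarchy, the priority values, and the default of 0.0 below are hypothetical; the sketch also assumes the hierarchy is a tree with no cycles.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: child attribute -> parent attribute (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "family": None,
    "parents": "family",
    "mother": "parents",
    "father": "parents",
}

# Priorities set explicitly by the developer; everything else is inherited.
EXPLICIT: Dict[str, float] = {"family": 0.5, "mother": 0.9}


def priority_of(attr: str, default: float = 0.0) -> float:
    """Return the attribute's own priority, or that of its nearest labeled ancestor."""
    node: Optional[str] = attr
    while node is not None:
        if node in EXPLICIT:
            return EXPLICIT[node]
        node = PARENT.get(node)
    return default


print(priority_of("mother"), priority_of("father"))
```

Here `mother` keeps its own value, while `father` has none and inherits the value assigned to `family` through `parents`.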


  41. Conflict Management • As multiple data sources may be used to extract attributes, conflicts might occur. • For example, one source may say that Pentheus' mother is Agave, while another may say it is Hera. • STORY allows conflict resolution with an application-specific method. • Conflicts do not always need to be resolved. Sometimes it is enough to report the existence of a conflict and to specify what should be reported.

  42. Conflict Management Policy • Temporal conflict resolution. • Suppose different data sources provide different values v1, …, vn, and value vi was inserted into its data source at time ti. In this case, we pick the value vi such that ti = max{t1, t2, …, tn}. If several values qualify, one is selected randomly. • Source-based conflict resolution. • The developer of a story may assign a credibility ci to each source si that provides a value vi for attribute A of entity e. This strategy picks the value vi such that ci = max{c1, …, cn}. If several values qualify, one is selected randomly. • Voting-based conflict resolution. • Each value vi returned by at least one data source has a vote equal to the number of sources that return vi. This strategy returns the value with the highest vote. If several values share the highest vote, one is picked randomly and returned.
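
The three policies above can be sketched in a few lines of Python. The `Report` record and the function names are assumptions made for illustration; ties are broken randomly, as the slide describes.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Report:
    """One source's claim about an attribute value (hypothetical record type)."""
    value: str
    source: str
    inserted_at: int    # time t_i at which the value was inserted into the source
    credibility: float  # credibility c_i assigned to the source by the developer


def _pick(candidates: Iterable[str], rng=random) -> str:
    """Break ties by choosing one candidate at random."""
    return rng.choice(sorted(candidates))


def resolve_temporal(reports: List[Report], rng=random) -> str:
    """Pick the most recently inserted value."""
    latest = max(r.inserted_at for r in reports)
    return _pick({r.value for r in reports if r.inserted_at == latest}, rng)


def resolve_by_source(reports: List[Report], rng=random) -> str:
    """Pick the value reported by the most credible source."""
    best = max(r.credibility for r in reports)
    return _pick({r.value for r in reports if r.credibility == best}, rng)


def resolve_by_vote(reports: List[Report], rng=random) -> str:
    """Pick the value returned by the largest number of distinct sources."""
    voters: Dict[str, Set[str]] = defaultdict(set)
    for r in reports:
        voters[r.value].add(r.source)
    top = max(len(s) for s in voters.values())
    return _pick({v for v, s in voters.items() if len(s) == top}, rng)


reports = [
    Report("Agave", "source_1", inserted_at=10, credibility=0.9),
    Report("Hera", "source_2", inserted_at=20, credibility=0.4),
    Report("Agave", "source_3", inserted_at=5, credibility=0.7),
]
print(resolve_temporal(reports), resolve_by_source(reports), resolve_by_vote(reports))
```

On this sample the temporal policy disagrees with the other two, which shows why the choice of policy is left to the application.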

  43. Generalization Module • Goal: to generalize multiple RDF triples into one. • For example, if we know that Pentheus's father is Cadmus and his mother is Agave, we may want to generalize this to say that Pentheus's parents are Cadmus and Agave. • If Pentheus was king of one town for some period, king of another town for another period, and so on, we may merely want to say that Pentheus was king of many places. • The Generalization Module looks at the RDF triples stored in the RDF database and augments the database with triples that include generalization attributes, which succinctly summarize a set of less general (i.e. more specific) attributes.

  44. Generalized Story Schema • A generalized story schema consists of a regular story schema, a function that associates an equivalence relation with each attribute domain, and a function that associates a generalization function with each attribute domain. • An equivalence relation on the domain dom(A) of attribute A specifies when certain values in the domain are considered equivalent. For example, we may consider the string values “king” and “monarch” to be equivalent in dom(occupation). • For a time varying attribute, we may consider (“king”, L, U) and (“monarch”, L', U') to be equivalent regardless of whether L = L' and U = U' hold. • Our system uses WordNet and some heuristics to infer equivalence relationships between terms. • Generalization is currently being plugged into the system.

  45. STORY creation • Goal: construct a story of length k or less from the RDF database. • The algorithm examines all triples in the RDF database for the entity of interest, • including triples extracted from the data sources by the attribute extractor as well as triples created by the generalization module. • It then finds the k triples that optimize an objective function. • The objective function must be monotonic in the priority of the triples, monotonic w.r.t. the continuity function selected by the STORY application developer, and anti-monotonic in the amount of repetition between triples.

  46. Closed Instance • We first compute the full instance associated with our source access table. • For each entity e and attribute A, we then split the values of A for e in this instance into equivalence classes, using the equivalence relation associated with A. • Suppose the equivalence classes thus generated are X1, …, Xn. • For each equivalence class Xi, we compute the generalization vi using the generalization function associated with attribute A, and insert the tuple (e, A, vi) into the full instance. • This process is repeated for all entities e and all attributes A. • After all such tuples have been inserted into the full instance, it becomes the closed instance.
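
The closure steps can be sketched as follows. The function names, the per-attribute dictionaries of equivalence and generalization functions, and the sample data (a hypothetical entity `ruler_x` and towns `town_a`, `town_b`) are all assumptions; the sketch also assumes each supplied relation really is an equivalence relation, so comparing against one representative per class is enough.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def equivalence_classes(values: List[str],
                        equivalent: Callable[[str, str], bool]) -> List[List[str]]:
    """Partition values, comparing each one against a representative of each class."""
    classes: List[List[str]] = []
    for v in values:
        for cls in classes:
            if equivalent(v, cls[0]):
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes


def close_instance(full: Iterable[Triple],
                   equivalent: Dict[str, Callable[[str, str], bool]],
                   generalize: Dict[str, Callable[[List[str]], str]]) -> Set[Triple]:
    """Add one generalized triple (e, A, v_i) per equivalence class X_i."""
    by_key: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for e, a, v in full:
        if v not in by_key[(e, a)]:
            by_key[(e, a)].append(v)
    closed = set(full)
    for (e, a), values in by_key.items():
        same = equivalent.get(a, lambda x, y: x == y)  # default: plain equality
        gen = generalize.get(a, lambda cls: cls[0])     # default: keep the value
        for cls in equivalence_classes(values, same):
            closed.add((e, a, gen(cls)))
    return closed


full = [
    ("ruler_x", "king_of", "town_a"),
    ("ruler_x", "king_of", "town_b"),
    ("ruler_x", "mother", "person_y"),
]
closed = close_instance(
    full,
    equivalent={"king_of": lambda x, y: True},  # treat all towns as equivalent
    generalize={"king_of": lambda cls: "many places" if len(cls) > 1 else cls[0]},
)
print(sorted(closed))
```

The original triples are kept alongside the generalized ones, so a later story creation step can choose between the specific facts and their summary.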

  47. Story Computation Problem • Given a closed instance I, a positive integer k, and an entity e as input, find a story of size ≤ k that maximizes the value of a given evaluation function eval. • A story found in this way is called an optimal story. • Theorem: Finding an optimal story is NP-hard (even after the full instance is created).
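
To make the problem statement concrete, here is a naive exhaustive baseline in Python that tries every set of at most k triples about the entity. It is not the authors' OptStory algorithm, only a sketch of the problem itself; the toy evaluation function, its priority values, and the treatment of a story as an unordered set of triples are assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def optimal_story(instance: Sequence[Triple], entity: str, k: int,
                  eval_fn: Callable[[Sequence[Triple]], float]
                  ) -> Tuple[Tuple[Triple, ...], float]:
    """Exhaustively search all stories of size <= k about `entity`."""
    facts: List[Triple] = [t for t in instance if t[0] == entity]
    best: Tuple[Triple, ...] = ()
    best_score = eval_fn(best)
    for size in range(1, k + 1):
        for cand in combinations(facts, size):
            score = eval_fn(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score


# Toy evaluation: sum of attribute priorities minus a penalty for each pair
# of triples that repeats the same value (hypothetical numbers).
PRIORITY: Dict[str, float] = {"mother": 1.0, "parent": 0.75, "occupation": 0.5}


def toy_eval(story: Sequence[Triple]) -> float:
    gain = sum(PRIORITY.get(a, 0.0) for _, a, _ in story)
    repeats = sum(1 for x, y in combinations(story, 2) if x[2] == y[2])
    return gain - 0.5 * repeats


instance = [
    ("Pentheus", "mother", "Agave"),
    ("Pentheus", "parent", "Agave"),
    ("Pentheus", "occupation", "King of Thebes"),
    ("Agave", "occupation", "unknown"),
]
story, score = optimal_story(instance, "Pentheus", 2, toy_eval)
print(story, score)
```

The number of candidate stories grows combinatorially with the number of facts and with k, which is consistent with the need for faster heuristic algorithms.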

  48. Story Algorithms • OptSTORY algorithm: finds the story that optimizes the objective function. • This algorithm has the disadvantage of being very slow. • Multiple alternative BestSTORY algorithms: • DynStory(S), which uses a dynamic programming approach. • GenStory(S), which is based on genetic programming. • DynStory and GenStory find suboptimal stories, but do so very fast.

  49. GPS Support Subsystem: Current implementation • Outdoor positioning at Pompeii is implemented using DGPS. • Mobile devices are equipped with IEEE 802.11b wireless Ethernet to allow an Internet connection.

  50. GIS Support Subsystem: Outdoor and indoor positioning • Outdoor positioning • GPS has been successfully adopted in many applications. • Indoor positioning • GPS receivers are blind in indoor spaces. • Different kinds of positioning systems will be used: • Infrared or ultrasound sensors • Radio frequency sensors • WLAN-based positioning • We have methods to optimally position a set of sensors to monitor the site, but the system is not yet implemented.
