Web Data Warehouse and Change Management *

This research paper discusses the challenges of change management in web data and proposes a web data warehousing system for detecting and representing web deltas. The paper presents the WHOWEDA project as a solution to effective information extraction, management, and processing on the World Wide Web.

Presentation Transcript


  1. Web Data Warehouse and Change Management* Sanjay Kumar Madria Department of Computer Science University of Missouri-Rolla madrias@umr.edu Based on IEEE ICDCS’00 and IEEE TKDE (under minor revision)

  2. Current Situation of W3 • The Web allows information to change at any time and in any way • Two forms of changes • Existence • Structure and content modification • A changed document replaces its antecedent and leaves no trace of the previous document

  3. Problems of Change Management • Problems: • Detecting, representing and querying these changes • The problem is challenging • Typical database approaches that detect changes through triggering mechanisms are not usable • No access rights, no support for triggers • Information sources typically do not keep track of historical information in a format that is accessible to the outside user

  4. Motivating Example • Assume that there is a web site at www.panacea.gov • Provides information related to drugs used for various diseases • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) • information related to side effects and uses of drugs used for various diseases and • changes to this information at the page level compared to its previous version

  5. Structure of www.panacea.gov • www.panacea.gov contains a list of diseases • Each link of a particular disease points to a web page containing a list of drugs used for prevention and cure of the disease • Hyperlinks associated with each drug point to documents containing a list of various issues related to a particular drug (description, manufacturers, clinical pharmacology, uses, side effects, etc.) • From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug

  6. A Snapshot as on 15th Jan (figure; labels only) • Diseases: AIDS, Cancer, Heart disease, Alzheimer’s Disease, Diabetes, Impotence • Drugs: Indavir, Ritonavir, Hirudin, Niacin, Ibuprofen, Vasomax, Caverject • Leaf pages: Side effects, Uses

  7. Some Changes • 25th January • Links related to Diabetes are removed • New link containing information related to Parkinson’s Disease is added • Information related to issues, side effects and uses of various drugs for Cancer is also modified

  8. A Partial Snapshot as on 25th Jan Side effects Tolcapone Parkinson’s Disease Uses update Cancer New Link www.panacea.gov Diabetes Side effects

  9. Some Changes • 30th January • Links related to Impotence are modified • Previously provided by www.pfizer.com • Now by www.panacea.gov • Inter-linked structure of the Web pages related to Caverject is also modified • Information about Viagra, a new drug for Impotence, is added

  10. A Partial Snapshot as on 30th Jan Side effects www.panacea.gov Uses Caverject Impotence Side effects Vasomax Viagra Uses

  11. Some Changes • 8th February • Link structure of Heart Disease is modified • Label Heart Disease is modified to Heart Disorder • Content of the pages dealing with side effects and uses of Hirudin is updated • Inter-linked document structure of Niacin is modified • Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed

  12. On 8th February www.panacea.gov Heart disorder Alzheimer’s Disease Side effects Hirudin Uses Niacin Side effects

  13. A Snapshot as on 15th Feb (figure; labels only) • Diseases: AIDS, Alzheimer’s Disease, Cancer, Heart disease, Parkinson’s Disease, Impotence • Drugs: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject • Leaf pages: Side effects, Uses

  14. Types of Changes • Insert Node • Delete Node • Update Node • Insert Link – same as either Insert node or update node • Delete Link – same as either delete node or update node • Update link – same as update node

  15. Objectives • Web deltas - Changes to web information • Detecting and representing relevant page-level web deltas • changes that are relevant to user’s query, not any arbitrary changes or web deltas • Restricted to page level • Detect those documents • which are added to the site • deleted from the site • those documents which have undergone content or structural modification • How these delta documents are related to one another and with other documents relevant to the user’s query

  16. The WHOWEDA* Project • WHOWEDA: A WareHouse of WEb DAta • To design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web • Data model: WHOM (WareHouse Object Model) • * Journal papers in WWW journal’00, DKE’01, TKDE (under review), Computer Journal’00 • * Conference papers in ICDCS’00, ICPADS’00, DOLAP’00, DASFAA’99, DAWAK’99, FODO’98, ER’98, DEXA’98, … • * Under submission: DKE, DPDS, CJ, ER, … • * Related activities: special issues of journals, ECWEB’00 and ’01 • * Many graduate and undergraduate students have worked on the project

  17. Related Work • Lore – change management (SIGMOD’97 and ICDE’98) • Contrast • OEM based, not applied on Web • WebCQ (Georgia Tech) • Needs a set of URLs • No interdocument changes • Htmldiff • Input – two versions • Output – marked-up copy highlighting changes • Contrast • Difficult to browse in case of a big file • Ours is based on a query, not arbitrary changes

  18. Change Mgmt in DBMS • Two Approaches • Snapshot collection at times t1, t2,….. • Snapshot deltas, D and Ds at time t1, t2,….. • Contrast – we use snapshot delta approach, but with semi-structured data

  19. Overview of our approach • Step 1: Two snapshots of old and new relevant data are coupled from the Web using the global web coupling operation and materialized in two web tables. • Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables • Result is the joined, left outer joined and right outer joined web tables • Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables.
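
The idea of comparing two snapshots can be sketched in Python. This is a simplified illustration and not the WHOWEDA implementation: each snapshot is reduced to a dict that maps a URL to its last modification date (an assumption; the web tables in the slides hold graph-shaped web tuples), and the names `PageDeltas` and `page_deltas` are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class PageDeltas(NamedTuple):
    inserted: Set[str]   # URLs present only in the new snapshot
    deleted: Set[str]    # URLs present only in the old snapshot
    modified: Set[str]   # URLs in both snapshots with a different date
    unchanged: Set[str]  # URLs in both snapshots with the same date


def page_deltas(old: Dict[str, str], new: Dict[str, str]) -> PageDeltas:
    """Classify pages by comparing two snapshots (URL -> last-modified)."""
    common = old.keys() & new.keys()
    unchanged = {url for url in common if old[url] == new[url]}
    return PageDeltas(
        inserted=set(new.keys() - old.keys()),
        deleted=set(old.keys() - new.keys()),
        modified=set(common) - unchanged,
        unchanged=unchanged,
    )
```

In the slides the same classification is obtained through web join and outer web join on web tables rather than through a direct dictionary comparison.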

  20. Overview of WHOM • Web warehouse : collection of web tables • Set of web tuples and a set of web schemas represents a web table • Web tuple - directed graph containing nodes and links and satisfies a web schema • Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks • Tree representation • Web algebra containing web operators to manipulate web tables • Global Coupling, Web Select, Web Join etc.
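
A minimal sketch of how the objects named on this slide might be modelled, assuming plain Python dataclasses. The field names (`node_id`, `last_modified`, `label`) and the `is_well_formed` check are illustrative choices, not the WHOM definition, and web schemas are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    node_id: str        # e.g. "a0", as in the web table figures
    url: str
    last_modified: str  # metadata that can be used to compare versions


@dataclass(frozen=True)
class Link:
    link_id: str
    source: str  # node_id of the document containing the hyperlink
    target: str  # node_id of the document the hyperlink points to
    label: str = ""


@dataclass
class WebTuple:
    """A directed graph of nodes and links (schema checks omitted)."""
    nodes: List[Node]
    links: List[Link]

    def is_well_formed(self) -> bool:
        """Every link must connect two nodes that belong to this tuple."""
        ids = {n.node_id for n in self.nodes}
        return all(ln.source in ids and ln.target in ids for ln in self.links)


@dataclass
class WebTable:
    name: str
    tuples: List[WebTuple] = field(default_factory=list)
```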

  21. User WWW Web Query & Display Warehouse Concept Mart Global Web Manipulation Global Web Coupling Global Ranking Pre processing Data Visualization Schema Tightness Web Warehouse Data Visualization Web Union Web Select Web Intersection Web Project Local Web Manipulation Local Web Coupling Schema Tightness Local Ranking Schema Search Web Join Schema Match

  22. Step 1: Retrieving snapshots of Web data using Global Web Coupling

  23. Coupling Query Graph • Directed connected acyclic graph • Consists of nodes, links and keywords imposed on them.

  24. Example • Suppose, on 15th January, a user wishes to find out periodically (every 30 days) from the web site at www.panacea.gov • information related to side effects and uses of drugs used for various diseases • Result of the query is stored in the form of a web table

  25. Pictorial Representation “side effects” d {1, 6} www.panacea.gov a b “drug list” {1, 3} k “uses”

  26. Coupling Query (Formal) • Set of node variables Xn • Each variable represents a set of Web documents • Set of link variables Xl • Each variable represents a set of hyperlinks • Set of connectivities C defined over node and link variables • To specify hyperlink structure of the documents • Set of predicates P defined over some of the node and link variables • Specify metadata, content or structural conditions • Set of coupling query predicates Q • Conditions on execution of the query

  27. Coupling Query • Xn = {a, b, d, k} • Xl = { - } • P = {p1, p2, p3, p4} • p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov” • p2(b) = CONTENT:: b[html.body.title] NON-ATTR-CONT “drug list” • p3(k) = CONTENT:: k[html.body.title] NON-ATTR-CONT “uses” • p4(d) = CONTENT:: d[html.body.title] NON-ATTR-CONT “side effects”

  28. Coupling Query • C = k1 AND k2 AND k3 • k1 = a < - > b • k2 = b < -{1, 6} > d • k3 = b < -{1, 3} > k • Q = {q1} • q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
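
The example query from slides 27 and 28 can be written down as plain data, which makes the five components (Xn, Xl, C, P, Q) easy to inspect. The sketch below assumes a hypothetical `CouplingQuery` container; the bracketed pairs such as {1, 6} are stored as given because the slides do not spell out their meaning, and `undeclared_vars` is an illustrative sanity check rather than part of the formalism.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Connectivity:
    source: str                            # node variable on the left
    target: str                            # node variable on the right
    quantifier: Optional[Tuple[int, int]]  # e.g. (1, 6) for "-{1, 6}"; None for "-"


@dataclass
class CouplingQuery:
    node_vars: List[str]                # Xn
    link_vars: List[str]                # Xl
    connectivities: List[Connectivity]  # C, combined with AND
    predicates: Dict[str, str]          # P: node variable -> predicate text
    polling_frequency: str              # from Q

    def undeclared_vars(self) -> List[str]:
        """Return node variables used in C or P that are missing from Xn."""
        used = set(self.predicates)
        for c in self.connectivities:
            used.update((c.source, c.target))
        return sorted(used - set(self.node_vars))


query = CouplingQuery(
    node_vars=["a", "b", "d", "k"],
    link_vars=["-"],
    connectivities=[
        Connectivity("a", "b", None),    # k1
        Connectivity("b", "d", (1, 6)),  # k2
        Connectivity("b", "k", (1, 3)),  # k3
    ],
    predicates={
        "a": 'METADATA:: a[url] EQUALS "www.panacea.gov"',
        "b": 'CONTENT:: b[html.body.title] NON-ATTR-CONT "drug list"',
        "k": 'CONTENT:: k[html.body.title] NON-ATTR-CONT "uses"',
        "d": 'CONTENT:: d[html.body.title] NON-ATTR-CONT "side effects"',
    },
    polling_frequency="30 days",
)
```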

  29. Web Table Drugs (15th Jan) • a0 b0 u0 d0 k0 (AIDS, Indavir) • a0 b0 u1 d1 k1 (AIDS, Ritonavir) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b5 d12 k12 (Alzheimer’s Disease, Ibuprofen)

  30. Web Table Drugs (15th Jan) • a0 b3 d4 k5 (Diabetes, Albuterol) • a0 b4 u4 u5 u6 d5 k6 (Impotence, Vasomax) • a0 b4 u7 d6 u8 k7 (Impotence, Caverject) • a0 b2 u2 d3 k3 (Heart Disease, Hirudin)

  31. Web Table New Drugs (15th Feb) • a0 b0 u0 d0 k0 (AIDS, Indavir) • a0 b0 u1 d1 k1 (AIDS, Ritonavir) • a0 b2 u2 d3 k3 (Heart Disorder, Hirudin) • a0 b1 d2 k2 (Cancer, Beta Carotene)

  32. Web Table New Drugs (15th Feb) • a0 b2 u3 d7 k7 (Heart Disorder, Niacin) • a0 b4 u9 d8 k8 (Impotence, Vasomax) • a0 b6 u10 d10 b6 k10 (Parkinson’s Disease, Tolcapone) • a0 b4 u7 d6 k7 (Impotence, Caverject)

  33. Web Table New Drugs (15th Feb) • a0 b4 u12 d9 k9 (Impotence, Viagra) • a0 b6 u10 d10 b6 k10 (Parkinson’s Disease, Tolcapone)

  34. Step 2: Performing Web Join, Left and Right Outer Web Join

  35. Storage of Web Objects • Warehouse Node pool– distinct nodes, each node has node-id, version-id • warehouse document pool • Web table pool • Table node pool- type identifiers for node and link, node-id, link-id, version-id, URL of the node, target node-id, label, and link type of the link • web tuple pool- ids of all the nodes and links belonging to web tuple • web schema pool – store the web schema and coupling query
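
As a rough illustration of the node pool idea (distinct nodes, each with a node-id and a version-id), the sketch below keys nodes by URL and opens a new version whenever a different last modification date is seen. Both choices are assumptions made for the example; the slide does not state how ids and versions are assigned, and `NodePool` is a hypothetical name.

```python
from typing import Dict, List, Tuple


class NodePool:
    """Stores distinct nodes; each stored node has a node-id and a version-id."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}             # url -> node-id
        self._versions: Dict[int, List[str]] = {}  # node-id -> known dates

    def add(self, url: str, last_modified: str) -> Tuple[int, int]:
        """Return (node_id, version_id), creating a new version only if needed."""
        node_id = self._ids.setdefault(url, len(self._ids))
        versions = self._versions.setdefault(node_id, [])
        if last_modified not in versions:
            versions.append(last_modified)
        return node_id, versions.index(last_modified)
```

Storing the same page twice with the same date returns the same pair of ids, so unchanged documents are not duplicated across snapshots.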

  36. Web Join • Information composition operator • Combines two web tables into a single web table under certain conditions • Combines two web tables by concatenating a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes • Two nodes are joinable if they are identical • Two nodes are identical if the URL and last modification date of the nodes are the same • The joined web tuple is stored in a different web table

  37. Web Join • Join web tables Drugs and New Drugs • Nodes which have not undergone any changes are the joinable nodes in these two web tables. • Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
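
The joinability rule stated above (same URL and same last modification date) can be sketched as follows. This is a deliberately reduced model: a web tuple is treated as a set of nodes and its link structure is ignored, which is an assumption of the example. The algorithms referred to on slides 43 to 45 are not reproduced in this transcript, so `joinable_nodes` and `web_join` are hypothetical helpers rather than those algorithms.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Node(NamedTuple):
    url: str
    last_modified: str


WebTuple = FrozenSet[Node]  # simplification: link structure is ignored


def joinable_nodes(t1: WebTuple, t2: WebTuple) -> FrozenSet[Node]:
    """Nodes are joinable when URL and last modification date both match."""
    return t1 & t2


def web_join(
    left: List[WebTuple], right: List[WebTuple]
) -> List[Tuple[WebTuple, WebTuple]]:
    """Pair each left tuple with each right tuple that shares a joinable node."""
    return [(t1, t2) for t1 in left for t2 in right if joinable_nodes(t1, t2)]
```

A node whose content changed between the two snapshots carries a different date in the new table, so it never appears in the intersection and cannot act as a joinable node.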

  38. a0 b0 u0 d0 Indavir (3) AIDS k0 Ritonavir a0 u1 d1 AIDS k1 Joined web table a0 b0 u0 d0 AIDS Indavir (1) AIDS k0 a0 AIDS a0 b0 d1 u1 Ritonavir (2) AIDS a0 k1

  39. a0 b4 u7 d6 Caverject (5) Impotence u8 k7 a0 b4 u7 Caverject Impotence Joined Web Table a0 b2 u3 d7 Heart Disorder Niacin (4) k4 a0 u2 d3 Heart Disease Hirudin k3

  40. Joined Table a0 b2 u2 d3 Heart Disease Hirudin (6) k3 Hirudin a0 u2 d3 Heart Disorder k3

  41. Types of web tuples • Web tuples in which all the nodes are joinable • Results of joining two versions of web tuples that have remained unchanged during the transition • Web tuples in which • some of the nodes are joinable nodes • remaining nodes are the result of insertion, deletion or modification operations (Figure: joined web tuple (5) from slide 39: Caverject, Impotence)

  42. Types of web tuples • Tuples in which • Some of the nodes are joinable nodes • Out of the remaining nodes some are the result of insertion, deletion or modification and • The remaining ones remained unchanged during the transition (Figure: joined web tuple (3) from slide 38: Indavir, Ritonavir, AIDS)

  43. Algorithm for Computing joinable nodes

  44. Algorithm of web join

  45. Algorithm of web join (continued)

  46. Outer Web Join • Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table • Outer web join enables us to identify them • Left outer web join • Right outer web join
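
A small sketch of how dangling web tuples could be collected, under the same simplifying assumptions as before: a web tuple is a set of (URL, last modification date) pairs, and two tuples join when they share at least one such pair. The function names are hypothetical, and the interpretation that the left result comes from the old table and the right result from the new table follows the Drugs / New Drugs example in the slides.

```python
from typing import FrozenSet, List, Tuple

Node = Tuple[str, str]      # (url, last_modified); hypothetical encoding
WebTuple = FrozenSet[Node]  # simplification: link structure is ignored


def _joins_with_any(t: WebTuple, table: List[WebTuple]) -> bool:
    """True if t shares at least one identical node with some tuple in table."""
    return any(t & other for other in table)


def left_outer_web_join(left: List[WebTuple], right: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the left (old) table."""
    return [t for t in left if not _joins_with_any(t, right)]


def right_outer_web_join(left: List[WebTuple], right: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the right (new) table."""
    return [t for t in right if not _joins_with_any(t, left)]
```

On this reading, dangling tuples from the old table point to deleted or modified documents, and dangling tuples from the new table point to inserted or modified ones.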

  47. Web Table New Drugs (15th Feb) • a0 b0 u0 d0 k0 (AIDS, Indavir) • a0 b0 u1 d1 k1 (AIDS, Ritonavir) • a0 b2 u2 d3 k3 (Heart Disorder, Hirudin) • a0 b1 d2 k2 (Cancer, Beta Carotene)

  48. Web Table New Drugs (15th Feb) • a0 b2 u3 d7 k7 (Heart Disorder, Niacin) • a0 b4 u9 d8 k8 (Impotence, Vasomax) • a0 b4 u7 d6 k7 (Impotence, Caverject)

  49. Web Table New Drugs (15th Feb) • a0 b4 u12 d9 k9 (Impotence, Viagra) • a0 b6 u10 d10 b6 k10 (Parkinson’s Disease, Tolcapone)

  50. Right Outer Web Join • a0 b4 u9 d8 k8 (Impotence, Vasomax) • a0 b1 d2 k2 (Cancer, Beta Carotene) • a0 b4 u12 d9 k9 (Impotence, Viagra) • a0 b6 u10 d10 b6 k10 (Parkinson’s Disease, Tolcapone)
