Efficient Maintenance of Semistructured Schema

Katsaros Dimitrios, Aristotle University of Thessaloniki, Hellas

Introduction (1/3) • Semistructured data • Sources: HTML, BibTeX, SGML, etc. • Characteristics: no rigid structure, but some implicit structure, i.e., a “schema” • Knowledge of the “schema” is crucial for: • Querying/browsing information sources • Building indexes/views • Storage in relational/object-oriented databases • Query processing

Introduction (2/3) Figure 1: Semistructured “movie” objects (an OEM graph whose root “db” points to three Movie objects &1, &2, &3; edge labels include Title, Director, Review, Award, Name, Nationality, and Biography, and not every movie carries every label)

Introduction (3/3) • Discovering the common “schema” • Obstacles: large volume and irregularity of the data • Solution: mining the “schema” • Scalable, and can deal with irregularity • Association-rule-based approach proposed by Wang & Liu [6] • Issue: how to deal with dynamic data?

Motivation: our contributions • Maintenance of the discovered schema under insertions of new objects • Schema for the new objects • Performance evaluation of the method

Presentation Outline • Problem definition • Algorithm’s description • Performance evaluation • Conclusion • References

Object Exchange Model • An Object Exchange Model (OEM) object consists of: • Identifier o (i.e., &o) • Value, which is either: • Atomic (integer, float, string) • Complex • List: l1:&o1, l2:&o2, …, lk:&ok • Bag: {l1:&o1, l2:&o2, …, lk:&ok} where: • li are labels (“roles”) • ? denotes the wild card, matching any label • ⊥ is the nil structure, which contains no label
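The OEM object described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `OEMObject`, the helper `labels_of`, and the sample identifiers and values are hypothetical and do not come from the slides, and only the ordered list form of a complex value is modeled (bags are left out).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[int, float, str]
# A complex value is an ordered list of (label, target identifier) pairs,
# mirroring l1:&o1, l2:&o2, ..., lk:&ok.
Complex = List[Tuple[str, str]]


@dataclass
class OEMObject:
    """Hypothetical container for one OEM object: an identifier plus a value."""
    oid: str
    value: Union[Atomic, Complex]

    def is_atomic(self) -> bool:
        # Anything that is not a list of labeled references counts as atomic.
        return not isinstance(self.value, list)


def build_example() -> Dict[str, OEMObject]:
    """Build a tiny, made-up movie object (not the exact graph of Figure 1)."""
    objects = [
        OEMObject("&1", [("Title", "&11"), ("Director", "&12")]),
        OEMObject("&11", "A hypothetical title"),
        OEMObject("&12", [("Name", "&121"), ("Nationality", "&122")]),
        OEMObject("&121", "A hypothetical name"),
        OEMObject("&122", "A hypothetical nationality"),
    ]
    return {o.oid: o for o in objects}


def labels_of(graph: Dict[str, OEMObject], oid: str) -> List[str]:
    """Return the labels of an object's outgoing edges (empty for atomic objects)."""
    obj = graph[oid]
    return [] if obj.is_atomic() else [label for label, _ in obj.value]


if __name__ == "__main__":
    graph = build_example()
    print(labels_of(graph, "&1"))
    print(labels_of(graph, "&11"))
```

Storing children as identifiers rather than nested objects keeps the structure a graph, so shared or cyclic references remain representable.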

Tree-Expressions Definition • The nil structure is a tree-expression • Let tei be tree-expressions of objects oi. If val(o) = l1:&o1, l2:&o2, …, lk:&ok and i1, i2, …, ir is a subsequence of 1, 2, …, k, then li1:tei1, li2:tei2, …, lir:teir is a tree-expression of object o. Representation • A tree-expression li1:tei1, li2:tei2, …, lir:teir consists of r subtrees teij, each labeled lij.
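The recursive definition above can be turned into a membership check. The sketch below is one possible reading, not the authors' implementation: the function name `is_tree_expression`, the encoding of the nil structure as `None`, and the dictionary-based graph are assumptions made for illustration. It also assumes that nil is a tree-expression of any object, that atomic objects admit only nil, and it ignores the wild card label.

```python
from typing import Dict, List, Optional, Tuple, Union

# Graph: identifier -> atomic value, or ordered list of (label, child identifier).
Value = Union[int, float, str, List[Tuple[str, str]]]
# Tree-expression: None encodes the nil structure; otherwise a list of
# (label, sub-expression) pairs.
TreeExpr = Optional[list]


def is_tree_expression(te: TreeExpr, oid: str, graph: Dict[str, Value]) -> bool:
    """Check whether te is a tree-expression of object oid (sketch)."""
    if te is None:
        return True  # assumption: nil is a tree-expression of every object
    value = graph[oid]
    if not isinstance(value, list):
        return False  # assumption: atomic objects admit only the nil structure
    pos = 0
    for label, sub_te in te:
        # Scan forward for the earliest child that carries this label and
        # whose object admits the sub-expression; order is preserved, so the
        # matched children form a subsequence of the object's children.
        while pos < len(value):
            child_label, child_oid = value[pos]
            pos += 1
            if child_label == label and is_tree_expression(sub_te, child_oid, graph):
                break
        else:
            return False  # ran out of children without a match
    return True


# Made-up example object, not the graph of Figure 1.
GRAPH: Dict[str, Value] = {
    "&1": [("Title", "&11"), ("Director", "&12")],
    "&11": "A hypothetical title",
    "&12": [("Name", "&121"), ("Nationality", "&122")],
    "&121": "A hypothetical name",
    "&122": "A hypothetical nationality",
}

if __name__ == "__main__":
    print(is_tree_expression([("Title", None), ("Director", [("Name", None)])], "&1", GRAPH))
    print(is_tree_expression([("Director", None), ("Title", None)], "&1", GRAPH))
```

Taking the earliest compatible child at each step is safe for subsequence matching, because it leaves the largest possible suffix available for the remaining labels. The recursion terminates even on cyclic graphs, since each call descends one level into the finite expression.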

Incremental Schema Mining Problem definition Input • A collection of transaction objects in an OEM graph, denoted as DB • A minimum support threshold MINSUP • The frequent tree expressions for DB • A number of new objects added into the collection, denoted as db The incremental schema maintenance problem is to discover all tree expressions which have support in DB ∪ db greater than or equal to MINSUP.
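The frequency test behind this problem definition can be illustrated with a little arithmetic. The sketch below assumes that MINSUP is a relative threshold (a fraction of the transaction objects) and that the support count of a tree expression is the number of transaction objects it holds for; the slides do not state either convention, and the function names and numbers are hypothetical.

```python
from fractions import Fraction


def is_frequent(count: int, total: int, minsup: Fraction) -> bool:
    """True if count out of total objects meets the relative threshold minsup."""
    # Fraction keeps the comparison exact and avoids floating-point rounding.
    return count >= minsup * total


def frequent_after_insert(count_DB: int, size_DB: int,
                          count_db: int, size_db: int,
                          minsup: Fraction) -> bool:
    """Decide frequency in DB ∪ db from the counts in DB and db separately."""
    return is_frequent(count_DB + count_db, size_DB + size_db, minsup)


if __name__ == "__main__":
    minsup = Fraction(1, 10)
    # Infrequent in DB (95 < 100) but frequent in db (20 >= 10).
    print(frequent_after_insert(95, 1000, 20, 100, minsup))
    # Frequent in DB (105 >= 100) but rare in db (2 < 10).
    print(frequent_after_insert(105, 1000, 2, 100, minsup))
```

Under the relative-threshold assumption, an expression that is infrequent in both DB and db cannot be frequent in DB ∪ db, so only expressions frequent in at least one of the two parts need their counts combined.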

DeltaSSD • DeltaSSD utilizes Negative Borders Definition [Negative Border] Given a collection S ⊆ P(R) of tree expressions, closed with respect to the “weaker than” relation [6], the negative border Bd⁻(S) of S consists of the minimal tree expressions X ⊆ R not in S.
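To make the negative-border definition concrete, the sketch below computes it in a deliberately simplified setting: plain item sets ordered by set inclusion stand in for tree expressions ordered by the “weaker than” relation. This substitution is an assumption made for illustration only, and the function name `negative_border` and the sample items are hypothetical; it is not the DeltaSSD procedure itself.

```python
from typing import FrozenSet, Iterable, Set


def negative_border(S: Iterable[Iterable[str]], R: Set[str]) -> Set[FrozenSet[str]]:
    """Minimal subsets of R that are not in S, for a downward-closed S (sketch)."""
    # The empty set is treated as a member of S, so that singletons outside S
    # qualify as minimal.
    closed = {frozenset(x) for x in S} | {frozenset()}
    border: Set[FrozenSet[str]] = set()
    for x in closed:
        for item in R - x:
            cand = x | {item}
            if cand in closed:
                continue
            # cand is minimal exactly when every subset obtained by dropping
            # one item is in S (this relies on S being downward closed).
            if all(cand - {i} in closed for i in cand):
                border.add(cand)
    return border


if __name__ == "__main__":
    R = {"a", "b", "c", "d"}
    S = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
    for x in sorted(negative_border(S, R), key=sorted):
        print(sorted(x))
```

Every member of the border extends some member of S by a single item, which is why generating candidates from S one item at a time is sufficient here.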

DeltaSSD (notation)

DeltaSSD

Experimental settings Generation of synthetic data • One dataset: • (L1, N1) = (25, 1000) • (L2, N2, T2, I2, P2) = (25, 1000, 4, 2, 50) • (N3, T3, I3, P3) = (3000, 4, 2, 50) • Relatively small database: 3000 objects • Short and “bushy” transactions (thus, few database scans)

Performance Evaluation Database scans

Performance Evaluation Operations (CPU time)

Conclusions • DeltaSSD is very efficient in terms of database scans • DeltaSSD incurs excessive processing in terms of tree matchings • Re-computing the frequent tree-expressions is inefficient • Future work includes: • Investigation of the complete closure approach • Techniques to reduce the processing cost of tree matching

References
[1] Y. Aumann, R. Feldman, O. Lipshtat and H. Mannila, "Borders: An Efficient Algorithm for Association Generation in Dynamic Databases", Journal of Intelligent Information Systems, vol. 12, no. 1, pp. 61-73, 1999.
[2] R. Feldman, Y. Aumann, A. Amir and H. Mannila, "Efficient algorithms for discovering frequent sets in incremental databases", Proceedings of the ACM Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'97), 1997.
[3] H. Mannila and H. Toivonen, "Levelwise Search and Borders of Theories in Knowledge Discovery", Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 241-258, 1997.
[4] V. Pudi and J. Haritsa, "Quantifying the utility of the past in mining large databases", Information Systems, vol. 25, no. 5, pp. 323-343, 2000.
[5] S. Thomas, S. Bodagala, K. Alsabti and S. Ranka, "An efficient algorithm for the incremental updation of association rules in large databases", Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 263-266, 1997.
[6] K. Wang and H. Liu, "Discovering Structural Association of Semistructured Data", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, 2000.
[7] A. Zhou, Jinwen, S. Zhou and Z. Tian, "Incremental Mining of Schema for Semistructured Data", Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pp. 159-168, 1999.
