This paper explores the concept of approximate validity in the context of XML streaming data sources. It presents algorithms for making approximate decisions that are both correct and robust, utilizing statistics-based computations. The study highlights generalized statistics on trees and property testing techniques, including approximate satisfiability and equivalence in XML. An efficient streaming algorithm is proposed for approximating statistics matrices while minimizing memory usage. The findings have implications for data exchange and integration in real-time data processing applications.
Approximate Validity of XML Streaming Data • HUANG Cheng, LI Jun: University Paris-Sud & Huazhong University of Science and Technologies • Michel DE ROUGEMONT: University Paris II
Motivation • Streaming Data from different sources • Approximate decisions • Correct • Robust • Statistics based computations
Plan • Generalized Statistics on Trees • Statistics allow approximate validity on words and trees, based on Property Testing (Edit Distance with Moves) • Property Testing for Regular Tree Languages (ICALP 2004) • Approximate Satisfiability and Equivalence (LICS 2006) • Approximate validity on Streaming Data
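Edit Distances with Moves • Classical Edit Distance: Insertions, Deletions, Modifications • Edit Distance with Moves: additionally, a whole block can be moved in one operation. Example: 0111000011110011001 becomes 0111011110000011001 by a single move of the block 1111 • Edit Distance with Moves generalizes to Ordered Trees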
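The move in the example above can be checked mechanically. The following is a minimal sketch; the helper name `move_block` and its index convention are illustrative assumptions, not part of the slides. It only applies one move, it does not compute the edit distance with moves.

```python
def move_block(word, start, length, dest):
    """Cut word[start:start+length] and reinsert it at index dest of the remainder.

    Hypothetical helper: models one 'move' operation on a word.
    """
    block = word[start:start + length]
    rest = word[:start] + word[start + length:]
    return rest[:dest] + block + rest[dest:]


u = "0111000011110011001"
v = "0111011110000011001"

# Move the block "1111" (positions 8..11 of u) so that it starts at position 5.
moved = move_block(u, 8, 4, 5)
print(moved == v)
```

With only insertions, deletions and modifications, the same transformation needs several character-level operations, which is why moves are counted as a single step.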
Statistics on words (k-gram) • A word W of length n has n-k+1 blocks of length k = 1/ε • For W = 001010101110 and k = 2, there are n-k+1 = 11 blocks: 00, 01, 10 and 11 occur 1, 4, 4 and 2 times respectively
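The k-gram statistics of the example word can be computed directly. This is a small sketch assuming that the statistics vector is the relative frequency of each block of length k; the function names `kgram_counts` and `ustat` are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction


def kgram_counts(word, k):
    """Count every block of k consecutive letters of word."""
    return Counter(word[i:i + k] for i in range(len(word) - k + 1))


def ustat(word, k):
    """Relative frequency of each k-gram (assumed definition of the statistics vector)."""
    counts = kgram_counts(word, k)
    total = len(word) - k + 1
    return {block: Fraction(c, total) for block, c in sorted(counts.items())}


W = "001010101110"
stats = ustat(W, 2)
for block, freq in stats.items():
    print(block, freq)
```

Exact fractions are used so that the frequencies sum to exactly 1, which makes the vector easy to compare against a reference.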
Statistics for unranked ordered trees • Transformation: Rabin Encoding [Figure: an unranked tree and its encoding as an extended 2-ranked tree]
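One common reading of this transformation is the first-child / next-sibling encoding of an unranked ordered tree into a binary tree. The sketch below assumes that reading; the tuple representation and the name `encode_forest` are illustrative, and the extra padding that makes the tree "extended" on the slide is not reproduced.

```python
def encode_forest(forest):
    """Encode a list of sibling trees as a binary tree.

    Each unranked tree is (label, [children]).
    Each binary node is (label, left, right) where left encodes the
    first child and right encodes the next sibling; None marks absence.
    """
    if not forest:
        return None
    (label, children), siblings = forest[0], forest[1:]
    return (label, encode_forest(children), encode_forest(siblings))


# Unranked tree a(b, c(d)).
tree = ("a", [("b", []), ("c", [("d", [])])])
binary = encode_forest([tree])
print(binary)
```

After the encoding every node has at most two successors, so sub-paths can be described by the sequence of left/right choices taken along them.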
Statistics on Trees: generalized k-gram • We abbreviate “author” by a, “db” by b, “work” by w • Types of sub-paths: 00, 01, 10, 11 [Figure: example tree over the labels a, b, w]
2. Approximate validity based on Property Testing • Let F be a property on a class K of structures U • An ε-tester for F is a probabilistic algorithm A such that: if U |= F, A accepts; if U is ε-far from F, A rejects with high probability • A property F is testable if there exists a probabilistic algorithm A such that, for all ε, A is an ε-tester for F and Time(A) is independent of n = |U| • R. Rubinfeld, M. Sudan, Robust characterizations of polynomials, 1994 • O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996 • A tester usually implies a linear-time corrector • (ε1, ε2)-Tolerant Tester
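To illustrate how a probabilistic algorithm can run in time independent of n, the sketch below estimates k-gram frequencies from a fixed number of random blocks. It is an illustration only: the function name `sample_ustat`, the fixed seed, and the sample size are assumptions, and the sample size a real ε-tester needs is given in the cited papers, not here.

```python
import random
from collections import Counter


def sample_ustat(word, k, samples, rng=None):
    """Estimate k-gram frequencies of word from `samples` random blocks.

    The number of positions read depends on k and samples, not on len(word).
    """
    rng = rng or random.Random(0)  # fixed seed, only for reproducibility
    n_blocks = len(word) - k + 1
    hits = Counter()
    for _ in range(samples):
        i = rng.randrange(n_blocks)
        hits[word[i:i + k]] += 1
    return {block: c / samples for block, c in hits.items()}


estimate = sample_ustat("001010101110" * 1000, 2, 200)
print(estimate)
```

A tester built on such an estimate would accept when the estimated vector is close to the set of statistics of valid inputs and reject otherwise.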
Regular membership on words • H = {ustat(w) : w in r} is a union of polytopes [Figure: 2 polytopes for r, the point Y(w), and the Membership Tester]
3. Streaming Data • The goal: decide if a given XML file is ε-valid for a DTD • Our work: an algorithm that computes a statistics matrix sustat(t), which approximates the matrix ustat(t), using constant space
Data structure for Streaming Data • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> [Figure: the tree of the stream and the corresponding data structure]
Unbounded data structures • Stream: <a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a> [Figure: the unbounded data structure built from the stream]
Bounded data structure • Suppose the length of the queues is limited to 4 (a constant) • Some of the matrix entries will be missing [Figure: the bounded data structure for the same stream]
Streaming algorithm • Definition: a k-fork is a node with 2 distinct paths of length more than 2k • Streaming algorithm: • Input: on an opening tag <a>: bounded push, update sustat(t); on a closing tag </a>: pop, recover, update sustat(t) • Output: matrix sustat(t) • Example with k=3: entries missed: b-f-d…; entries recovered: d-c-d…
Streaming algorithm • Key Lemma: #forks • Theorem: sustat(t) approximates ustat(t) if Memory = 2*k
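The effect of bounded memory can be illustrated with a simplified sketch. It is not the authors' algorithm: it only counts ancestor paths of k labels in the unranked tree, keeps the most recent open tags in a bounded window, and has no recover step. The function name `path_kgram_counts` and the `memory` parameter are assumptions made for this example.

```python
import re
from collections import Counter, deque


def path_kgram_counts(stream, k, memory):
    """Count ancestor paths of k labels using at most `memory` stored tags."""
    window = deque(maxlen=memory)  # most recent open ancestors only
    depth = 0                      # true depth, a single counter
    counts, missed = Counter(), 0
    for closing, tag in re.findall(r"<(/?)(\w+)>", stream):
        if closing:
            depth -= 1
            if window:
                window.pop()
        else:
            depth += 1
            window.append(tag)     # silently drops the oldest tag when full
            if depth >= k:
                if len(window) >= k:
                    counts[tuple(list(window)[-k:])] += 1
                else:
                    missed += 1    # ancestors were forgotten: entry missing
    return counts, missed


stream = ("<a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d>"
          "<e></e><f></f></a>")
full, missed_full = path_kgram_counts(stream, k=2, memory=4)
small, missed_small = path_kgram_counts(stream, k=2, memory=2)
print(sum(full.values()), missed_full, sum(small.values()), missed_small)
```

On the example stream, a window of 4 tags sees every parent-child pair, while a window of 2 tags loses the pairs whose parent was dropped while a deeper subtree was being read. Restoring such entries is the job of the recover step named on the slide.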
Approximate validity on streaming data • Streaming test (Memory = 2*k) [Figure: diagram relating Y(t), ustat(t), sustat(t) and the DTD]
Results: http://www.up2.fr/xmlstream/ • Gstat(t) • XML file source: XMark, http://monetdb.cwi.nl/xml/
Results: http://www.up2.fr/xmlstream/ • Lstat(t) • XML file source: XMark, http://monetdb.cwi.nl/xml/
Conclusion • Statistics of trees: • Generalization of a k-gram • Easy to compute on a DOM • Approximate statistics on Streaming Data • Approximate validity • Data Exchange • Data Integration