300 likes | 468 Vues
Bellman: A Data Quality Browser. Tamraparni Dasu Theodore Johnson S. Muthukrishnan Vladislav Shkapenyuk Amit Marathe Contact: marathe@research.att.com. Exploring Large and Complex Databases. AT&T has a lot of large databases.
E N D
Bellman: A Data Quality Browser Tamraparni Dasu Theodore Johnson S. Muthukrishnan Vladislav Shkapenyuk Amit Marathe Contact: marathe@research.att.com
Exploring Large and Complex Databases • AT&T has a lot of large databases. • Many types of offerings (business/consumer/hosting, voice: LD/local, data: frame/ip/vpn/voip, etc). • Many aspects of providing a service (sales/provisioning/billing/network maintenance/customer care). • Many databases, each with 100’s to 1000’s of tables. • These databases are complex • Legacy systems and procedures (AT&T is an old company). • AT&T’s scale requires local autonomy. • Different conventions and encodings in different domains • Complex interdependencies. • A user’s order touches many databases at some point or other. • A customer may order many types of services, etc.
Our Problem • We provide special consulting services to AT&T Business units. • Data mining studies. • Targeted data cleanup. • Help with large scale database integration and cleanup efforts. • “Special services” means that we are always looking at new data sets. • ODBC access to databases. • Web scraping • Delimited ascii dumps. • We need help …
What is Bellman? • A “data quality browser” • A platform for integrating both simple and advanced data browsing tools. • A platform for integrating data browsing and data cleanup tools. • A platform for browsing and correlating multiple disparate databases and data sets. • Current status: the platform and the tools are in good shape, the UI and integration could use some work.
Database Profiling • Collect summaries of the tables, fields, and data in the databases. • Store in a supplementary database. • Use these summaries to • Supplement database browsing. • Answer sophisticated queries about database structure. • Support data mining on the database structure.
Bellman History • Java-Bellman (first version) • Java/JDBC client software. • Collect profiles using C++/ODBC • Oracle, Sybase, Teradata, Informix, SQL Server • Problem : installation is difficult, especially to configure the ODBC drivers. • Web-Bellman (current version) • ASP .NET/C# front end, C++ profiling. • No client software installation required. • Centralized management of ODBC drivers and some advanced tools.
Bellman Screen Shots • Bellman is best shown through some examples. • Problem: • Its only interesting with real databases. • But the data is AT&T Proprietary • We’ll do our best to show Bellman without getting ourselves fired.
Development site Bellman is the browser Spider is an application of our approximate string matching tools.
Database Connections Ted’s account is configured to access these databases. Profile queries are restricted to them.
Find Similar • What other fields in the database contain similar data? • Many values in common (exact match) : resemblance • “Textually similar” • Many q-grams in common • Distribution of q-grams is similar. • The fields have a similar name. • Uses • Finding join paths • Finding join paths which require field transforms • Finding other fields with names/IDs/etc.
Algorithms • Field similarity by exact match • Min Hash signatures to compute resemblance • Field similarity by substring similarity • Resemblance of q-grams • Distribution of q-grams • Efficient key finding
Set Resemblance • The resemblance of two sets A, B is, |A • After some algebra, we find that |A = ,(|A|+|B|)/(1+ ,) • There are fast algorithms to compute … • Application to Bellman • The sets in question are the unique values in the fields • The resemblance of two fields is a good measure of their similarity.
Min Hash Signatures A m(A)<m(B) • h(a) is a hash function. • m(A) = min(h(a), a in A). • Observation:Pr[m(A) = m(B)] = , • Why? • Universe restricted to AB • Equality only in AB • Repeat the experiment many times. • Signature:(m1(A), m2(A), …, mk(A)) ignored m(A)=m(B) m(A)>m(B) B
Min Hash Implementation • (Schema, Table, Field) maps to unique ID • 256 hash functions • Bellman_Field_Sig table has the schema (ID, HashNum, MinHash) • MinHash is a 32-bit integer (collisions become an issue for large sets)
Min Hash Implementation (cont.) • SQL to find fields similar to field w/ ID 23 select S2.ID, count(*) from Bellman_Field_Sig S1, Bellman_Field_Sig S2 where S1.ID=23 and S1.HashNum = S2.HashNum and S1.MinHash=S2.MinHash group by S2.ID order by count(*) desc
Textual similarity • Q-gram : collection of consecutive q-letter substrings. • Example: 3-grams of “Bellman” • Bel, ell, llm, lma, man • Q-gram resemblance : compute signatures of the set of q-grams of a field instead of field values. • Q-gram similarity : L2 distance of the frequency distribution of the q-grams of a field. • Use sketches to store a compact approximate summary of the q-gram distribution.
Q-gram Sketches • V is a d dimensional vector • Sk(V) is the k dimensional vector (V.X1, V.X2, …, V.Xk) where Xi are random d dimensional vectors • Typically, k << d • L2(V1, V2) is approximated by L2(Sk(V1), Sk(V2))
Q-gram Sketch Implementation • 256 random sketch vectors • Bellman_Field_QSketch (ID, SkNum, SkVal) • Tuples correspnding to a particular ID constitute the normalized sketch vector for that field • SkVal has datatype float
Q-gram Sketch Implementation • SQL to find fields similar to field w/ ID 23 select S2.ID, sum((S1.SkVal – S2.SkVal) ** 2) from Bellman_Field_QSketch S1, Bellman_Field_QSketch S2 where S1.ID = 23 and S1.SkVal = S2.SkVal group by S2.ID order by sum((S1.SkVal – S2.SkVal) ** 2) desc
Finding Keys • Key a (minimal) set of fields whose composite value is unique in every record. • Problem: finding all keys of length up to k in a table with F fields requires O(Fk) expensive count distinct queries. • Solution: • Eliminate “bad” fields : floats, mostly NULL, etc. • Collect an in-memory sample • Can store hashes of long strings. • Compute count distinct queries on the sample • Verify keys by query against database • Use Tane-style level-wise pruning.
Finding Keys (cont.) • Upto 4-way keys found by Bellman • Candidate-1 keys are all the “good” fields • In iteration i+1, consider the cross product of candidate-i keys and candidate-1 keys • Classify each set of i+1 fields so obtained as an approximate key, candidate-(i+1) key or a functional dependency. • Thresholds: key_threshold, min_improvement
Reference • Mining Database Structure: Or, How to Build a Data Quality Browser – Dasu, Johnson, Muthukrishnan, Shkapenyuk, ACM SIGMOD 2002.