Analysis and Evaluation of a Native XML Database

Analysis and Evaluation of a Native XML Database Kenneth H. Wenker, Ph.D. May 19, 2003 Dr. C. Edward Chow, Advisor University of Colorado, Colorado Springs

Outline of the Presentation 1. The nature of this masters project 2. XML and databases 3. NeoCore XML database architecture 4. Testing architecture 5. Testing results 6. Conclusion

PART 1: The nature of this masters project

Goals of the Master Project • To analyze and explain some significant architectural features of the NeoCore XML database, specifically those that have been patented and are therefore public • To test the performance of those features as they have been implemented in v. 2.6 for Windows • To create benchmarks for evaluating the XML database performance

PART 2: XML and databases

What is an XML Database? • information = data + context • THE CHALLENGE: with XML, not only the data, but also the context, can vary. XML is inherently extensible. • A traditional RMDBS is not inherently extensible. • An “XML-enabled” database is an RMDBS which has been modified in an attempt to handle XML extensibility. • XML-enabled databases need to deal with both storing and reconstructing the tag structure. • A “native XML database” is specifically designed to process XML documents efficiently • Tamino: must provide some sort of structure definition • dbXML: weak doing updates and inserts

USES FOR AN XML DATABASE • e-commerce—since we are frequently using XML to transport data, it would be efficient to store it that way, provided we can store it, manipulate it, and retrieve it easily. • When the tag structures are frequently and unpredictably changing, as in, for example, DNA research. • When the tag structures are more important than the data, as in (perhaps) a next-generation search engine.

XML Database Benchmarks • No authoritative group has recognized any benchmark for XML databases • There are five academic benchmarks • The Michigan Benchmark (Univ. of Michigan) • XOO7 (Nat’l Univ. of Singapore) • XBench (Univ. of Waterloo, Canada) • XMach1 (Univ. of Leipzig) • XMark (National Research Institute for Mathematics and Computer Science in the Netherlands) • The Michigan Benchmark is designed to allow developers to tune databases. The others are usable for this project. • None of them provides an adequate test of extensibility, in my opinion. The extensibility they allow is within the bounds of predefined schema.

XOO7 BENCHMARK • Document Structure: <Module> <Manual>Text of Configurable Length</Manual> <ComplexAssembly> <ComplexAssembly> <BaseAssembly> <CompositePart> <Document>Text of Configurable Length</Document> <Connection> <AtomicPart> • Typical Configuration File: NumAssmPerAssm 3NumCompPerAssm 3NumCompPerModule 50NumAssmLevels 5NumAtomicPerComp 200NumConnPerAtomic 3DocumentSize 1000ManualSize 3800

PART 3: NeoCore XML database architecture

CLIENT SERVER Console (runs on client’s browser) Used for administration Used for manual input NeoCore XMS HTTP DB User Application API C++, Java, VB, .COM, or create your own HTTP HTTP CLIENT/SERVER OVERVIEW

Icon Gene-rator Set of Indices Map Store Tag Dictionary Data Dictionary Figure 2-2. Overview of NeoCore Database Architecture OVERVIEW OF NEOCORE STORAGE ARCHITECTURE

THE “ICON” • It is a transform of a string of indefinite length to a n-bit binary number. • Similar to a hash or CRC code. • The width of the icon is typically 64 bits. The first part of the icon, depending on the number of rows in the index, is used to specify which row of an index is to be accessed. The rest is used as a “confirmer” to be explained shortly. • NeoCore uses a patented technique to create the icon: to create a 64-bit icon you would access four 256-row tables and XOR the resultant values. The technique allows processing 32 bits at a time from the input string.

AN EXAMPLE: 1. The XML Document <my-phone-book> <entry> <name>me</name> <nmbr>6198654227</nmbr> </entry> </my-phone-book>

AN EXAMPLE: 2. The Data Dictionary   me  6198654227                                                                                                                                                                                          

AN EXAMPLE: 3. The Tag Dictionary   my-phone-book>entry>name>   entry>name>   name>   my-phone-book>entry>nmbr>   entry>nmbr>   nmbr>                                                                                                                                                                                                                                                                                                                                                                               

AN EXAMPLE: 4. The Map Store

AN EXAMPLE: 5. An Empty Data Core Index 1000 0111 0110 0101 0100 0011 0010 0001 0000

AN EXAMPLE: 6. The Data Core Index, 1 entry 1000 0111 0110 0101 0100 0011 0010 0001 0000

AN EXAMPLE: 7. The Data Core Index, 2 entries 1000 0111 0110 0101 0100 0011 0010 0001 0000

Dupe Index Map Store Data Dictionary Icon Gen-erator Index “Smith” “Smith” Figure 6-2. Pointers in the Duplicate Index for a Specific Data Item A TYPICAL QUERY for $a in document(“myPhoneBook.xml”)//[last=“Smith”] return count($a) “last>Smith”

KEY ARCHITECTURAL INNOVATIONS • Separate Tag Store and Data Store • No-string Map Store • Multi-Table Icon Generator • 1.5 Look-ups Core Indices • Use of Duplicate Indices

PART 4: Testing architecture

TESTING PHILOSOPHY • Interested only in the core database, not in the broader XMS • For this round of testing, did not evaluate the ability to handle huge amounts of data • A good test of the NeoCore architecture requires testing it with several different kinds of XML documents, as shown on the next page.

DOCUMENT TYPES USED

CLIENT SERVER Console NeoCore XMS Docu- ment Gene- ration Query Gene- ration HTTP KornShell Tools DB OS File System NeoCore API Query Tool HTTP Store Tool TESTING ARCHITECTURE Client and server were run on the same Dell 350 platform, using Windows XP Professional, a 3.0 GHz CPU, and 1.50 GB of RAM.

NINE TESTS 1. Disk space used 2. Flattening & restoring 3. Storage speed 4. Query accuracy 5. Query speed 6. Full core-index check 7. Extensibility test 8. Insertion test 9. Non-indexed string searches

PART 5: Testing results

DISK STORAGE USED Original XOO7 Benchmark Mean Results: approx 3.0:1 NeoCore XOO7 Benchmark Results: approx 2.5:1

Storage Time • Mean time to store a “small3.config” XOO7 Document was slightly over 21 seconds. • The XOO7 team reported the time to prepare or convert the XML document so that it could be stored. The mean time for a “small3.config” document was about 5 minutes. They did not report how much time it subsequently took to store the document. • Our testing: Dell 350, Windows XP Professional, 3GHz CPU, 1.5GB RAM. XOO7 original testing: SunOS 5.7 Unix, 333MHz, 256MB RAM

QUERY TIME

FINDINGS--STRENGTHS • Handles extensibility well—no difference between extensibility documents and XOO7 documents • Indexed searches very fast (if enough RAM)—almost 300 times faster than the original XOO7 findings for some queries (although some of that difference is due to difference in platform capabilities) • Disk requirements about the same as needed by the XOO7 team just for their conversion documents • Efficiency remains even when the core indices are full—effective collision management

FINDINGS--WEAKNESSES • In some cases, we do not get out exactly what was put in: • White-Space inaccuracies • Some dropped comments • Some DOCTYPE information is dropped or enclosed in database-created “<prolog>” tags • Duplicate indices can be a bottleneck • Core indices do not scale easily if expansion is needed

PART 6: Conclusion

FUTURE RESEARCH NEEDS • Conduct this testing on a more heavy-duty platform using more robust data sets. • Test future releases—3.0 to have significant changes. • Analyze the patent for new duplicate indices once it is published on the USPTO site. • Publish an article on the (currently unimplemented) non-indexed substring search algorithm. • Do the same testing with NeoCore competitors, if they will support it (Tamino would not). • Publish related articles in IT journals.

Conclusion • On each key performance indicator the NeoCore XMS is as good or better than the initial results reported by the XOO7 team. • It handles the extensibility of XML well—its primary strength. • To realize full performance, run with enough accessible RAM to hold all indices and still have plenty of room for buffers. • The NeoCore database remains a work in progress: significant improvements are being planned for future releases. • Acceptance likely to depend more on the broader management system than on the database at its heart.

Analysis and Evaluation of a Native XML Database

Analysis and Evaluation of a Native XML Database

Presentation Transcript

Native XML Database for Information Systems

Database evaluation:

SQL/XML, XQuery , and Native XML Programming Languages

XML Storage and Indexing Native XML

An RDF and XML Database

Native XML Database for Information Systems

Rainbow: XML and Relational Database Design, Implementation, Test, and Evaluation

Database Systems and XML

The NATIVE XML Server

Rainbow: XML and Relational Database Design, Implementation, Test, and Evaluation

IMS XML Database and XQuery

IMS XML Database and XQuery

Storing XML using native storage

Sedna: A Native XML DBMS

Native XML Databases

XML Storage and Indexing Native XML

Native Names Database

XML Native Query Processing

USE OF JAVA APPLICATION AND XML DATABASE FOR EMERGY EVALUATION OF AGRICULGURAL SYSTEMS

TIMBER A Native XML Database

XML 과 Database

Native XML Databases