Universal, Composable Indexing Queries, Text, Spatial Data, XML Structure, and XML Semantics
Universal, Composable Indexing: Queries, Text, Spatial Data, XML Structure, and XML Semantics. Chris Biow, Federal CTO; Wayne Feick, Lead Engineer; Mark Logic
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
MarkLogic XML Server: DBMS, Search, Application Server
What is MarkLogic Server? • An XML Server • A hybrid (integrated parts) • Special-purpose DBMS for XML, with enterprise expectations: ACID transactions; DBA, backup, replication • Search engine kernel, with enterprise expectations: full text; faceted navigation, at massive scale; Boolean, proximity, stemming, tokenization, decompounding, case, diacritics • Application Server: HTTP; XCC Java/.NET; WebDAV
MarkLogic as Special DBMS • Not relational (RDBMS) • Schema agnostic • Text a first-class citizen among data types • XQuery (rather than SQL) • Search engine algorithms for many DB queries: O(1) in number of docs; O(log(n)) for range indexing; O(n) in results • Very low DBA overhead (0.5 FTE / 100 hosts): 5-minute install; 5-minute scale-out • Database and search engine are the same
MarkLogic as Special Search Engine • Understands document structure • Transactional: high CRUD load • Unicode: diacritics, collations, languages • Holds the documents: reindexing, delivery • Geospatial: box, point/radius, polygon • Alerting: profile, alert, filter, tipping, selector, “trigger,” … • Analytics: facets, co-occurrence, word lexicons, … • Everything composes (e.g. geo-alerting, geo-text-data, search-alerting) • Processing near the data • Relational joins and inferencing • Database and search engine are the same
[Figure: XML document tree with a timeline element containing two message elements (@id 3 and @id 5), each with a status element holding text (“Testing XProc”, “Oh boy…”). Legend: element node, attribute node, text node.]
MarkLogic as Special App Server • Native HTTP(S) server: RESTful XML by default; transform to HTML; transform to PDF, MS Office, etc.; PKI with no dependencies; optionally with external auth • HTTP(S) client • XCC Java / .NET server: similar to JDBC / ADO.NET • WebDAV: folder on the user’s desktop
[Figure: RESTful Architecture. A URL Rewriter (Routes.xqy) in front of Resources (user/ with user.xqy, notes/ with note.xqy; Get, Put, Post, Delete) and Representations (.json.xqy, .xml.xqy, …).]
MarkLogic at Scale • Scale up: typically 1TB or more per server • Scale out: low hundreds ++ • Commodity hardware, typically ~$5K HW budget per server: 2-CPU x 4-core; 32+ GB RAM; 3x disk (local mount with failover) • OS: Linux RHEL 5; Solaris 10; Windows 2003/8 (XP/Vista/7 for dev)
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Full-text Indexing (Search): Find all documents that contain the phrase “uniquely identify”
[Figure: term-by-document bit matrix for the terms “can”, “uniquely”, “identify”, “however” over documents 129, 122, 127, 123, 130, 126, 124, 125.]
Full-text Indexing (Search): Find all documents that contain the phrase “uniquely identify”
UNIVERSAL INDEX
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“uniquely identify”: 126, 130, 167, 212, 219, 377 . . .
“which uniquely”: . . .
“identify each”: . . .
Document References: 126, 130, 167, 212, 219, 377 . . .
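The lookup on this slide can be pictured with a toy inverted index. The sketch below is a simplified model in Python, not MarkLogic's implementation: the function names `build_index` and `phrase_candidates` are made up for illustration, and it assumes whitespace tokenization and that two-word phrases are indexed as terms alongside single words, as the “uniquely identify” entry suggests.

```python
from collections import defaultdict


def build_index(docs):
    """Map every word and every adjacent word pair to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            index[word].add(doc_id)
        for first, second in zip(words, words[1:]):
            index[f"{first} {second}"].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the term lists of the phrase's word pairs (or of the single word)."""
    words = phrase.lower().split()
    terms = [" ".join(pair) for pair in zip(words, words[1:])] or words
    result = None
    for term in terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return sorted(result or [])


docs = {
    126: "has values which uniquely identify each element",
    129: "we can uniquely name it however",
    123: "which can identify each row",
}
index = build_index(docs)
print(phrase_candidates(index, "uniquely identify"))  # [126]
```

For phrases longer than two words, the intersection of pair terms only yields candidates; a real engine would still have to confirm word positions before reporting a match.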
Word Lexicons • Revertible index of words • Word-level search (wildcarding) • Corpus analysis: most common words; document-frequency of words • Result-set intersection: most common words in result set; document-frequency of chosen words in result
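A rough illustration of what a word lexicon enables, written as a Python sketch rather than anything MarkLogic-specific. The names `document_frequency` and `words_with_prefix` are hypothetical, tokenization is plain whitespace splitting, and the lexicon is modeled as a sorted list of distinct words.

```python
import bisect
from collections import Counter


def document_frequency(docs, doc_ids=None):
    """Count, per word, how many documents contain it.

    With doc_ids given, the count is restricted to that result set.
    """
    ids = docs.keys() if doc_ids is None else doc_ids
    df = Counter()
    for doc_id in ids:
        df.update(set(docs[doc_id].lower().split()))
    return df


def words_with_prefix(lexicon, prefix):
    """Trailing-wildcard lookup (prefix*) on a sorted list of words."""
    lo = bisect.bisect_left(lexicon, prefix)
    hi = bisect.bisect_left(lexicon, prefix + "\uffff")
    return lexicon[lo:hi]


docs = {1: "data base data", 2: "data model", 3: "web model"}
corpus_df = document_frequency(docs)          # whole corpus
result_df = document_frequency(docs, [2, 3])  # one result set only
lexicon = sorted(corpus_df)
print(corpus_df["data"], result_df["data"])   # 2 1
print(words_with_prefix(lexicon, "d"))        # ['data']
```

Because the lexicon is sorted, the wildcard lookup is two binary searches rather than a scan over every word.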
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
XML Structure: Find all articles that have an abstract
<article>
  <title>A Relational Model of Data for Large Shared Data Banks</title>
  <author><first-name>Edgar</first-name><last-name>Codd</last-name></author>
  <abstract> Future users of data banks must be protected from having to know how the data is organized in the machine (the internal representation). . . . Changes in data representation will often be needed . . . </abstract>
  <body>
    <section>
      <section> … has values which uniquely identify each element ... </section>
    </section>
    <section>… version of <product>IMS</product> provides the user . . . </section>
  </body>
  <metadata><vol>13</vol><number>6</number><year>1970</year></metadata>
</article>
XML Structure: Find all articles that have an abstract
UNIVERSAL INDEX
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“uniquely identify”: 126, 130, 167, 212, 219, 377 . . .
<article>: . . .
<article>/<abstract>: . . .
Document References: 126, 130, 167, 212, 219, 377 . . .
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
XML Semantics: Find all documents that mention the product “IMS” (same sample <article> document as on the XML Structure slide)
XML Semantics: Find all documents that mention the product “IMS”
UNIVERSAL INDEX
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“uniquely identify”: 126, 130, 167, 212, 219, 377 . . .
<article>: . . .
<article>/<abstract>: . . .
<product>IMS</product>
Document References: 126, 130, 167, 212, 219, 377 . . .
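One way to read the last two index slides is that element names, parent/child paths, and element values are simply more terms in the same index. The Python sketch below derives such terms from a document. It is an illustration only: the helper names `structure_terms` and `build_structure_index` are invented here, and the exact set of terms MarkLogic generates is not specified in this deck.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def structure_terms(xml_text):
    """Return element, parent/child, and leaf element-value terms for one document."""
    terms = set()

    def walk(elem, parent_tag):
        terms.add(f"<{elem.tag}>")
        if parent_tag is not None:
            terms.add(f"<{parent_tag}>/<{elem.tag}>")
        text = (elem.text or "").strip()
        if text and len(elem) == 0:  # value terms only for leaf elements
            terms.add(f"<{elem.tag}>{text}</{elem.tag}>")
        for child in elem:
            walk(child, elem.tag)

    walk(ET.fromstring(xml_text), None)
    return terms


def build_structure_index(docs):
    """Map each structural term to the set of doc IDs that produce it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for term in structure_terms(xml_text):
            index[term].add(doc_id)
    return index


docs = {
    126: "<article><abstract>Future users</abstract><body><section>version of "
         "<product>IMS</product> provides</section></body></article>",
    130: "<article><body><section>about <product>DB2</product></section></body></article>",
}
index = build_structure_index(docs)
print(sorted(index["<article>/<abstract>"]))    # [126]
print(sorted(index["<product>IMS</product>"]))  # [126]
```

With structure expressed as terms, “articles that have an abstract” and “documents that mention the product IMS” become the same kind of term-list lookup as a word search.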
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Range Indexing: Bi-directionally index ordered values (YEAR:int, Volume:text) to document (fragment) IDs
UNIVERSAL INDEX
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“data base”: 126, 130, 167, 212, 219, 377 . . .
<article>: . . .
<article>/<abstract>: . . .
<product>IMS</product>
Document References: 126, 130, 167, …
Range indexes: YEAR, Volume
Range indexes / lexicons • Scalar or ordered discrete • XS data types • Inequalities • Range queries • Aggregations • Corpus or Result-specific • Discrete • Bucketed (query-time bounds) • Faceted Navigation • Fundamentally O(log(n)) • Memory-mapped
Aggregate Queries: How many of the articles that contain “data base” were written in each of the last 5 decades? (same sample <article> document as on the XML Structure slide)
Aggregates: How many of the articles that contain “data base” were written in each of the last 5 decades? (same universal index with YEAR and Volume range indexes as on the Range Indexing slide)
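The decade question can be worked through with a toy range index. This Python sketch models a range index as value/document pairs kept in sorted order plus a reverse map from document to value, in the spirit of the "bi-directional" description. The class `RangeIndex`, the function `count_by_decade`, and the year assigned to each document ID are all made up for the example.

```python
import bisect
from collections import Counter


class RangeIndex:
    """Sorted (value, doc_id) pairs plus a doc_id -> value map."""

    def __init__(self, pairs):
        self.entries = sorted(pairs)
        self.values = [value for value, _ in self.entries]
        self.by_doc = {doc_id: value for value, doc_id in self.entries}

    def docs_in_range(self, low, high):
        """Doc IDs with low <= value <= high, found by binary search."""
        lo = bisect.bisect_left(self.values, low)
        hi = bisect.bisect_right(self.values, high)
        return {doc_id for _, doc_id in self.entries[lo:hi]}


def count_by_decade(year_index, doc_ids):
    """Bucket a result set by decade, using the doc -> value direction."""
    return Counter(
        year_index.by_doc[d] // 10 * 10 for d in doc_ids if d in year_index.by_doc
    )


# Hypothetical years for some of the document IDs on the slide.
years = RangeIndex([(1970, 126), (1985, 130), (1999, 167), (2004, 212), (1972, 219)])
data_base_docs = {126, 130, 167, 212, 219, 377}  # term list for "data base"

print(sorted(years.docs_in_range(1970, 1979)))  # [126, 219]
print(sorted(count_by_decade(years, data_base_docs).items()))
# [(1970, 2), (1980, 1), (1990, 1), (2000, 1)]
```

The value-to-document direction answers inequalities and range queries; the document-to-value direction is what lets an arbitrary result set be aggregated without opening the documents.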
Composition: Count all articles that contain “data” in the title and mention the product IMS in a section, grouping by year. (same sample <article> document as on the XML Structure slide)
The Universal Index (with Range Indexes)
UNIVERSAL INDEX (Term: Term List)
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“data base”: 126, 130, 167, 212, 219, 377 . . .
<article>: . . .
<article>/<abstract>: . . .
<product>IMS</product>
Document References: 126, 130, 167, …
Back to Discrete Indexing • Directories: exclusive, hierarchical, analogous to file system, map to URI • Collections: set-based, N:N relationship • Security: invisible to your app
The Universal Index (with Range Indexes)
UNIVERSAL INDEX (Term: Term List)
“which”: 123, 127, 129, 152, 344, 791 . . .
“uniquely”: 122, 125, 126, 129, 130, 167 . . .
“identify”: 123, 126, 130, 142, 143, 167 . . .
“each”: 123, 130, 131, 135, 162, 177 . . .
“data base”: 126, 130, 167, 212, 219, 377 . . .
<article>: . . .
<article>/<abstract>: . . .
<product>IMS</product>
Directory(“/articles/”)
Collection(Red)
Role:Editor + Action:Read
Document References: 126, 130, 167, …
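If directories, collections, and security permissions are stored as terms like any other, composing them with a text or structure constraint reduces to intersecting term lists. A minimal Python sketch of that idea follows. The term spellings, the document IDs in each list, and the function name `and_query` are invented for illustration.

```python
def and_query(index, *terms):
    """Intersect the term lists of all given terms; an unknown term yields no results."""
    lists = [index.get(term, set()) for term in terms]
    return set.intersection(*lists) if lists else set()


# Hypothetical term lists; only the term names echo the slide.
index = {
    '"data base"': {126, 130, 167, 212},
    "<product>IMS</product>": {126, 167, 377},
    'Directory("/articles/")': {126, 130, 167},
    "Collection(Red)": {126, 212},
    "Role:Editor + Action:Read": {126, 167, 212},
}

visible = and_query(
    index,
    '"data base"',
    "<product>IMS</product>",
    'Directory("/articles/")',
    "Role:Editor + Action:Read",
)
print(sorted(visible))  # [126, 167]
```

Adding the security term to every query in this way is one reading of "invisible to your app": the application asks its question and the permission constraint is just one more list in the intersection.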
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Scalar Pairs • Data examples: latitude / longitude; any other pair (e.g. volume / closing price) • Query types: point (exact value); point-radius (circle); lat/lon bound (Mercator “rectangle”); polygon (10K+ vertices) • Aggregations (co-occurrence): discrete facets (symptom : treatment); bucketed continuous; frequency-ordered; heat map
Geospatial Indexing: Points ordered in latitude-major order; special scan operators apply geospatial query constraints
[Figure: GEOSPATIAL INDEX, a list of (latitude, longitude) points with their document IDs, arranged in latitude-major order.]
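A point-radius query over latitude-ordered points can be sketched as a binary search for the latitude band followed by an exact distance check. The Python below is a toy model, not the server's scan operators: `PointIndex` and `haversine_miles` are hypothetical names, the Earth is treated as a sphere of radius 3958.8 miles, and the sample coordinates are the ones used in the carpool example later in the deck.

```python
import bisect
import math

EARTH_RADIUS_MILES = 3958.8  # spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


class PointIndex:
    """(lat, lon, doc_id) triples kept sorted by latitude, then longitude."""

    def __init__(self, points):
        self.points = sorted(points)
        self.lats = [p[0] for p in self.points]

    def within(self, lat, lon, radius_miles):
        # Any point inside the circle lies within this latitude band.
        band = math.degrees(radius_miles / EARTH_RADIUS_MILES)
        lo = bisect.bisect_left(self.lats, lat - band)
        hi = bisect.bisect_right(self.lats, lat + band)
        return {
            doc_id
            for plat, plon, doc_id in self.points[lo:hi]
            if haversine_miles(lat, lon, plat, plon) <= radius_miles
        }


index = PointIndex([
    (37.739976, -121.915821, "passenger-from"),
    (37.53107, -122.264122, "passenger-to"),
    (37.507363, -122.247119, "san-carlos"),
])
print(index.within(37.751658, -121.898387, 5))  # {'passenger-from'}
```

The latitude band prunes most of the index cheaply; only the survivors pay for the trigonometry.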
Query Registration • Canonicalize a cts:query() • Hash for an ID • Resolve the query • Persist the term list in memory • Reuse as materialized sub-query • (AKA Topics, Concepts, Macros, etc.)
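The registration steps listed above can be sketched as follows. This is a Python toy model under stated assumptions: queries are nested tuples such as `("and", ("word", "data"), ("word", "web"))`, canonicalization just sorts and de-duplicates the children of `and`/`or`, the ID is a truncated SHA-1 of the canonical form, and `Registry` is a made-up class. It does not describe how MarkLogic canonicalizes or hashes a cts:query.

```python
import hashlib


def canonicalize(query):
    """Put and/or children in a stable order so equivalent queries look identical."""
    op = query[0]
    if op in ("and", "or"):
        children = sorted({canonicalize(child) for child in query[1:]}, key=repr)
        return (op, *children)
    return query


def query_id(query):
    """Hash the canonical form to get a stable ID."""
    return hashlib.sha1(repr(canonicalize(query)).encode()).hexdigest()[:16]


def resolve(index, query):
    """Evaluate a query against a word -> doc-ID-set index (and/or need >= 1 child)."""
    op = query[0]
    if op == "word":
        return set(index.get(query[1], set()))
    parts = [resolve(index, child) for child in query[1:]]
    if op == "and":
        return set.intersection(*parts)
    if op == "or":
        return set.union(*parts)
    raise ValueError(f"unknown operator: {op}")


class Registry:
    """Keeps resolved term lists in memory, keyed by query ID."""

    def __init__(self, index):
        self.index = index
        self.cache = {}

    def register(self, query):
        qid = query_id(query)
        if qid not in self.cache:
            self.cache[qid] = resolve(self.index, canonicalize(query))
        return qid

    def registered(self, qid):
        """Reuse the stored result as a materialized sub-query."""
        return self.cache[qid]


index = {"data": {1, 2}, "web": {2, 3}}
registry = Registry(index)
a = registry.register(("and", ("word", "data"), ("word", "web")))
b = registry.register(("and", ("word", "web"), ("word", "data")))
print(a == b, registry.registered(a))  # True {2}
```

The sketch ignores what happens when documents change after registration; a real system would have to update or invalidate the stored list.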
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Query Indexing • “Alerting”: selectors, tippers, standing query, filters, “triggers*”, real-time search, content-based routing, stream DBMS, etc. • Search(query, Index) -> docs • Alert(doc, Index) -> queries • Common sub-expression factoring • O(1) in number of alerts • O(n) in results returned • Example alert: “New patient matching your study profile”
* MarkLogic has pre- and post-commit DBMS triggers, which are unrelated
The Reverse Index
REVERSE INDEX (Query: Document References, via Unified Expression Trees)
Queries:
year >= 1970 and (“data” and “size”)
year > 2003 and (“data” and “web”)
year < 2000 and (“data” and “web”)
(2000 <= year <= 2010) and “web”
Shared leaves of the expression trees, joined by “and” nodes: “data”, “size”, “web”, year >= 1970, year < 2000, year >= 2000, year > 2003, year <= 2010
Document References: 437, 562, 597, 623 . . .
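The reverse direction, Alert(doc, Index) -> queries, can be sketched by indexing the leaves of stored queries and evaluating each distinct leaf once per incoming document. The Python below is a toy model: `ReverseIndex` is a made-up class, stored queries are restricted to conjunctions of word and year leaves, and the labels q1 to q4 are hypothetical names for the four queries on the slide (the slide's own ID-to-query mapping is not recoverable from the transcript). It illustrates sub-expression sharing only and makes no attempt to reproduce the scaling behaviour claimed for the real index.

```python
import operator
from collections import defaultdict

OPS = {">=": operator.ge, ">": operator.gt, "<": operator.lt, "<=": operator.le}


class ReverseIndex:
    """Stores conjunctive queries by leaf so shared leaves are tested once per document."""

    def __init__(self):
        self.leaf_to_queries = defaultdict(set)
        self.leaf_count = {}

    def add(self, query_id, leaves):
        leaves = set(leaves)
        self.leaf_count[query_id] = len(leaves)
        for leaf in leaves:
            self.leaf_to_queries[leaf].add(query_id)

    def matches(self, doc):
        """Return the IDs of stored queries that the document satisfies."""
        words = set(doc["text"].lower().split())
        hits = defaultdict(int)
        for leaf, query_ids in self.leaf_to_queries.items():
            if leaf[0] == "word":
                satisfied = leaf[1] in words
            else:  # ("year", op, bound)
                satisfied = OPS[leaf[1]](doc["year"], leaf[2])
            if satisfied:
                for query_id in query_ids:
                    hits[query_id] += 1
        return {q for q, n in hits.items() if n == self.leaf_count[q]}


alerts = ReverseIndex()
alerts.add("q1", [("year", ">=", 1970), ("word", "data"), ("word", "size")])
alerts.add("q2", [("year", ">", 2003), ("word", "data"), ("word", "web")])
alerts.add("q3", [("year", "<", 2000), ("word", "data"), ("word", "web")])
alerts.add("q4", [("year", ">=", 2000), ("year", "<=", 2010), ("word", "web")])

print(sorted(alerts.matches({"year": 2005, "text": "data on the web"})))  # ['q2', 'q4']
```

Here “data” and “web” are each tested once even though several stored queries use them, which is the point of factoring common sub-expressions.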
Alerting in Composition • Scalar: search is on scalar data with range queries; alert is on range data with scalar reverse-queries • Geospatial: search on point data with box, circle, polygon query constraint (compose with text, structure, semantic query); alert on box, circle, polygon data with point reverse-query (compose with text, structure, semantic data) • Search: queries compose with Boolean operations (AND, OR, NOT, nested parentheses); reverse- and forward-query compose (AND, OR, NOT, nested parentheses) • Why would you ever want to do that?
Universal Indexing: Agenda • MarkLogic Server • Text • XML Structure • XML Semantics • Scalars and aggregation • Spatial location • Alerting: query indexing • Search/query composed with Alerting
Search Composed with Alerting • In Soviet Russia, the document queries YOU! • If you are also a document • Documents are XML • Elements • Typed data • Text • Serialized XQuery against documents • Attributes • Typed data • Composition • XQuery incorporating document and Booleans
Matchmaking • Constraints upon each other. • Matching pairs or one-to-many • Suitable date • man / woman • mixed pool • Carpool ride: driver / rider • Employment: job / resume • Medication: patient / drug • Search security: document / user • Battle: target / shooter
Carpool • Driver • Non-smoking woman driving from San Ramon to San Carlos, leaving at 8AM, listens to rock, pop, hip-hop, wants $10 for gas • Requires female passenger within five miles of start and end • Passenger • Woman will pay up to $20 • From: 3001 Summit View Dr, San Ramon, CA 94582 • To: 300 Oracle Pkwy, Redwood City, CA 94065 • Requires non-smoking car • Won’t listen to country music
Driver
let $from := cts:point(37.751658, -121.898387) (: San Ramon :)
let $to := cts:point(37.507363, -122.247119) (: San Carlos :)
return xdmp:document-insert(
  "/driver.xml",
  <driver>
    <from>{$from}</from>
    <to>{$to}</to>
    <when>2010-01-20T08:00:00-08:00</when>
    <gender>female</gender>
    <smoke>no</smoke>
    <music>rock, pop, hip-hop</music>
    <cost>10</cost>
    <preferences>
      {
        cts:and-query((
          cts:element-value-query(xs:QName("gender"), "female"),
          cts:element-geospatial-query(xs:QName("from"), cts:circle(5, $from)),
          cts:element-geospatial-query(xs:QName("to"), cts:circle(5, $to))
        ))
      }
    </preferences>
  </driver>)
Passenger
xdmp:document-insert(
  "/passenger.xml",
  <passenger>
    <from>37.739976,-121.915821</from>
    <to>37.53107,-122.264122</to>
    <gender>female</gender>
    <preferences>
      {
        cts:and-query((
          cts:not-query(cts:element-word-query(xs:QName("music"), "country")),
          cts:element-range-query(xs:QName("cost"), "<=", 20),
          cts:element-value-query(xs:QName("smoke"), "no"),
          cts:element-value-query(xs:QName("gender"), "female")
        ))
      }
    </preferences>
  </passenger>)
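The matchmaking idea, where each document carries both data and a query over the other party's data, can be mimicked outside the server. The Python sketch below uses plain predicates in place of the serialized cts:query preferences and checks both directions. Everything apart from the carpool values themselves (the helper names, the dictionary layout, the spherical distance formula) is an assumption made for illustration and says nothing about how the match would be executed inside MarkLogic.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # spherical approximation


def miles_between(a, b):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlambda = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


driver = {
    "from": (37.751658, -121.898387),  # San Ramon
    "to": (37.507363, -122.247119),    # San Carlos
    "gender": "female",
    "smoke": "no",
    "music": ["rock", "pop", "hip-hop"],
    "cost": 10,
}
# The driver's constraints on a passenger.
driver["preferences"] = [
    lambda p: p["gender"] == "female",
    lambda p: miles_between(p["from"], driver["from"]) <= 5,
    lambda p: miles_between(p["to"], driver["to"]) <= 5,
]

passenger = {
    "from": (37.739976, -121.915821),
    "to": (37.53107, -122.264122),
    "gender": "female",
}
# The passenger's constraints on a driver.
passenger["preferences"] = [
    lambda d: "country" not in d["music"],
    lambda d: d["cost"] <= 20,
    lambda d: d["smoke"] == "no",
    lambda d: d["gender"] == "female",
]


def satisfies(candidate, preferences):
    return all(preference(candidate) for preference in preferences)


def mutual_match(a, b):
    """True when each party meets the other's preferences."""
    return satisfies(b, a["preferences"]) and satisfies(a, b["preferences"])


print(mutual_match(driver, passenger))  # True
```

Checking one direction corresponds to an ordinary search with one party's preferences; checking the other corresponds to asking which stored preferences the party's own document satisfies.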