Jesse Yates, Salesforce.com • Secondary Indexing: the discussion so far… • 9/11/12 HBase Pow-wow
Problem • HBase rows are multi-dimensional • Only sorted on the row key • How do you efficiently look up deeper into the row key?
Example • How do we find all people with the last name ‘Ruth’? • Full table scan!
Indexing! • Store the property we need to search for as the row key of the index • Keep a pointer back to the primary row • Fast lookup: O(lg n)
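To make the contrast with a full table scan concrete, here is a minimal sketch in plain Python. It does not use the HBase API: a dict stands in for the primary table, a sorted list of (value, row key) pairs stands in for the index, and the sample rows and function names are made up for illustration.

```python
import bisect

# Primary "table": row key -> columns. Hypothetical sample data.
people = {
    "row1": {"first": "Babe", "last": "Ruth"},
    "row2": {"first": "Hank", "last": "Aaron"},
    "row3": {"first": "Dr.", "last": "Ruth"},
}

def full_scan(table, column, value):
    """O(n): inspect every row in the primary table."""
    return sorted(key for key, cols in table.items() if cols.get(column) == value)

def build_index(table, column):
    """Index "table": sorted (indexed value, primary row key) pairs."""
    return sorted((cols[column], key) for key, cols in table.items() if column in cols)

def index_lookup(index, value):
    """O(lg n) seek to the first matching entry, then read the matches."""
    # (value,) sorts before every (value, key) pair, so this finds the first match.
    pos = bisect.bisect_left(index, (value,))
    keys = []
    while pos < len(index) and index[pos][0] == value:
        keys.append(index[pos][1])  # pointer back to the primary row
        pos += 1
    return keys

last_name_index = build_index(people, "last")
ruths = index_lookup(last_name_index, "Ruth")
```

Both paths return the same row keys; the index path only touches the entries that match.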
Use Cases • Point lookups • Volume of data influences the usefulness of an index • Let the user decide whether they need an index • Scan lookups • WHERE age > 16
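The scan case can be sketched the same way: because the index is sorted on the indexed value, a predicate like WHERE age > 16 becomes a seek followed by a sequential read. This is plain Python with made-up data and function names, not HBase client code.

```python
import bisect

def build_age_index(table):
    """Sorted (age, row key) pairs; a list stands in for an index table."""
    return sorted((cols["age"], key) for key, cols in table.items())

def scan_age_greater_than(index, threshold):
    """Row keys of all entries with age strictly greater than threshold."""
    # Seek to the first entry with age >= threshold ...
    pos = bisect.bisect_left(index, (threshold,))
    # ... then step past entries equal to it, since the predicate is strict.
    while pos < len(index) and index[pos][0] == threshold:
        pos += 1
    return [row_key for _, row_key in index[pos:]]

# Hypothetical sample rows.
people = {
    "row1": {"age": 12},
    "row2": {"age": 16},
    "row3": {"age": 17},
    "row4": {"age": 30},
    "row5": {"age": 16},
}
age_index = build_age_index(people)
older_than_16 = scan_age_greater_than(age_index, 16)
```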
Omid • Full transactional support • Centralized oracle
Lily • WAL implementation on top of HBase • 100-500 writes/sec
Percolator • Full transactions • Distributed, optimistic locking • ~10 sec latencies possible
Culvert • Async • Dead project, incomplete
http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html • Client-side coordinated index • Use timestamps to coordinate • Not yet implemented
Trend Micro Implementation • Still just a POC • ???
Solr/Lucene • Standard Lucene library bolted onto HBase • Not commonly used • Lots of formats/codecs already written
Considerations for HBase What do we need to do?
Built-in vs. external library vs. semi-supported (e.g. security)
Which should I use? • HBase experts write a single ‘right’ impl • Officially endorse a ‘correct’ version • What changes do we need to make? • How close to the core is the project? • Written in everywhere • hbase-index module • External library
Key Observation “Secondary indexing is inherently an easier problem than full transactions… secondary index updates are idempotent.” - Lars Hofhansl
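One way to read the idempotency point, sketched in plain Python: if an index entry is fully determined by the indexed value, the primary row key and a timestamp, then replaying the same update after a failure or retry leaves the index unchanged. The dict-based table, the empty cell value and the function name are assumptions for illustration, not HBase code.

```python
def apply_index_update(index_table, indexed_value, primary_key, timestamp):
    """Write one index cell (hypothetical helper).

    The cell's coordinates fully determine its content, so writing the
    same cell a second time changes nothing.
    """
    index_table[(indexed_value, primary_key, timestamp)] = b""

index_table = {}
apply_index_update(index_table, "Ruth", "row1", 1347321600)
after_first_write = dict(index_table)

# A retry (for example after a client timeout) replays the identical update.
apply_index_update(index_table, "Ruth", "row1", 1347321600)
after_retry = dict(index_table)
```

Contrast this with something like a counter increment, where a replay would change the result and would need transactional protection.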
Async vs. Synchronous vs. Transactional • We don’t need full transactions • Transactions are slow • Transactions fail with increasing probability as the number of servers increases • Optionally async or sync • Async • Inherently ‘dirty’ index • How does index cleanup work? • Inherently different for each type
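The slides leave the cleanup question open. As one possible answer, and only as a sketch under that assumption, a reader can treat the index as a hint: check every index hit against the primary row, drop the hits that no longer match, and remove the stale entries. Here the removal happens inline; in an async design it could be handed to a background task. The tables are plain Python dicts and the function name is hypothetical.

```python
def lookup_with_cleanup(index, primary, column, value):
    """Return (live row keys, stale row keys) for value, pruning stale entries."""
    live, stale = [], []
    # sorted() copies the candidates, so the index can be modified afterwards.
    for row_key in sorted(index.get(value, set())):
        if primary.get(row_key, {}).get(column) == value:
            live.append(row_key)
        else:
            stale.append(row_key)  # index entry outlived the data it points to
    for row_key in stale:
        index[value].discard(row_key)
    return live, stale

# Hypothetical state: row2 was re-written, but its old index entry was never removed.
primary = {"row1": {"last": "Ruth"}, "row2": {"last": "Gehrig"}}
index = {"Ruth": {"row1", "row2"}}
live, stale = lookup_with_cleanup(index, primary, "last", "Ruth")
```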
Where’s my data? • Extra columns vs. index table • HBase Region-pinning • Has to be best-effort or will decrease availability • Helps minimize RPC overhead • Cross-table region-pinning • Needs a coprocessor hook to be useful • HDFS block allocation • Keep index and data blocks on same HDFS node
How much data are we talking? “Seems like there are 3 categories of sparseness: • sparse indexes (like ipAddress) where a per-table approach is more efficient for reads • dense indexes (like eventType) where there are likely values of every index key on each region • very dense indexes (like male/female) where you should just be doing a table scan anyway” - Matt Corgan (9/10/12)
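The three categories can be turned into a rough rule of thumb. The sketch below is one interpretation, not part of the quoted discussion: it assumes values are evenly distributed, treats "every index key on each region" as at least one matching row per region on average, and uses an arbitrary cutoff (`scan_fraction`, an invented parameter) for when a table scan is the better choice.

```python
def classify_index(total_rows, distinct_values, num_regions, scan_fraction=0.25):
    """Heuristic density class for an index; all thresholds are assumptions."""
    rows_per_value = total_rows / distinct_values
    # A single value matches a large share of the table: just scan.
    if rows_per_value / total_rows >= scan_fraction:
        return "very dense"
    # On average every region holds at least one row for each value.
    if rows_per_value >= num_regions:
        return "dense"
    # Most regions hold no rows for a given value.
    return "sparse"

ip_address = classify_index(1_000_000, 500_000, 100)
event_type = classify_index(1_000_000, 50, 100)
gender = classify_index(1_000_000, 2, 100)
```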
Impact on implementation • Need a lot of knowledge of the data to pick the right kind of index • The user knows their data; let them do the hard work of picking indexes
Everyone’s got an impl already • We need to make HBase flexible enough to support (most) current indexing formats with minimal overhead for switching • Lucene style Codec/CodecProvider?
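To illustrate what a Lucene-style Codec/CodecProvider split might look like, here is a hypothetical sketch in Python. None of these names (`IndexCodec`, `NullSeparatedCodec`, `CodecProvider`) exist in HBase or Lucene; the idea is only that each indexing format supplies its own key encoding and the provider selects one by name.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple

class IndexCodec(ABC):
    """Hypothetical interface: how an index entry is laid out as a row key."""
    name = "abstract"

    @abstractmethod
    def encode(self, value: str, primary_key: str) -> bytes: ...

    @abstractmethod
    def decode(self, index_key: bytes) -> Tuple[str, str]: ...

class NullSeparatedCodec(IndexCodec):
    """Example format: indexed value, a zero byte, then the primary row key."""
    name = "null-separated"

    def encode(self, value, primary_key):
        return value.encode("utf-8") + b"\x00" + primary_key.encode("utf-8")

    def decode(self, index_key):
        value, _, primary_key = index_key.partition(b"\x00")
        return value.decode("utf-8"), primary_key.decode("utf-8")

class CodecProvider:
    """Registry that maps a configured name to a codec instance."""
    def __init__(self):
        self._codecs: Dict[str, IndexCodec] = {}

    def register(self, codec: IndexCodec) -> None:
        self._codecs[codec.name] = codec

    def lookup(self, name: str) -> IndexCodec:
        return self._codecs[name]

provider = CodecProvider()
provider.register(NullSeparatedCodec())
codec = provider.lookup("null-separated")
index_key = codec.encode("Ruth", "row1")
```

Switching formats would then mean registering a different codec, without touching the code that reads and writes the index.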
What should it look like? • Minimal changes to the top-level interfaces • Add a single new flag? • Configuration based? • Enough that the user gets to be smart about what should be used • We can’t get all cases right – just provide building blocks • Automatically use an index? • Scanner/Filter style use?
Properties for the client • Should the user even see the index lookups? • ACID? • Ordering of results? • Support the current sorted order? • Batch lookup? • Implications on current features • Replication • Splitting
Schema(less) • Schema enforced? • Rigid usage of index matching an expected schema? • Schema table? Reserved schema columns? .META.? • Schema-less • Let the user apply whatever they think and use only what actually works • Best-effort • Use client-hinted schema and try to apply all the known indexes
My random thoughts…. • Client-side managed indexes are efficient • Minimal RPC overhead • Cleanup is async to client and rarely misses • Solves the cross-region/server problem • Region-pinning is a nice-to-have optimization • Scales without concern for locality • Flexible enough to support custom codecs • Can be built to provide server-side optimizations • Locality aware indexes to minimize RPCs