• HDFS – Hadoop Distributed File System, modeled on the Google File System (GFS)
• Hadoop MapReduce – similar to Google MapReduce
• HBase – similar to Google Bigtable
• Master: hadoop01.cselabs.umn.edu
• Slaves: hadoop02 – hadoop05.cselabs.umn.edu
• You need a cselabs account to access this cluster.
• You can log in to any of these machines from any cs/cselabs machine.
• Data is divided into tables
• A table is composed of columns; columns are grouped into column families
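The table / column-family / column hierarchy can be pictured as a nested map. The sketch below is a toy in-memory model in Python, not the HBase API: the class `ToyTable`, its methods, and the `family:qualifier` column strings are illustrative names chosen here. It assumes column families are declared when the table is created, while columns inside a family can be added freely.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column family -> column qualifier -> value."""

    def __init__(self, name, families):
        self.name = name
        # Assumption: families are declared up front, as part of the schema.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value):
        # A column is addressed as "family:qualifier".
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, column):
        family, qualifier = column.split(":", 1)
        # .get() avoids creating empty rows when reading a missing key.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


users = ToyTable("users", ["info", "stats"])
users.put("alice", "info:email", "alice@example.org")
users.put("alice", "stats:logins", 3)
print(users.get("alice", "info:email"))
```

Grouping columns under a family prefix is what lets a store keep each family's data together; the toy model only mirrors the naming, not the storage layout.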
• Partitioning
  • A table is horizontally partitioned into regions; each region holds a contiguous range of row keys
  • Each region is managed by a RegionServer; a single RegionServer may hold multiple regions
• Persistence and data availability
  • HBase stores its data in HDFS. It does not replicate RegionServers and relies on HDFS replication for data availability.
  • Region data is cached in memory
  • Updates and reads are served from the in-memory cache (MemStore)
  • The MemStore is flushed periodically to HDFS
  • A write-ahead log (stored in HDFS) is used for durability of updates
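The two ideas above, range partitioning and the log-then-buffer write path, can be sketched with a small Python toy. Nothing here is HBase code: `ToyRegionMap`, `ToyRegion`, the server names, and the count-based `flush_threshold` are assumptions made for illustration (this sketch flushes after a fixed number of buffered rows simply to keep the example short), and plain Python lists and dicts stand in for files in HDFS.

```python
import bisect


class ToyRegionMap:
    """Maps a row key to the server holding the region that covers it."""

    def __init__(self, split_keys, servers):
        # N sorted split keys define N+1 regions:
        # (-inf, k0), [k0, k1), ..., [k_last, +inf)
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server entry per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # bisect_right puts a key equal to a split key into the region
        # that starts at that split key (start inclusive, end exclusive).
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]


class ToyRegion:
    """Write path sketch: append to the log, update memory, flush later."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the write-ahead log in HDFS
        self.memstore = {}     # in-memory buffer of recent updates
        self.store_files = []  # stands in for flushed files in HDFS
        self.flush_threshold = flush_threshold  # hypothetical, count-based

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # log first, for durability
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered rows out in key order, then start a new buffer.
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row_key):
        if row_key in self.memstore:
            return self.memstore[row_key]
        for store_file in reversed(self.store_files):  # newest file first
            if row_key in store_file:
                return store_file[row_key]
        return None


regions = ToyRegionMap(["g", "n"], ["rs-a", "rs-b", "rs-c"])
print(regions.locate("apple"), regions.locate("g"), regions.locate("zebra"))
```

Because the log entry is appended before the in-memory update, a crashed server's unflushed updates could in principle be rebuilt by replaying the log; the toy does not implement that replay.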
• The HBase shell provides interactive commands for manipulating the database
  • Create/delete tables
  • Insert/update/read from tables
  • Manage regions
• HBase provides single-row atomic operations
  • checkAndPut – similar to test-and-set
  • checkAndDelete
• All row operations are atomic, no matter how many columns are involved.
• HBase also provides row-level exclusive locks
  • You can use these locks to implement single-row transactions
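To make the test-and-set analogy concrete, here is a minimal Python sketch of check-and-put semantics built on a per-row lock. It is a toy model, not the HBase client API: `ToyAtomicTable`, `check_and_put`, and the convention that `expected=None` means "the column must be absent" are choices made for this example.

```python
import threading


class ToyAtomicTable:
    """Toy table where each row is guarded by its own exclusive lock."""

    def __init__(self):
        self._rows = {}
        self._row_locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, row_key):
        with self._guard:
            return self._row_locks.setdefault(row_key, threading.Lock())

    def put(self, row_key, column, value):
        with self._lock_for(row_key):
            self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        with self._lock_for(row_key):
            return self._rows.get(row_key, {}).get(column)

    def check_and_put(self, row_key, check_column, expected, put_column, value):
        # The check and the write happen under the same row lock, so no
        # other writer to this row can slip in between them.
        with self._lock_for(row_key):
            current = self._rows.get(row_key, {}).get(check_column)
            if current != expected:
                return False
            self._rows.setdefault(row_key, {})[put_column] = value
            return True


table = ToyAtomicTable()
# Only the first caller that sees "cf:owner" absent gets to claim the row.
print(table.check_and_put("job-1", "cf:owner", None, "cf:owner", "worker-A"))
print(table.check_and_put("job-1", "cf:owner", None, "cf:owner", "worker-B"))
```

The same pattern (read a column, compare, then write, all while holding the row lock) is how a single-row transaction could be assembled from an exclusive row lock.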
• HBase stores multiple versions of a column in a row. Each version is identified by an integer timestamp.
• By default, the system time is used as the version timestamp; however, the user can specify a logical timestamp instead.
• Each update to a row creates a new version of the specified column.
• A version can be accessed or deleted using its timestamp. HBase also lets you list all versions.
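The versioning rules above can be illustrated with a toy cell that keeps one value per timestamp. This is a Python sketch, not the HBase API: `ToyVersionedCell` and its method names are hypothetical, and using the system clock in milliseconds as the default timestamp is an assumption of the sketch.

```python
import time


class ToyVersionedCell:
    """One column of one row, holding a value per integer timestamp."""

    def __init__(self):
        self._versions = {}  # timestamp -> value

    def put(self, value, timestamp=None):
        # Default to system time; callers may pass a logical timestamp.
        ts = int(time.time() * 1000) if timestamp is None else timestamp
        self._versions[ts] = value
        return ts

    def get(self, timestamp=None):
        if not self._versions:
            return None
        if timestamp is None:
            return self._versions[max(self._versions)]  # newest version
        return self._versions.get(timestamp)

    def delete(self, timestamp):
        self._versions.pop(timestamp, None)

    def versions(self):
        # Newest first, as (timestamp, value) pairs.
        return sorted(self._versions.items(), reverse=True)


cell = ToyVersionedCell()
cell.put("draft", timestamp=1)
cell.put("final", timestamp=2)
print(cell.get())        # newest version
print(cell.get(1))       # a specific older version
print(cell.versions())   # all versions, newest first
```

Logical timestamps such as 1 and 2 make version order independent of clock skew between writers, which is one reason a caller might supply them instead of relying on system time.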
• Hadoop home - http://hadoop.apache.org/
• HBase - http://hbase.apache.org/
• API
  • http://hbase.apache.org/apidocs/
  • http://hadoop.apache.org/