Multi-dimensional Index on Hadoop Distributed File System

Multi-dimensional Index on Hadoop Distributed File System Haojun Liao, Jizhong Han and Jinyun Fang - Vikas Gonti

Contents • Introduction • Design Overview • Details of Implementation • Experimental Evaluation • Conclusion and Future work

Introduction • Why do we need to improve query performance? • Large, voluminous data • Cost • Scalability • Spatial access methods

HDFS • HDFS is an open source implementation of DFS

Index Structure in HDFS • HDFS offers an effective way of manipulating and maintaining an index, as in a single-processor environment. • No significant modifications of the original index structure are required for it to function properly. • Query processing using an index can be integrated with the MapReduce framework to improve query efficiency. • Answering queries through access methods can be more efficient than sequential scanning.

Design Overview • R-Tree-like Index Structure • The R-tree, a disk-based hierarchical structure based on the B-tree, is one of the most commonly used multi-dimensional indexes. • Each R-tree node is implemented as a disk page. • All index nodes in an index are of the same size.

Structure of Index Frame

Query Processing • Only a small portion of the index nodes is involved in handling high-selectivity query types, i.e., point queries, range queries, and nearest neighbor queries.
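To illustrate why a selective query touches few nodes, here is a minimal in-memory sketch of an R-tree range query that records which nodes it visits. The `Node` class, the `intersects` helper and the rectangle encoding `(xmin, ymin, xmax, ymax)` are assumptions made for this example; they are not taken from the presentation.

```python
from dataclasses import dataclass, field

def intersects(a, b):
    """True if rectangles a and b, given as (xmin, ymin, xmax, ymax), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    mbr: tuple                                    # bounding rectangle of the subtree
    children: list = field(default_factory=list)  # child nodes (internal node)
    objects: list = field(default_factory=list)   # (mbr, object_id) pairs (leaf node)

def range_query(node, window, visited):
    """Collect ids of objects overlapping `window`; append each visited node to `visited`."""
    visited.append(node)
    hits = [oid for mbr, oid in node.objects if intersects(mbr, window)]
    for child in node.children:
        # Subtrees whose MBR misses the window are pruned without being read.
        if intersects(child.mbr, window):
            hits.extend(range_query(child, window, visited))
    return hits

# Hypothetical two-level tree: one root and four leaves.
leaves = [
    Node((0, 0, 4, 4), objects=[((1, 1, 2, 2), "a1"), ((3, 3, 4, 4), "a2")]),
    Node((6, 0, 10, 4), objects=[((7, 1, 8, 2), "b1")]),
    Node((0, 6, 4, 10), objects=[((1, 7, 2, 8), "c1")]),
    Node((6, 6, 10, 10), objects=[((7, 7, 8, 8), "d1")]),
]
root = Node((0, 0, 10, 10), children=leaves)

visited = []
result = range_query(root, (0.5, 0.5, 2.5, 2.5), visited)
print(result, len(visited))
```

In this toy tree the small query window overlaps a single leaf, so only the root and that leaf are read out of five nodes.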

Details of Implementation • Buffer Management • Node Size • Index Structure • Query Optimization issues • Data Transfer Model

Details of Implementation • Buffer Management • Different strategies are applied to the internal nodes and the leaf nodes of the index. • Internal nodes account for a small fraction of the total space. • All internal nodes are kept in the buffer once loaded. • They are accessed frequently. • In contrast, only a limited number of buffer pages are allocated for leaf nodes. • Leaf nodes take up a relatively large space. • They are visited on demand. • More buffer for leaf nodes decreases disk accesses and reduces the response time.

Buffer Management (cont.) • Buffer strategy • Internal nodes are pinned in the buffer once loaded, for future node accesses. • Leaf nodes are allocated a limited number of buffer pages, evenly distributed across the data nodes and managed by an LRU policy. • Data transfer procedure • When a data request arrives, the data node checks its buffer first. • If the required data packet is held in the buffer, it is sent to the client. • Otherwise, a disk access is invoked.
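The buffer strategy above can be sketched as a small class that pins internal nodes and keeps leaf nodes in an LRU-managed pool. This is an illustrative single-process sketch, not the authors' implementation: the class name `NodeBuffer`, the `read_from_disk` callback and the `disk_reads` counter are hypothetical names introduced here.

```python
from collections import OrderedDict

class NodeBuffer:
    """Pins internal nodes forever; keeps at most `leaf_capacity` leaf nodes (LRU)."""

    def __init__(self, leaf_capacity, read_from_disk):
        self.pinned = {}               # internal nodes, never evicted
        self.leaves = OrderedDict()    # leaf nodes, least recently used first
        self.leaf_capacity = leaf_capacity
        self.read_from_disk = read_from_disk  # assumed callback: node_id -> node data
        self.disk_reads = 0

    def _load(self, node_id):
        self.disk_reads += 1
        return self.read_from_disk(node_id)

    def get(self, node_id, is_internal):
        if is_internal:
            if node_id not in self.pinned:
                self.pinned[node_id] = self._load(node_id)
            return self.pinned[node_id]
        if node_id in self.leaves:
            self.leaves.move_to_end(node_id)   # mark as most recently used
            return self.leaves[node_id]
        data = self._load(node_id)             # buffer miss: go to disk
        self.leaves[node_id] = data
        if len(self.leaves) > self.leaf_capacity:
            self.leaves.popitem(last=False)    # evict the least recently used leaf
        return data

buf = NodeBuffer(leaf_capacity=2, read_from_disk=lambda nid: f"node-{nid}")
buf.get(0, is_internal=True)    # disk read, then pinned
buf.get(0, is_internal=True)    # served from the buffer
for leaf_id in (10, 11, 10, 12, 11):
    buf.get(leaf_id, is_internal=False)
print(buf.disk_reads, list(buf.leaves))
```

With two leaf pages, the access sequence 10, 11, 10, 12, 11 evicts leaf 11 when 12 arrives (10 was touched more recently), so the final access to 11 goes back to disk.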

Node Size • Many factors are involved in determining the index node size, such as transfer overhead, I/O costs and CPU time. • Small index nodes incur more data transmissions. • Larger index nodes reduce the data transfer cost. • Larger nodes require more data to be processed per query. • The size of the index node should be aligned to the size of the data packet, the minimum unit of data transfer in HDFS. • This avoids transmitting unnecessary data.
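The alignment argument can be made concrete with a little arithmetic. The sketch below assumes that whole packets are the unit of transfer, as the slide states, and counts how many bytes travel for a single node read; the function names and the example sizes are illustrative assumptions.

```python
def packets_touched(offset, node_size, packet_size):
    """Number of packets that overlap the byte range [offset, offset + node_size)."""
    first = offset // packet_size
    last = (offset + node_size - 1) // packet_size
    return last - first + 1

def transferred_bytes(offset, node_size, packet_size):
    """Bytes sent when only whole packets can be transferred."""
    return packets_touched(offset, node_size, packet_size) * packet_size

KB = 1024
packet = 64 * KB

# Node as large as a packet and aligned to a packet boundary: no wasted bytes.
print(transferred_bytes(0, 64 * KB, packet))       # 65536
# Same node size but starting 1 KB into a packet: two packets are sent.
print(transferred_bytes(1 * KB, 64 * KB, packet))  # 131072
# Small 8 KB node: a full 64 KB packet is sent for 8 KB of useful data.
print(transferred_bytes(0, 8 * KB, packet))        # 65536
```

An unaligned or undersized node makes the data node ship bytes the client never asked for, which is the overhead that aligning node size to packet size removes.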

Index Structure • Metadata is located at the front of the file and occupies 1 KB of disk space. • The internal nodes, together with the metadata, need to fit in the first data chunk. • The remaining space in the first data chunk of the index file is left blank for the extension of internal nodes.

Index Structure (cont.) • Leaf nodes are grouped into several data chunks according to their spatial proximity. • A leaf node that would cross a chunk boundary is aligned to the start of the next data chunk. • In an internal node, each entry includes the sub-tree identifier and the MBR (minimum bounding rectangle). • In a leaf node, each entry includes the MBR and a pointer to the corresponding data object.
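One possible reading of this layout is sketched below as an offset calculation. Only the 1 KB metadata size comes from the slides; the function `layout`, the rule that leaf nodes start in the second chunk, and the example node and chunk sizes are assumptions made for illustration.

```python
METADATA_SIZE = 1024  # from the slides: metadata occupies 1 KB at the front of the file

def layout(num_internal, num_leaves, node_size, chunk_size):
    """Return (internal_offsets, leaf_offsets) in bytes for a hypothetical index file."""
    if node_size > chunk_size:
        raise ValueError("a node must fit in one chunk")
    internal_end = METADATA_SIZE + num_internal * node_size
    if internal_end > chunk_size:
        raise ValueError("metadata and internal nodes must fit in the first chunk")
    internal_offsets = [METADATA_SIZE + i * node_size for i in range(num_internal)]

    leaf_offsets = []
    offset = chunk_size  # assumption: the rest of chunk 0 stays blank, leaves start in chunk 1
    for _ in range(num_leaves):
        chunk_end = (offset // chunk_size + 1) * chunk_size
        if offset + node_size > chunk_end:
            offset = chunk_end  # leaf would straddle two chunks: move it to the next chunk
        leaf_offsets.append(offset)
        offset += node_size
    return internal_offsets, leaf_offsets

internal, leaf = layout(num_internal=2, num_leaves=3, node_size=3000, chunk_size=8192)
print(internal)  # [1024, 4024]
print(leaf)      # [8192, 11192, 16384]
```

The third leaf would otherwise span bytes 14192 to 17191 and cross the chunk boundary at 16384, so it is placed at the start of the next chunk.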

Query Optimization Issues • Ordered entries speed up the in-memory search procedure by reducing the amount of computation. • Once the entries of each index node are sorted according to some spatial criterion, the order can be well preserved in a static environment. • Point and range query processing within each node costs O(n/2+w) time, where n is the number of entries in each node and w is the cardinality of the object set.
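The slides do not say which spatial criterion is used for sorting. As one plausible illustration, the sketch below assumes entries are sorted by the lower x-bound of their MBR, which lets the scan stop as soon as an entry starts to the right of the query window. The function name and the entry encoding are hypothetical.

```python
def search_sorted_node(entries, window):
    """Scan entries sorted ascending by MBR xmin; stop early once xmin passes the window.

    entries: list of ((xmin, ymin, xmax, ymax), ref)
    window:  (xmin, ymin, xmax, ymax)
    Returns (matching refs, number of entries examined).
    """
    wxmin, wymin, wxmax, wymax = window
    hits, examined = [], 0
    for (xmin, ymin, xmax, ymax), ref in entries:
        if xmin > wxmax:
            break  # every later entry starts even further right, so none can match
        examined += 1
        if xmax >= wxmin and ymin <= wymax and ymax >= wymin:
            hits.append(ref)
    return hits, examined

entries = [
    ((0, 0, 1.5, 1), "a"),
    ((2, 0, 3, 1), "b"),
    ((4, 0, 5, 1), "c"),
    ((6, 0, 7, 1), "d"),
    ((8, 0, 9, 1), "e"),
]
print(search_sorted_node(entries, (1, 0, 3, 10)))  # (['a', 'b'], 2)
```

Without the ordering, all five entries would have to be tested; with it, the scan here stops after two, which is the kind of saving the per-node cost estimate reflects.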

Data Transfer Model • By default, the data node pushes data to the client in data packets of 64 KB to obtain the best performance for sequential reads. • We implement a new data transfer protocol to facilitate random reads for the block-based index structure. • The main difference from the push model is that the data node blocks after transmitting one data packet. • The data node blocks in favor of random access, where the client might not need the subsequent sequential data packets.
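The contrast between the two models can be sketched with Python generators standing in for the data node. This is an in-process illustration only and does not reproduce the actual HDFS wire protocol; `push_stream` and `pull_server` are hypothetical names, and the tiny 4-byte packet size is chosen so the output is easy to read.

```python
def push_stream(block, start, packet_size):
    """Push model: keep sending consecutive packets from `start` to the end of the block."""
    for off in range(start, len(block), packet_size):
        yield block[off:off + packet_size]

def pull_server(block, packet_size):
    """Blocking model: send one packet per request, then wait for the next requested offset."""
    offset = yield                      # block until the first request arrives
    while offset is not None:
        offset = yield block[offset:offset + packet_size]

block = bytes(range(16))
packet_size = 4

# Push: a read at offset 8 also streams the following packet, needed or not.
pushed = list(push_stream(block, 8, packet_size))
print(len(pushed))                      # 2

# Blocking: the client asks for exactly the packets it needs, in any order.
server = pull_server(block, packet_size)
next(server)                            # start the generator; it now waits for a request
first = server.send(8)
second = server.send(0)
server.close()
print(list(first), list(second))        # [8, 9, 10, 11] [0, 1, 2, 3]
```

For index traversal the client typically jumps between non-adjacent nodes, so sending one packet per request avoids shipping sequential packets that would be discarded.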

Data Transfer Model (cont.) • [Figure: our transfer model compared with the original push model]

Experimental Evaluation • Datasets • The following real-world datasets are used in the experiments. • CAR contains 2,249,727 road segments extracted from the TIGER/Line datasets. • HYD contains 40,995,718 line segments representing rivers of China. • TLK contains up to 157,425,887 points extracted from the elevation data of China.

Data transfer overhead • The amount of transferred data is compared with the amount of required data by varying the size and count of read operations on the HYD dataset.

Performance • This result is based on the TLK dataset. The data packet size during transfer ranges from 8k to 64k. • The new protocol offers the best performance for all access patterns when the data packet size is set to 64k.

Effects of index node size • Response time of range queries and point queries on the CAR dataset. • Range query performance is worst when the index node size is set to 16k, and it degrades rapidly as the query window grows. • The response time for point queries is longest when the index node size is set to 8k, because with smaller nodes the index has more levels and more node visits are involved.

Effects of Buffer • The buffer size is varied to evaluate its effect on query performance, while the other parameters are kept constant. • We perform 100 range queries, each covering 1% of the total space, on the TLK dataset. • The response time decreases as the available buffer increases. Further improvement in response time can be expected as the buffer size increases.

Conclusion & Future work • Conclusion • We propose a method for organizing hierarchical structures on HDFS that applies to both the B-tree and the R-tree. • We investigate several system parameters such as node size, index distribution, buffer, and query processing techniques. • A data transfer protocol designed for block-wise random reads is integrated with HDFS. • Future work • Investigate how to combine MapReduce with the index structure. • Explore an efficient multi-dimensional data distribution strategy based on the index structure to further enhance I/O performance.

Questions ? -- Thank You
