Introduction to Apache Hadoop HDFS
This presentation introduces Apache Hadoop HDFS. It describes the HDFS file system in the context of Hadoop and big data, and looks at its architecture and resilience.
Apache Hadoop HDFS • What is it? • What is it for? • Architecture • Resilience • Administration • Data access • Future changes?
HDFS – What is it? • HDFS = Hadoop Distributed File System • A distributed file system • Runs on low-cost commodity hardware • Open source, written in Java • Fault tolerant • Designed for very large data sets • Tuned for high throughput
HDFS – What is it for? • Designed for batch processing • Streaming access to data • Large data sizes, i.e. terabytes • Highly reliable via data replication • Supports very large node clusters • Supports large files • Supports millions of files
HDFS – Architecture • Master/slave architecture • A master NameNode • Manages the file system namespace and metadata • Maps data blocks to DataNodes • Logs all changes • Slave DataNodes • Store file blocks • Store replicated data
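The master/slave split above can be illustrated with a small Python sketch (this is not HDFS code; `ToyNameNode` and the round-robin placement are invented for illustration): the NameNode holds only metadata, a map from each block to the DataNodes storing its replicas.

```python
# Toy illustration of the NameNode's block map (not real HDFS code).
# The NameNode stores only metadata: which DataNodes hold each block.
from itertools import cycle

class ToyNameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}          # block id -> list of DataNodes
        self._rr = cycle(datanodes)  # simplistic round-robin placement

    def add_block(self, block_id):
        # Pick `replication` distinct DataNodes to hold the new block.
        replicas = []
        while len(replicas) < min(self.replication, len(self.datanodes)):
            node = next(self._rr)
            if node not in replicas:
                replicas.append(node)
        self.block_map[block_id] = replicas
        return replicas

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
print(nn.add_block("blk_001"))   # three distinct DataNodes
```

Real HDFS placement is rack-aware rather than round-robin, but the division of labour is the same: clients ask the NameNode where a block lives, then read the bytes directly from the DataNodes.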
HDFS – Resilience • Data is replicated across DataNodes • Nodes may fail but data is still available • DataNodes report their state via heartbeat messages • The master NameNode is a single point of failure • Data integrity via checksums
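The checksum point above can be sketched in a few lines of Python. HDFS internally keeps CRC32 checksums for chunks of each block and verifies them on read; the function names below are invented for this sketch.

```python
import zlib

# Minimal sketch of checksum-based integrity checking, similar in
# spirit to HDFS's per-chunk CRC32 checksums (names are invented).
def store_with_checksum(data: bytes):
    # Persist the data together with its CRC32 checksum.
    return data, zlib.crc32(data)

def verify(data: bytes, checksum: int) -> bool:
    # Recompute the checksum and compare against the stored value.
    return zlib.crc32(data) == checksum

data, crc = store_with_checksum(b"block contents")
print(verify(data, crc))          # True: intact replica passes
print(verify(b"corrupted!", crc)) # False: corruption is detected
```

When HDFS detects a corrupt replica this way, it can serve the read from another replica and re-replicate a good copy, which is how replication and checksums combine to give resilience.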
HDFS – Administration • Access via Java API • FS Shell command line • HTTP browser interface • C wrapper for the Java API • Space reclamation • Via control of the replication factor • Deleted files moved to a trash folder • Trash folder purged after a configurable time
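The trash behaviour above (deletes move files into a trash folder, which is purged after a configurable retention interval, set in HDFS via the `fs.trash.interval` property) can be modelled with a toy Python class; `ToyTrash` and its methods are invented for illustration.

```python
# Toy model of HDFS-style trash: a delete moves the file into trash,
# and an expunge pass permanently removes entries older than a
# configurable retention interval. Names are invented for this sketch.
class ToyTrash:
    def __init__(self, interval):
        self.interval = interval   # retention period
        self.entries = {}          # path -> time of deletion

    def delete(self, path, now):
        self.entries[path] = now   # moved to trash, not yet removed

    def expunge(self, now):
        # Permanently remove entries that have exceeded the interval.
        expired = [p for p, t in self.entries.items()
                   if now - t >= self.interval]
        for p in expired:
            del self.entries[p]
        return expired

trash = ToyTrash(interval=60)
trash.delete("/user/a/old.txt", now=0)
trash.delete("/user/a/new.txt", now=50)
print(trash.expunge(now=60))   # ['/user/a/old.txt']
```

A file still in trash can simply be moved back out to restore it; only after the expunge pass is the space actually reclaimed.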
HDFS – Future changes • Things that might be considered for HDFS • File append • User quotas • File links • Standby nodes
Other Areas • Want to know more about? • Big Data • Nutch • Solr • See my other presentations
Contact Us • Feel free to contact us at • www.semtech-solutions.co.nz • info@semtech-solutions.co.nz • We offer IT project consultancy • We are happy to hear about your problems • You pay only for the hours you need to solve them