
excelonlineclasses.co.nr/ excel.onlineclasses@gmail

http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com. Excel Online Classes offers the following services: Online Training, Development, Testing, Job Support, Technical Guidance, Job Consultancy, and any needs of the IT sector. Nagarjuna K. HDFS.


Presentation Transcript


  1. http://www.excelonlineclasses.co.nr/ excel.onlineclasses@gmail.com

  2. Excel Online Classes offers the following services: • Online Training • Development • Testing • Job Support • Technical Guidance • Job Consultancy • Any needs of the IT sector

  3. Nagarjuna K • HDFS

  4. HDFS • Distributed file system (FS) designed to run on commodity hardware • Provides high-throughput access to application data; suitable for applications with large datasets

  5. Assumptions & Goals • Hardware failure • Streaming data access • Large datasets • Simple coherency model • Moving computation is cheaper than moving data

  6. Assumptions & Goals: Hardware Failure • An HDFS instance spans many machines • Each machine stores part of the data • The chance that some machine goes down can’t be avoided • Fault detection and automatic recovery are a core architectural goal of HDFS

  7. Assumptions & Goals: Streaming Data Access • HDFS is designed for batch processing rather than interactive use • The emphasis is on data throughput • Not on low-latency data access

  8. Assumptions & Goals: Streaming Data Access • HDFS is built on the idea of a “write once, read many times” pattern • Over time, a data set is generated and placed in HDFS • Analysis is done on a large part of the data, rather than on the first few records • The time to read the whole data set matters more than the latency of retrieving the first or the last record

  9. Assumptions & Goals: Large Datasets • A typical file ranges from gigabytes to terabytes in size

  10. Assumptions & Goals: Simple Coherency Model • HDFS is built on the idea of a “write once, read many times” pattern • This assumption enables high-throughput data access

  11. Assumptions & Goals: Moving Computation or Data? • Computation-intensive programming • Data-intensive programming

  12. Where HDFS doesn’t fit • Low latency data access • Lots of small files • Multiple writers, arbitrary file modifications

  13. Where HDFS doesn’t fit • Low latency data access • HDFS access has high latency • Lots of small files • Each file (say 10 KB in size) takes up its own block in HDFS • Compress • All the metadata is stored in NameNode memory
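
To illustrate why lots of small files are a poor fit, the sketch below estimates how much memory the metadata would need. The figure of roughly 150 bytes per file or block entry is a commonly quoted rule of thumb and is not taken from the slides; it and the function name `estimate_metadata_bytes` are assumptions made for illustration.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb per file or block entry
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB block size quoted in the slides


def estimate_metadata_bytes(num_files, file_size,
                            block_size=BLOCK_SIZE,
                            bytes_per_object=BYTES_PER_OBJECT):
    """Rough in-memory metadata cost for num_files files of file_size bytes each."""
    blocks_per_file = math.ceil(file_size / block_size)
    # One entry for the file itself plus one entry per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# One million 10 KB files versus the same bytes stored as a single large file.
small_files = estimate_metadata_bytes(1_000_000, 10 * 1024)
one_big_file = estimate_metadata_bytes(1, 1_000_000 * 10 * 1024)
print(small_files, one_big_file)
```

Under these assumptions the same amount of data costs several orders of magnitude more metadata memory when it is spread over small files.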

  14. Where HDFS doesn’t fit • Multiple writers, arbitrary file modifications • A single writer writes a file in HDFS, appending only at the end. Multiple writers to the same file, or writes at an arbitrary offset, are not supported (currently)

  15. Blocks • A disk has a block size • The minimum amount of data that can be read or written • Typically 512 bytes • File system blocks are a few multiples of the disk block size • Typically a few KB

  16. Blocks • In a classical FS, a single block may contain data from only a single file • This leads to internal fragmentation • Newer file systems solve this problem by • block suballocation • tail merging

  17. Blocks • HDFS also has a block size • 64 MB by default • Unlike a normal FS, a file smaller than 64 MB doesn’t occupy a full 64 MB of underlying storage
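
The block rule above can be sketched as a small calculation: a file is cut into full 64 MB blocks, and the last block holds only the remaining bytes. The helper name `split_into_blocks` is hypothetical and this is an illustration, not HDFS code.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted in the slides


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The final block stores only the leftover bytes, not a full block.
        lengths.append(remainder)
    return lengths


MB = 1024 * 1024
print(split_into_blocks(10 * 1024))                        # a 10 KB file
print([n // MB for n in split_into_blocks(200 * MB)])      # a 200 MB file
```

A 10 KB file yields one 10 KB block, and a 200 MB file yields three full blocks plus one 8 MB block.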

  18. Why a big block size? • Throughput vs. latency • Time to seek to the start of a block • Time to read the whole block

  19. Why a big block size? • Seek time = 10 ms • Transfer rate (throughput) = 100 MB/s • To make the seek time 1% of the transfer time, block size = 100 MB • The default is 64 MB • As the transfer rate increases, the block size can be increased
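
The arithmetic behind this slide can be checked with a short sketch: if the seek time should be a given fraction of the transfer time, the transfer time follows from the seek time, and the block size is the transfer rate multiplied by that time. The function name is hypothetical.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_mb_s, seek_fraction):
    """Block size in MB at which seek time is seek_fraction of the transfer time."""
    transfer_time_s = seek_time_s / seek_fraction   # 10 ms / 1% = 1 s
    return transfer_rate_mb_s * transfer_time_s     # 100 MB/s * 1 s = 100 MB


# Figures from the slide: 10 ms seek, 100 MB/s transfer, seek = 1% of transfer.
print(block_size_for_seek_fraction(0.010, 100, 0.01))
# A faster disk (assumed 200 MB/s) justifies a proportionally larger block.
print(block_size_for_seek_fraction(0.010, 200, 0.01))
```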

  20. hadoop fsck / -files -blocks • Gives information about all the files and blocks in the file system • Replication status: under-replicated, over-replicated, etc. • Corrupt blocks, etc.

  21. HDFS Architecture [Diagram: the NameNode hosts the Name Space (NS) and Block Management; Block Storage comprises Block Management on the NameNode and Storage on the DataNodes]

  22. HDFS Architecture -- NameSpace • Name Space • Consists of directories, files, and blocks • Supports create, delete, modify, and list operations on files and directories

  23. HDFS Architecture -- Block Storage • Block Storage • Block Management • Tracks DataNode cluster membership • Supports create, delete, modify, and get-block-location operations • Manages replicas and their placement • Storage • Provides read and write access to blocks

  24. HDFS Architecture • NameSpace Volume = NameSpace + Blocks • Implemented using the NameNode (NN) and DataNodes (DNs) • The NameNode supports • Name Space • Block Management • Both are collocated in the NameNode • DataNodes are used for storing the block replicas • Block files are stored on the local file system

  25. Metadata in NameNode

  26. NameNode • Two main storage structures • fsimage • edit logs • On a new write request • the request is recorded in the edit log • the in-memory metadata is updated • the in-memory metadata is used to serve read requests

  27. NameNode -- fsimage • Serialized form of all the directory and file inodes in the system • An inode is the internal representation of a file’s metadata • file replication level • modification/access times • access permissions • block size • blocks the file is made up of

  28. NameNode -- fsimage • Doesn’t record the DataNodes on which blocks are present • The NameNode keeps this mapping in memory • The NameNode asks DataNodes for their block lists periodically • Hence the NameNode stays up to date

  29. DataNode • Periodically sends the following to the NameNode • Heartbeat • Block report
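
As a rough illustration of how block reports keep the in-memory block-to-DataNode mapping current, the sketch below rebuilds a DataNode's entries each time it reports. The class `BlockMap`, its methods, and the block and node names are hypothetical and do not mirror NameNode internals.

```python
from collections import defaultdict


class BlockMap:
    """In-memory map of block id -> set of DataNodes holding a replica."""

    def __init__(self):
        self.locations = defaultdict(set)

    def process_block_report(self, datanode, block_ids):
        # Forget what this DataNode held before, then record its current list.
        for holders in self.locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.locations[block_id].add(datanode)

    def holders(self, block_id):
        return sorted(self.locations.get(block_id, ()))


block_map = BlockMap()
block_map.process_block_report("dn1", ["blk_1", "blk_2"])
block_map.process_block_report("dn2", ["blk_1"])
print(block_map.holders("blk_1"))
# dn1 later reports that it no longer stores blk_1.
block_map.process_block_report("dn1", ["blk_2"])
print(block_map.holders("blk_1"))
```

Because the mapping is rebuilt from reports, it never needs to be written into the fsimage.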

  30. NameNode -- EditLogs • Keep on growing • So what? • EditLogs are stored on physical disk

  31. NameNode -- EditLogs

  32. Secondary NameNode • Asks the NN for the edits and fsimage files • Loads the fsimage into memory • Applies every operation in the edits file to the fsimage, producing a consolidated fsimage file • Sends this fsimage back to the NN

  33. NN & SNN • Thus the edits file on the NN stays small • The NN doesn’t carry the burden of merging the edit logs into the existing image
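
The checkpoint performed by the Secondary NameNode can be sketched as replaying an edit log onto a copy of the image. The operation names (`create`, `set_replication`, `delete`) and the dictionary layout are assumptions chosen for illustration; the real fsimage and edits files use their own serialized formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"replication": edit[2]}
    elif op == "set_replication":
        namespace[path]["replication"] = edit[2]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edits):
    """Return (merged_fsimage, empty_edits) without modifying the inputs."""
    merged = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edits:
        apply_edit(merged, edit)
    # After the merge the edit log can start over from empty.
    return merged, []


fsimage = {"/a": {"replication": 3}}
edits = [("create", "/b", 3), ("set_replication", "/a", 2), ("delete", "/b")]
new_image, new_edits = checkpoint(fsimage, edits)
print(new_image, new_edits)
```

Working on a copy mirrors the idea that the merge happens away from the NameNode, which only receives the finished image.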

  34. Communication b/w NN and DN/client • A DN or client connects through the configured TCP port of the NameNode • An RPC abstraction wraps the client and DN protocols • RPC: Remote Procedure Call

  35. Communication b/w NN and DN/client • The NameNode doesn’t initiate any RPC • It only responds to RPCs

  36. Robustness of HDFS • DataNode failures, heartbeats, replication [Diagram: NN and DN]
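
A minimal sketch of how missed heartbeats could lead to re-replication, assuming the NameNode records the time of each DataNode's last heartbeat. The timeout, the target replication factor, and all names here are illustrative assumptions rather than HDFS defaults.

```python
def find_dead_datanodes(last_heartbeat, now, timeout_s):
    """DataNodes whose most recent heartbeat is older than timeout_s seconds."""
    return sorted(dn for dn, seen in last_heartbeat.items() if now - seen > timeout_s)


def under_replicated_blocks(block_locations, dead, target_replication):
    """Map block id -> number of extra replicas needed once dead nodes are excluded."""
    needed = {}
    for block_id, holders in block_locations.items():
        live = set(holders) - set(dead)
        if len(live) < target_replication:
            needed[block_id] = target_replication - len(live)
    return needed


last_heartbeat = {"dn1": 100, "dn2": 40, "dn3": 98, "dn4": 101}
block_locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn1", "dn3", "dn4"},
}
dead = find_dead_datanodes(last_heartbeat, now=105, timeout_s=30)
print(dead)
print(under_replicated_blocks(block_locations, dead, target_replication=3))
```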

  37. Robustness of HDFS • Cluster rebalancing • Free space runs low on one DataNode • High demand for particular data

  38. Robustness of HDFS • Data integrity • Checksums of the data stored on DataNodes • If a client doesn’t receive correct data, it can opt to read from another replica
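
The fallback behaviour on this slide can be sketched as follows: the client compares a checksum of the bytes it received with the expected value and moves on to the next replica on a mismatch. CRC-32 from Python's `zlib` is used purely for illustration, and the function and node names are hypothetical.

```python
import zlib


def read_with_verification(replicas, expected_crc):
    """Return (datanode, data) for the first replica whose CRC-32 matches.

    replicas is an ordered list of (datanode_name, bytes) pairs.
    """
    for datanode, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return datanode, data
        # Checksum mismatch: treat this replica as corrupt and try the next one.
    raise IOError("all replicas failed checksum verification")


good = b"hello hdfs"
expected = zlib.crc32(good)
replicas = [("dn1", b"hellp hdfs"), ("dn2", good)]  # the dn1 copy is corrupted
print(read_with_verification(replicas, expected))
```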

  39. Robustness of HDFS • Metadata disk failure • NameNode • Secondary NameNode

  40. Data Organization • Data blocks • Write once / read many • Apt for large data sets • Files are chopped into 64 MB blocks • Each block resides on a different node if possible

  41. Data Organization • Staging • The client caches the data before writing it to a block • The NameNode inserts the file name into its metadata and allocates a block • The client flushes the cached data to a block on the DataNode specified by the NN

  42. Data Organization • Staging • Once the file is closed, the client informs the NN that no more data is coming • The NameNode commits the file creation operation to its persistent store • If the NameNode dies in this process….. ?

  43. Data Organization • Replication pipelining • The first DataNode receives data from the client in small portions (say 4 KB) • It writes each portion to its disk and forwards it to DN2 • DN2 does the same with DN3, which finally flushes the data to its own disk
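
The pipelining described above can be sketched with three chained nodes: each one stores the portion it receives and forwards it to the next node. The `DataNode` class here is a toy stand-in that keeps data in a memory buffer instead of on disk, and the 4 KB portion size follows the slide's example.

```python
PACKET_SIZE = 4 * 1024  # 4 KB portions, as in the slide's example


def packets(data, size=PACKET_SIZE):
    """Yield successive portions of data, each at most size bytes long."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


class DataNode:
    """Toy DataNode: stores each portion, then forwards it downstream."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()

    def receive(self, packet):
        self.stored.extend(packet)            # stand-in for writing to local disk
        if self.downstream is not None:
            self.downstream.receive(packet)   # forward to the next node in the pipeline


dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

data = bytes(range(256)) * 40  # 10 KB of sample data
for packet in packets(data):
    dn1.receive(packet)        # the client only ever talks to the first DataNode

print(len(dn1.stored), len(dn2.stored), len(dn3.stored))
```

The client sends each portion once, and the DataNodes themselves carry it down the chain.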


  45. File Permissions on HDFS • The client’s identity is determined by • the user name and groups under which it operates • A shared FS shouldn’t be used in a hostile environment • Going forward • Kerberos authentication

  46. Hadoop File Systems • HDFS is just one implementation of a Hadoop file system • org.apache.hadoop.fs.FileSystem • represents a file system in Hadoop

  47. Hadoop File Systems

  48. Hadoop File Systems

  49. DFShell • The HDFS shell can be invoked by: bin/hadoop dfs <args> • put • rm • rmr • setrep • stat • tail • test • text • cat • chgrp • chmod • chown • copyFromLocal • copyToLocal • cp • du • dus • expunge • get • getmerge • ls • lsr • mkdir • moveFromLocal • mv • touchz • nagarjuna@outlook.com

  50. Link Files in HDFS • No hard links • No soft links
