
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoop Tutorial | Edureka

This Hadoop Tutorial on Hadoop Interview Questions and Answers (Hadoop Interview Blog series: https://goo.gl/ndqlss) will help you prepare for Big Data and Hadoop interviews. Learn about the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers Tutorial:

Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop

Check our complete Hadoop playlist here: https://goo.gl/4OyoTW

#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview



Presentation Transcript


  1. EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop

  2. Big Data & Hadoop Market
  • According to Forrester, the Hadoop market will grow at a rate of 13% over the next 5 years, which is more than twice the predicted growth of general IT.
  • U.S. and international operations (29%) and enterprises (27%) lead the adoption of Big Data globally.
  • Asia Pacific is expected to be the fastest-growing Hadoop market, with a CAGR of 59.2%.
  • Companies are focusing on improving customer relationships (55%) and making the business more data-focused (53%).
  • (Chart: Hadoop market size, 2013-2016, growing at a CAGR of 58.2%)

  3. Hadoop Job Trends

  4. Agenda for Today: Hadoop Interview Questions
  • Big Data & Hadoop
  • HDFS
  • MapReduce
  • Apache Hive
  • Apache Pig
  • Apache HBase and Sqoop

  5. Big Data & Hadoop Interview Questions
  "The harder I practice, the luckier I get." – Gary Player

  6. Big Data & Hadoop Q. What are the five V’s associated with Big Data?

  7. Big Data & Hadoop Q. What are the five V’s associated with Big Data?
  • Volume: the sheer scale of data being generated
  • Velocity: the speed at which data is generated and processed
  • Variety: the different forms of data (structured, semi-structured and unstructured)
  • Veracity: the uncertainty or trustworthiness of the data
  • Value: the business value that can be derived from the data

  8. Big Data & Hadoop Q. Differentiate between structured, semi-structured and unstructured data.

  9. Big Data & Hadoop Q. Differentiate between structured, semi-structured and unstructured data.
  • Structured: organized data format; the data schema is fixed; example: RDBMS data.
  • Semi-structured: partially organized data; lacks the formal structure of a data model; examples: XML and JSON files.
  • Unstructured: un-organized data with an unknown schema; example: multimedia files.

  10. Big Data & Hadoop Q. How does Hadoop differ from a traditional processing system using RDBMS?

  11. Big Data & Hadoop Q. How does Hadoop differ from a traditional processing system using RDBMS?
  • Data: RDBMS relies on structured data and the schema of the data is always known. Any kind of data can be stored in Hadoop, be it structured, semi-structured or unstructured.
  • Processing: RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data in a distributed, parallel fashion.
  • Schema: RDBMS is based on 'schema on write', where schema validation is done before loading the data. Hadoop, on the contrary, follows a 'schema on read' policy.
  • Speed: in RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
  • Use case: RDBMS is suitable for OLTP (Online Transaction Processing); Hadoop is suitable for OLAP (Online Analytical Processing).
  • Cost: RDBMS is licensed software; Hadoop is an open-source framework.

  12. Big Data & Hadoop Q. Explain the components of Hadoop and their services.

  13. Big Data & Hadoop Q. Explain the components of Hadoop and their services.
  • HDFS (storage layer): NameNode, the master daemon that manages the DataNodes and the file system metadata; DataNode, the slave daemon that stores the actual blocks of data; Secondary NameNode, which periodically merges the edit logs with the FsImage.
  • YARN (processing layer): ResourceManager, the master daemon that manages and allocates cluster resources; NodeManager, the per-node slave daemon that launches and monitors containers.

  14. Big Data & Hadoop Q. What are the main Hadoop configuration files?

  15. Big Data & Hadoop Q. What are the main Hadoop configuration files?
  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
  • mapred-site.xml
  • masters
  • slaves

  16. HDFS Interview Questions
  "A person who never made a mistake never tried anything new." – Albert Einstein

  17. HDFS Q. HDFS stores data using commodity hardware, which has a higher chance of failure. So, how does HDFS ensure the fault tolerance of the system?

  18. HDFS Q. HDFS stores data using commodity hardware, which has a higher chance of failure. So, how does HDFS ensure the fault tolerance of the system?
  • HDFS replicates each block and stores the replicas on different DataNodes.
  • The default replication factor is set to 3.
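For reference, a minimal sketch (the file path is a placeholder, not taken from the slides) of inspecting and changing a file's replication factor through the HDFS Java API; the cluster-wide default of 3 comes from the dfs.replication property.

```java
// Sketch: read and change the replication factor of one HDFS file.
// The path below is a placeholder; dfs.replication controls the cluster-wide default (3).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/sample_hdfs/test.txt");
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication factor: " + current);
        fs.setReplication(file, (short) 2);   // ask the NameNode to keep only 2 replicas of this file
    }
}
```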

  19. HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem.

  20. HDFS Q. What is the problem in having lots of small files in HDFS? Provide one method to overcome this problem.
  Problem:
  • Too many small files = too many blocks = too much metadata on the NameNode.
  • Managing this huge amount of metadata is difficult, and the cost of seeks increases.
  Solution: Hadoop Archive (HAR)
  • It clubs small HDFS files into a single .har archive:
  hadoop archive -archiveName edureka_archive.har /input/location /output/location
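As a follow-up, a minimal sketch of reading the archive back through the har:// filesystem scheme; the paths are the placeholder ones from the command above, and it assumes the standard HarFileSystem that ships with Hadoop is available on the classpath.

```java
// Sketch: list the small files packed inside the archive created above via the har:// scheme.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The .har file itself lives in HDFS; the har:// scheme exposes the files inside it.
        Path archive = new Path("har:///output/location/edureka_archive.har");
        FileSystem harFs = FileSystem.get(URI.create(archive.toString()), conf);
        for (FileStatus status : harFs.listStatus(archive)) {
            System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
        }
    }
}
```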

  21. HDFS Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. Then, how many blocks will be created in total and what will be the size of each block?

  22. HDFS Q. Suppose there is a file of size 514 MB stored in HDFS (Hadoop 2.x) using the default block size configuration and the default replication factor. Then, how many blocks will be created in total and what will be the size of each block?
  • Default block size (Hadoop 2.x) = 128 MB
  • 514 MB / 128 MB ≈ 4.02, so the file is split into 5 blocks: four blocks of 128 MB and one block of 2 MB.
  • Default replication factor = 3
  • Total blocks = 5 × 3 = 15
  • Total storage consumed = 514 MB × 3 = 1542 MB
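The same arithmetic as a small, self-contained check; the helper below is purely illustrative and not part of Hadoop.

```java
// Illustrative helper: how HDFS splits a file into fixed-size blocks,
// with the last block holding the remainder.
public class BlockMath {
    static long numBlocks(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;   // ceiling division
    }

    public static void main(String[] args) {
        long blocks = numBlocks(514, 128);               // 5 blocks
        long lastBlockMb = 514 - (blocks - 1) * 128;     // 2 MB
        System.out.println(blocks + " blocks, last block = " + lastBlockMb + " MB");
        System.out.println("With replication factor 3: " + blocks * 3
                + " block replicas, " + 514 * 3 + " MB consumed");
    }
}
```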

  23. HDFS Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?

  24. HDFS Q. How do you copy a file into HDFS with a block size different from the existing block size configuration?
  • Suppose we want a block size of 32 MB = 33554432 bytes (default block size: 128 MB).
  • Command: hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /local/test.txt /sample_hdfs
  • Check the block size of test.txt: hadoop fs -stat %o /sample_hdfs/test.txt
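The same idea expressed through the Java API, as a minimal sketch; the paths are the placeholders from the command above.

```java
// Sketch: copy a local file into HDFS with an explicit 32 MB block size,
// then verify the block size actually recorded for the new file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 32L * 1024 * 1024);   // same as -Ddfs.blocksize=33554432
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/local/test.txt"), new Path("/sample_hdfs/test.txt"));
        long blockSize = fs.getFileStatus(new Path("/sample_hdfs/test.txt")).getBlockSize();
        System.out.println("Block size of test.txt: " + blockSize + " bytes");
    }
}
```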

  25. HDFS Q. What is a block scanner in HDFS?

  26. HDFS Q. What is a block scanner in HDFS?
  Note: this question is generally asked for Hadoop Admin positions.
  • The block scanner maintains the integrity of the data blocks.
  • It runs periodically on every DataNode to verify whether the data blocks stored there are correct or not.
  Steps when a corrupted block is found:
  1. The DataNode reports the corrupted block to the NameNode.
  2. The NameNode schedules the creation of new replicas using the good replicas.
  3. Once the replication factor (count of uncorrupted replicas) reaches the required level, the corrupted block is deleted.

  27. HDFS Q. Can multiple clients write into an HDFS file concurrently?

  28. HDFS Q. Can multiple clients write into an HDFS file concurrently?
  • No. HDFS follows a single-writer, multiple-reader model.
  • The client that opens a file for writing is granted a lease on the file by the NameNode.
  • The NameNode rejects write requests from other clients for a file that is currently being written by someone else.

  29. HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?

  30. HDFS Q. What do you mean by the High Availability of a NameNode? How is it achieved?
  • In Hadoop 1.x the NameNode used to be a single point of failure.
  • High Availability refers to the condition where a NameNode must remain active, i.e. available to the cluster, at all times.
  • The HDFS HA architecture in Hadoop 2.x allows us to have two NameNodes in an Active/Passive configuration.
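For illustration, a minimal client-side sketch of the standard Hadoop 2.x HA settings (normally placed in hdfs-site.xml); the nameservice id "mycluster", the NameNode ids and the hostnames are placeholders, not values from the slides.

```java
// Sketch of the core HA properties, set programmatically instead of in hdfs-site.xml.
import org.apache.hadoop.conf.Configuration;

public class HaClientConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");          // logical nameservice, not a single host
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");     // the Active/Passive pair
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // Lets clients discover which of the two NameNodes is currently active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```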

  31. MapReduce Interview Questions
  "Never tell me the sky’s the limit when there are footprints on the moon." – Author Unknown

  32. MapReduce Q. Explain the process of spilling in MapReduce.

  33. MapReduce Q. Explain the process of spilling in MapReduce.
  • The output of a map task is written into a circular memory buffer (RAM).
  • The default buffer size is 100 MB, as specified by mapreduce.task.io.sort.mb.
  • Spilling is the process of copying the data from the memory buffer to the local disc of the NodeManager once a certain threshold is reached.
  • The default spill threshold is 0.80 (80% of the buffer), as specified by mapreduce.map.sort.spill.percent.
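A small sketch of where these two knobs are set on a job; the values used here (200 MB, 0.90) are arbitrary examples, not recommendations from the slides.

```java
// Sketch: tune the map-side sort buffer and spill threshold for one job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // sort buffer in MB (default 100)
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill at 90% full (default 0.80)
        return Job.getInstance(conf, "spill-tuning-example");
    }
}
```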

  34. MapReduce Q. What is the difference between blocks, input splits and records?

  35. MapReduce Q. What is the difference between blocks, input splits and records?
  • Blocks (physical division): data in HDFS is physically stored as blocks.
  • Input splits (logical division): logical chunks of data, each processed by an individual mapper.
  • Records: each input split is comprised of records, e.g. in a text file each line is a record.

  36. MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?

  37. MapReduce Q. What is the role of RecordReader in Hadoop MapReduce?
  • The RecordReader converts the data present in a file into (key, value) pairs suitable for reading by the Mapper task.
  • The RecordReader instance to use is defined by the InputFormat.
  • Example (text input): the keys are the byte offsets of the lines (0, 57, 122, 171, ...) and the values are the lines themselves ("1 David", "2 Cassie", "3 Remo", "4 Ramesh", ...).
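A minimal mapper sketch of the same example: TextInputFormat's LineRecordReader hands each map() call a byte-offset key and a line-of-text value (the sample names simply mirror the slide).

```java
// Sketch: a mapper that receives (byte offset, line) pairs from the RecordReader.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OffsetLoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // e.g. offset = 0 and line = "1 David" for the first record of the split
        context.write(new Text(line.toString()), offset);
    }
}
```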

  38. MapReduce Q. What is the significance of counters in MapReduce?

  39. MapReduce Q. What is the significance of counters in MapReduce?
  • Counters are used for gathering statistics about the job, e.g. for quality control or for application-level statistics.
  • In a large distributed job it is easier to retrieve counters than to dig the same information out of log messages.
  • Example: counting the number of invalid records in the input (in the sample input, the malformed lines "2%^&%d" and "5$*&!#$" would push an "invalid records" counter to 2).
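A short mapper sketch of that "invalid records" example; the validity rule used here is made up purely for illustration.

```java
// Sketch: bump a custom counter for every record that fails a simple validity check.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    enum RecordQuality { INVALID_RECORD }   // custom counter, visible in the job's counter report

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Illustrative rule: a valid record looks like "<id> <alphabetic name>".
        if (!value.toString().matches("\\d+\\s+[A-Za-z]+")) {
            context.getCounter(RecordQuality.INVALID_RECORD).increment(1);
            return;   // skip invalid records
        }
        context.write(value, key);
    }
}
```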

  40. MapReduce Q. Why is the output of map tasks stored (spilled) to the local disc and not in HDFS?

  41. MapReduce Q. Why is the output of map tasks stored (spilled) to the local disc and not in HDFS?
  • The output of a map task consists of intermediate key-value pairs, which are then processed by a reducer.
  • This intermediate output is not required once the job completes.
  • Storing it in HDFS and replicating it would therefore create unnecessary overhead, so it is written to the NodeManager's local disc instead.

  42. MapReduce Q. Define Speculative Execution

  43. MapReduce Q. Define Speculative Execution
  • If a task is detected to be running slower than expected, an equivalent duplicate task is launched on another NodeManager so that the slow task does not hold up the critical path of the job.
  • The scheduler tracks the progress of all tasks (map and reduce) and launches speculative duplicates for the slower ones.
  • After a task completes, all of its still-running duplicates are killed.
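A small sketch of the switches that control this behaviour (speculative execution is on by default; the property names are the standard Hadoop 2.x ones).

```java
// Sketch: enable speculative duplicates for map tasks but not for reduce tasks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "speculation-example");
    }
}
```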

  44. MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper?

  45. MapReduce Q. How will you prevent a file from splitting in case you want the whole file to be processed by the same mapper?
  Method 1: Increase the minimum split size to be larger than the largest file, inside the driver section:
  i. conf.set("mapred.min.split.size", "size_larger_than_file_size");
  ii. Input split size is computed as: max(minimumSize, min(maximumSize, blockSize))
  Method 2: Modify the InputFormat class that you want to use, i.e. subclass the concrete subclass of FileInputFormat and override the isSplitable() method to return false, as shown below:
  public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
          return false;
      }
  }
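A small driver sketch wiring Method 2 into a job (class and argument names are illustrative); it assumes the NonSplittableTextInputFormat class above is on the classpath.

```java
// Sketch: a driver that uses the non-splittable input format, so each input file
// becomes exactly one split and is therefore processed by a single mapper.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WholeFilePerMapperDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "whole-file-per-mapper");
        job.setJarByClass(WholeFilePerMapperDriver.class);
        job.setInputFormatClass(NonSplittableTextInputFormat.class);   // Method 2 from the slide
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```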

  46. MapReduce Q. Is it legal to set the number of reducer tasks to zero? Where will the output be stored in this case?

  47. MapReduce Q. Is it legal to set the number of reducer tasks to zero? Where will the output be stored in this case?
  • Yes, it is legal to set the number of reducer tasks to zero.
  • This is done when there is no need for a reducer, e.g. when the input only needs to be transformed into a particular format, or for a map-side join.
  • With the number of reducers set to zero, the map output is stored directly in HDFS at the output location specified by the client, instead of being spilled to the local disc.
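A minimal map-only driver sketch (input and output paths come from the command-line arguments); with zero reducers the mappers write their part-m-* files straight to the HDFS output path.

```java
// Sketch: a map-only job; there is no shuffle and no reduce phase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-job");
        job.setJarByClass(MapOnlyJobDriver.class);
        job.setNumReduceTasks(0);   // map output goes directly to the HDFS output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```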

  48. MapReduce Q. What is the role of Application Master in a MapReduce Job?

  49. MapReduce Q. What is the role of Application Master in a MapReduce Job?
  • The ApplicationMaster acts as a helper process for the ResourceManager, managing a single MapReduce job.
  • It initializes the job and keeps track of the job's progress.
  • It retrieves the input splits computed by the client and creates a map task object for each split.
  • It negotiates with the ResourceManager for the resources needed to run the job's tasks.
  • Flow: the client submits the job to the ResourceManager (RM), the RM launches the AM on a NodeManager (NM), the AM asks the RM for resources, runs the tasks, reports status, and finally unregisters.

  50. MapReduce Q. What do you mean by MapReduce task running in uber mode?
