Hadoop Interview Questions & Answers

Latest Hadoop Interview Questions & Answers (2025 Update)

What are the key differences between Hadoop 1 and Hadoop 2?
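Hadoop 1 had a single JobTracker that handled both resource management and job scheduling, which limited scalability and made it a single point of failure. Hadoop 2 introduced YARN, which splits those duties between a cluster-wide ResourceManager and a per-application ApplicationMaster. This lets frameworks other than MapReduce, such as Spark and Tez, run on the same cluster. Hadoop 2 also added high availability for the NameNode through an active/standby pair.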

How do you optimize a MapReduce job?
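Tune the number of mappers (through input split size) and the number of reducers to match the data volume. Use a Combiner to cut the amount of data shuffled to reducers. Keep map and reduce logic efficient, for example by avoiding needless object creation inside the map() and reduce() calls. Monitor job counters and task execution times to find bottlenecks such as skewed reducers. Compress intermediate map output to speed up shuffle transfers.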

Explain the concept of data locality in Hadoop.
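Data locality means moving computation to the data instead of moving data to the computation. Hadoop tries to run each task on a node that holds a replica of the task's input block. If no such node has capacity, it prefers a node on the same rack, and only then any other node. This reduces network traffic and speeds up processing.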
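
The scheduler's preference can be sketched as a three-step fallback: node-local, then rack-local, then any free node. The code below is a toy model under that assumption, not Hadoop's scheduler code; the function name pick_node and the node and rack names are hypothetical.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Return (node, locality_level) for a task that reads one block.

    block_replicas: set of nodes holding a replica of the block
    free_nodes:     list of nodes that can accept a task right now
    rack_of:        dict mapping node name -> rack id
    """
    # 1. Node-local: a free node that already stores the block.
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    # 2. Rack-local: a free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the block must cross racks.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no-capacity"
```

For example, with replicas on n1 and n3 and free slots on n2 and n3, the function returns n3 as node-local, so no block data crosses the network.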

How do you handle small files in HDFS?
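HDFS is not ideal for storing many small files, because the NameNode keeps metadata for every file and every block in memory. Combine small files into larger ones using a Hadoop Archive (HAR) or a SequenceFile. This reduces the NameNode metadata load and improves performance.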
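
A back-of-the-envelope way to see the metadata effect is to count the objects the NameNode has to track: one per file plus one per block. The sketch below assumes a 128 MB block size and ignores any overhead added by the container format; namenode_objects is a hypothetical helper, not a Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size for this estimate


def namenode_objects(file_sizes_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count namespace objects: one per file plus one per block."""
    blocks = sum(max(1, math.ceil(size / block_size_mb))
                 for size in file_sizes_mb)
    return len(file_sizes_mb) + blocks


small_files = [1] * 10_000        # 10,000 files of 1 MB each
packed = [sum(small_files)]       # the same bytes in one container file

print(namenode_objects(small_files))  # 20000
print(namenode_objects(packed))       # 80
```

The data volume is identical in both cases; only the number of objects held in NameNode memory changes.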

What is speculative execution, and what are its benefits in Hadoop?
Speculative execution launches a duplicate attempt of a slow-running task on a different node. Whichever attempt finishes first is accepted, and the other is killed. This helps when a task lags because of a degraded or overloaded node, improving overall job completion time.
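
The effect on job completion time can be illustrated with a small model: a job ends when its slowest task ends, and a task with a backup attempt ends when the earlier of the two attempts ends. This is a simplified sketch, not Hadoop's speculation logic; job_time and its parameters are hypothetical, and every backup is assumed to take the same time.

```python
def job_time(task_durations, backup_start=None, backup_duration=None):
    """Return the time at which the slowest task finishes.

    If backup_start is given, every task still running at that time gets
    a duplicate attempt that takes backup_duration; the earlier of the
    two attempts wins.
    """
    finish_times = []
    for duration in task_durations:
        finish = duration
        if backup_start is not None and duration > backup_start:
            finish = min(duration, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


print(job_time([10, 11, 40]))                                      # 40
print(job_time([10, 11, 40], backup_start=12, backup_duration=10))  # 22
```

In this example one straggler holds the job to 40 time units; a backup attempt started at 12 brings it down to 22. The trade-off is that the duplicate attempt consumes extra cluster resources.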

What is YARN, and how does it differ from the original Hadoop architecture?
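YARN separates cluster resource management from per-application scheduling and monitoring. The ResourceManager allocates cluster resources, a NodeManager on each node launches and monitors containers, and a per-application ApplicationMaster negotiates resources and tracks that application's tasks. In the original architecture, a single JobTracker did all of this and only for MapReduce jobs. With YARN, multiple processing frameworks can run on the same Hadoop cluster.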

What is the function of the NameNode and DataNode in HDFS?
The NameNode manages the file system namespace and metadata. DataNodes store the actual data blocks and report their status to the NameNode.

How do you plan capacity for a new Hadoop cluster? (Admin)
Capacity planning involves estimating the data growth rate, retention period, and replication factor. Calculate the required storage by multiplying the raw data size by the replication factor (typically 3), then add a 20-30% buffer for intermediate data.
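
The arithmetic can be written down as a short calculation. This is a minimal sketch that assumes a flat daily ingest rate (no growth over the retention period); the function name and the sample figures are hypothetical.

```python
def required_raw_storage_tb(daily_ingest_tb, retention_days,
                            replication_factor=3, buffer_fraction=0.25):
    """Estimate the raw disk capacity needed, in TB.

    daily_ingest_tb:    new data per day, before replication
    retention_days:     how long data is kept
    replication_factor: HDFS replicas per block (typically 3)
    buffer_fraction:    headroom for intermediate data (0.20 to 0.30)
    """
    logical_tb = daily_ingest_tb * retention_days
    replicated_tb = logical_tb * replication_factor
    return replicated_tb * (1 + buffer_fraction)


# 1 TB/day kept for one year, 3 replicas, 25% buffer
print(required_raw_storage_tb(1, 365))  # 1368.75
```

If ingest is expected to grow, the flat rate can be replaced with a per-period projection before applying the replication factor and buffer.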

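What is rack awareness, and why is it important in Hadoop? (Admin)
Rack awareness is Hadoop's knowledge of the cluster's network topology, that is, which nodes sit on which rack. HDFS uses it to place block replicas on more than one rack so that data survives the loss of a whole rack. Configure it by pointing the net.topology.script.file.name property in core-site.xml at a topology script that maps host names or IP addresses to rack paths.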
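
A topology script is called with one or more host names or IP addresses as arguments and prints one rack path per argument. The sketch below shows the general shape of such a script; the IP addresses and rack names in RACK_BY_HOST are made up for illustration, and a real script would usually read the mapping from a file maintained by the admin team.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack mapping, for illustration only.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts not in the mapping


def resolve(hosts):
    """Return the rack path for each host, in the same order."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command-line arguments.
    print("\n".join(resolve(sys.argv[1:])))
```

The script must be executable by the user running the Hadoop daemons, and it should always print a rack for every argument, which is why unknown hosts fall back to a default rack here.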

What is the use of Combiner in MapReduce? (Developer)
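A Combiner is a mini-reducer that runs on mapper output before the data is sent to the reducer. It performs local aggregation, reducing the volume of data shuffled across the network. Hadoop may run the Combiner zero, one, or several times per mapper, so it must not change the final result; that is why it is only safe for associative and commutative operations such as sum, count, min, and max.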
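
The saving is easy to see in a word-count example simulated in plain Python. This sketch does not use the Hadoop API; run_job, map_words, and sum_by_key are hypothetical names, and each inner list in splits stands for the input of one mapper.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def sum_by_key(pairs):
    """Sum values per key. Serves as both combiner and reducer, which
    is valid because addition is associative and commutative."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def run_job(splits, use_combiner):
    """Return (final_counts, number_of_records_shuffled)."""
    shuffled = []
    for lines in splits:  # one mapper per input split
        mapper_output = [pair for line in lines for pair in map_words(line)]
        if use_combiner:
            mapper_output = sum_by_key(mapper_output)  # local aggregation
        shuffled.extend(mapper_output)  # records sent to the reducer
    return dict(sum_by_key(shuffled)), len(shuffled)


splits = [["a b a", "a c"], ["b b a"]]
print(run_job(splits, use_combiner=False))  # ({'a': 4, 'b': 3, 'c': 1}, 8)
print(run_job(splits, use_combiner=True))   # ({'a': 4, 'b': 3, 'c': 1}, 5)
```

The final counts are identical, but the combiner cuts the shuffled records from 8 to 5. An operation such as average would not work this way, because an average of averages is not the overall average.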

What is the main difference between Hadoop MapReduce and Apache Spark?
Spark keeps intermediate data in memory where possible, while MapReduce writes intermediate results to disk between stages. This makes Spark significantly faster for iterative algorithms, which reread the same data many times. Spark also provides unified libraries for SQL, streaming, machine learning, and graph processing.

What is the role of Secondary NameNode?
The Secondary NameNode is NOT a backup. It performs periodic checkpointing by merging FSImage and EditLogs. This reduces NameNode startup time by preventing EditLogs from growing too large.
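
Conceptually, a checkpoint replays the logged edits onto the last saved image and then starts a fresh log. The toy model below illustrates only that idea: the namespace is reduced to a set of paths, the two operation names are invented for the example, and none of it reflects the real FSImage or EditLog file formats.

```python
def checkpoint(fsimage, edit_log):
    """Replay edit_log onto a copy of fsimage.

    fsimage:  set of file paths in the last saved namespace image
    edit_log: list of (operation, path) tuples recorded since then
    Returns (new_fsimage, new_edit_log); the new log is empty.
    """
    image = set(fsimage)  # work on a copy, leave the old image intact
    for operation, path in edit_log:
        if operation == "create":
            image.add(path)
        elif operation == "delete":
            image.discard(path)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return image, []


old_image = {"/data/a.csv", "/data/b.csv"}
edits = [("create", "/data/c.csv"), ("delete", "/data/a.csv")]
new_image, new_log = checkpoint(old_image, edits)
print(sorted(new_image))  # ['/data/b.csv', '/data/c.csv']
print(new_log)            # []
```

Without periodic checkpoints, the NameNode would have to replay the entire accumulated log at startup, which is the delay the Secondary NameNode is there to avoid.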
