Sunday, July 17, 2016

Hadoop Admin Interview Question 2

1. Can you describe your Hadoop journey, your current profile, and your roles and responsibilities?

2. What is NameNode heap memory, and how can we configure it?
Ans: Setting HADOOP_NAMENODE_OPTS="-Xmx500m" (typically in hadoop-env.sh) will set it to 500 MB. The "OPTS" here refers to JVM options; -Xmx is the standard JVM option for setting the maximum heap.
NameNode heap size depends on many factors such as the number of files, the number of blocks, and the load on the system.
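As a rough sketch of where this is set (in hadoop-env.sh; the 4 GB figure below is only an example value, not a recommendation):

    # hadoop-env.sh on the NameNode host
    # -Xms / -Xmx set the initial and maximum JVM heap for the NameNode process
    export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"

The NameNode must be restarted before a new heap size takes effect.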

3. How do we decide the heap memory limit for a Hadoop service?
Ans: http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.

Cluster A: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=1
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 1 = 128 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 36,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum heap space.

Cluster B: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=3
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum heap space.

Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.

4. How do we decide the heap memory limit for the NameNode?
Ans: A related question: if we have to increase the NameNode heap memory, how will we increase it?
The heap cannot be changed while the NameNode is running; the new -Xmx value has to be set in the configuration and the NameNode restarted.

5. What is the use of the Standby NameNode in a high-availability Hadoop cluster?
Ans: Follow-up: In a high-availability cluster, if the connectivity between the Active NameNode and the Standby NameNode is lost, what will be the impact on the Hadoop cluster?

6. Will the standby NameNode try to become active?

7. In a Hadoop cluster, a few machines have very low-quality hardware. What will be the impact on the jobs running on those machines?

8. What will be the impact on overall cluster performance?

9. What is the difference between a dead node and a blacklisted node?
Ans: A TaskTracker gets blacklisted when the tasks that the JobTracker submits to it fail too many times on that node.
A dead node is a node that is configured as part of the cluster but is no longer reporting to it (no heartbeats), so it does not show up as a live node in the cluster.

10. When and how does a Hadoop cluster mark a node as blacklisted?
Ans: When the tasks that the JobTracker submits to a TaskTracker fail too many times on that node, the JobTracker blacklists that TaskTracker.

11. How does the NameNode decide that a node is dead?
Ans: The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode for a certain amount of time, the DataNode is marked as dead.
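Two HDFS settings drive this timeout; by default a DataNode is declared dead after 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which works out to roughly 10 minutes 30 seconds (the values below are the defaults):

    <!-- hdfs-site.xml (defaults shown) -->
    <property>
      <name>dfs.heartbeat.interval</name>
      <value>3</value>                 <!-- seconds between DataNode heartbeats -->
    </property>
    <property>
      <name>dfs.namenode.heartbeat.recheck-interval</name>
      <value>300000</value>            <!-- milliseconds between NameNode rechecks -->
    </property>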

12. What is speculative execution? What is the benefit of it?

13. How do jobs get scheduled on a Hadoop cluster?
Ans: 1. The client application submits a job to the resource manager.
2. The resource manager takes the job from the job queue and allocates it to an application master. It also manages and monitors resource allocations to each application master and container on the data nodes.
3. The application master divides the job into tasks and allocates them to the data nodes.
4. On each data node, a Node manager manages the containers in which the tasks run.
5. The application master will ask the resource manager to allocate more resources to particular containers, if necessary.
6. The application master will keep the resource manager informed as to the status of the jobs allocated to it, and the resource manager will keep the client application informed.
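A few commands that are handy for watching this flow on a YARN cluster (a rough sketch; the jar name, driver class, and application ID are placeholders):

    # Submit a MapReduce job (example jar and driver class)
    hadoop jar my-job.jar com.example.MyDriver /input /output

    # List running applications and their ApplicationMasters
    yarn application -list

    # Check the status and progress of one application
    yarn application -status application_1489092834123_0001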

14. Which MapReduce version is configured in your Hadoop cluster?
Ans: MapReduce v2 (YARN).

15. What is the difference between MapReduce version one and MapReduce version two?

16. How will you identify a long-running job in a Hadoop cluster? How will you troubleshoot it?
Ans: This is an open-ended question, and the interviewer is trying to gauge the level of hands-on experience you have in solving production issues. Use your day-to-day work experience to answer this question. Here are some scenarios and responses to help you construct your answer. At a very high level you will follow the steps below.

  •  Understand the symptom
  •  Analyze the situation
  •  Identify the problem areas
  •  Propose solution


Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying the map outputs. So in this case we can try a couple of things.

1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
2. Enable map output compression – this will further reduce the size of the outputs to be transferred to the reducer.
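A sketch of how map output compression is usually switched on (MRv2 property names; Snappy is just one common codec choice):

    <!-- mapred-site.xml, or passed per job with -D -->
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>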

Scenario 2 – If a particular task is using a lot of memory, causing slowness or failure, I will look for ways to reduce the memory usage.

1. Make sure the joins are done in an optimal way with memory usage in mind. For example, in Pig joins, the left-hand-side tables are sent to the reducer first and held in memory, while the rightmost table is streamed to the reducer. So make sure the rightmost table is the largest of the datasets in the join.
2. We can also increase the memory available to the map and reduce tasks by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, as shown in the sketch below.
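For example, the container and JVM sizes can be raised per job (the values are illustrative, the jar and driver class are placeholders, and the -D options are only picked up if the driver uses ToolRunner):

    hadoop jar my-job.jar com.example.MyDriver \
      -D mapreduce.map.memory.mb=2048 \
      -D mapreduce.map.java.opts=-Xmx1638m \
      -D mapreduce.reduce.memory.mb=4096 \
      -D mapreduce.reduce.java.opts=-Xmx3276m \
      /input /output

The java.opts heap should stay a bit below the container size so there is headroom for non-heap memory.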

Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and HIVE scripts.

1. If you have smaller tables in join, they can be sent to distributed cache and loaded in memory on the Map side and the entire join can be done on the Map side thereby avoiding the shuffle and reduce phase altogether. This will tremendously improve performance. Look up USING REPLICATED in Pig and MAPJOIN or hive.auto.convert.join in Hive
2. If the data is already sorted you can use USING MERGE which will do a Map Only join
3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or hive.optimize.bucketmapjoin.sortedmerge, depending on the characteristics of the data.
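These switches are typically flipped per session in Hive (a rough sketch; defaults vary by Hive version):

    -- Hive CLI / Beeline session
    SET hive.auto.convert.join=true;                    -- convert small-table joins to map joins
    SET hive.optimize.bucketmapjoin=true;               -- bucket map join
    SET hive.optimize.bucketmapjoin.sortedmerge=true;   -- sort-merge bucket map join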

Scenario 4 – The Shuffle process is the heart of a MapReduce program and it can be tweaked for performance improvement.

1. If you see lots of records are being spilled to the disk (check for Spilled Records in the counters in your MapReduce output) you can increase the memory available for Map to perform the Shuffle by increasing the value in io.sort.mb. This will reduce the amount of Map Outputs written to the disk so the sorting of the keys can be performed in memory.
2. On the reduce side, the merge operation (merging the output from several mappers) can be tuned with mapred.inmem.merge.threshold; setting it to 0 removes the file-count threshold for the in-memory merge. The MRv2 names for these properties are shown after this list.
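The MRv2 names for these properties, as they might be passed per job (the old names still work as deprecated aliases; the values are examples only):

    -D mapreduce.task.io.sort.mb=256              # was io.sort.mb
    -D mapreduce.reduce.merge.inmem.threshold=0   # was mapred.inmem.merge.threshold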

17. How will you kill a Hadoop job?
Ans: hadoop job -kill <jobID>
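Equivalent commands on a YARN cluster (the job and application IDs below are placeholders):

    # MRv1-style syntax, still available through the mapred wrapper
    mapred job -kill job_1489092834123_0012

    # YARN-level kill, works for any application type
    yarn application -kill application_1489092834123_0012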

18. If you have to add a service or install a component in an existing Hadoop cluster, how will you do that?
Ans: (The notes below address a related scenario: what happens to a job when the NameNode is down.) By "Hadoop job" you probably mean a MapReduce job. If your NameNode is down and you don't have a spare one (in an HA setup), your HDFS will not be working, and every component dependent on that HDFS namespace will be either stuck or crashed.
  1. JobTracker (YARN ResourceManager with Hadoop 2.x)
  2. I am not completely sure, but probably the job will be submitted and then fail afterwards.
  3. You cannot submit a job to a stopped JobTracker.


19. How will you restart the NameNode?
Ans: The easiest way is to stop the cluster by running the stop-all.sh script and then start it again with start-all.sh; this restarts the NameNode along with the other daemons.
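A less disruptive option is to restart only the HDFS daemons, or just the NameNode process, instead of the whole cluster (a sketch for a non-HA cluster using the stock scripts):

    # HDFS daemons only
    stop-dfs.sh && start-dfs.sh

    # Or only the NameNode process on its host
    hadoop-daemon.sh stop namenode
    hadoop-daemon.sh start namenode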

20. How will you add a node to a Hadoop cluster? What are the steps, and which files do you need to edit?
Ans: To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then the DataNode and TaskTracker (NodeManager on a YARN cluster) should be started on the new node.
To remove or decommission nodes from the HDFS cluster, the hostnames should be added to the excludes file and hdfs dfsadmin -refreshNodes should be executed; once decommissioning completes, the hosts can be removed from the slaves file.
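A rough sketch of both operations (hostnames and paths are examples; the excludes file is whatever dfs.hosts.exclude points to in your configuration):

    # Add a node: list it in the slaves file, then start the worker daemons on it
    echo "newnode.example.com" >> $HADOOP_CONF_DIR/slaves
    # on newnode.example.com:
    hadoop-daemon.sh start datanode
    yarn-daemon.sh start nodemanager        # TaskTracker on an MRv1 cluster

    # Decommission a node: add it to the excludes file and refresh
    echo "oldnode.example.com" >> $HADOOP_CONF_DIR/dfs.exclude
    hdfs dfsadmin -refreshNodes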

21. What is your contribution in Hive in your current project?

22. What you do through Oozie in your current project?

23. What are the schedulers available in Hadoop?
Ans: FIFO Scheduler – This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
COSHH – This scheduler considers the workload, the cluster, and the user heterogeneity when making scheduling decisions.

Fair Scheduler (Fair Sharing) – This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource, and each user can use their own pool to execute their jobs.
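Which scheduler the ResourceManager uses is controlled by a single setting (shown here selecting the Fair Scheduler; the CapacityScheduler class would go in the same place for capacity scheduling):

    <!-- yarn-site.xml -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>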
