1. Can you describe your Hadoop journey, your current profile, and your roles and responsibilities?
2. What is NameNode heap memory, and how can we configure it?
Ans: The NameNode keeps the file system namespace and the block map in its JVM heap, so the heap size it needs depends on many factors, such as the number of files, the number of blocks, and the load on the system.
Setting HADOOP_NAMENODE_OPTS="-Xmx500m" (typically in hadoop-env.sh) sets the maximum heap to 500 MB. "OPTS" here refers to JVM options, and -Xmx is the standard JVM option for the maximum heap size.
3. How do we decide the heap memory limit for a Hadoop service?
Ans: http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.
Cluster A: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=1
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 1 = 128 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 37,500,000 blocks, rounded in this example to 36,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum heap space.
Cluster B: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=3
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,500,000 blocks, rounded in this example to 12,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum heap space.
Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.
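The sizing arithmetic above can be reproduced with a small calculation. This is a minimal sketch, assuming the rule of thumb quoted in the example (1 GB of heap per million blocks) and 1 TB = 1,000,000 MB; the function name and its parameters are hypothetical. Exact division gives 37.5 and 12.5 million blocks, which the worked example rounds to 36 and 12 million.

```python
def estimate_namenode_heap_gb(hosts, tb_per_host, block_mb=128, replication=3,
                              gb_per_million_blocks=1.0):
    """Estimate the NameNode maximum heap (in GB) from raw cluster capacity."""
    # The example treats 1 TB as 1,000,000 MB.
    capacity_mb = hosts * tb_per_host * 1_000_000
    # Each namespace block consumes block_mb * replication of raw disk.
    storage_per_block_mb = block_mb * replication
    namespace_blocks = capacity_mb / storage_per_block_mb
    return namespace_blocks / 1_000_000 * gb_per_million_blocks


cluster_a = estimate_namenode_heap_gb(200, 24, replication=1)
cluster_b = estimate_namenode_heap_gb(200, 24, replication=3)
print(f"Cluster A: about {cluster_a:.1f} GB of heap")
print(f"Cluster B: about {cluster_b:.1f} GB of heap")
```

Treat the result as a starting point for -Xmx rather than a precise requirement, since file counts and load also affect heap usage.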
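4. How do we decide the heap memory limit for the NameNode? If we have to increase the NameNode heap memory, how do we increase it?
Ans: The heap cannot be increased while the NameNode is running. Raise the -Xmx value in HADOOP_NAMENODE_OPTS and restart the NameNode for the new limit to take effect.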
5. What is the use of the Standby NameNode in a high-availability Hadoop cluster? If connectivity between the Active NameNode and the Standby NameNode is lost, what is the impact on the Hadoop cluster?
6. Will the standby NameNode try to become active?
7. In a Hadoop cluster, a few machines have very low-quality hardware. What will be the impact on jobs running on those machines?
8. What will be the impact on overall cluster performance?
9. What is the difference between a dead node and a blacklisted node?
Ans: A blacklisted node is a TaskTracker that the JobTracker has blacklisted because the tasks on that node failed too many times.
A dead node is one that is not in the cluster, or that is configured but is not showing up in the cluster.
10. When and how does a Hadoop cluster mark a node as blacklisted?
Ans: When the JobTracker assigns tasks to a TaskTracker and the tasks on that node have failed too many times, the JobTracker blacklists that TaskTracker.
11. How does the NameNode decide that a node is dead?
Ans: The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode has not received a heartbeat from a DataNode for a certain amount of time, it marks that DataNode as dead.
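The "certain amount of time" can be worked out from two HDFS settings. This is a minimal sketch, assuming the commonly documented defaults: dfs.heartbeat.interval of 3 seconds and dfs.namenode.heartbeat.recheck-interval of 300,000 ms, combined as 2 * recheck interval + 10 * heartbeat interval. The function names are hypothetical, and the values should be checked against your own hdfs-site.xml.

```python
def dead_node_timeout_seconds(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Seconds without a heartbeat before a DataNode is considered dead."""
    # Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


def is_dead(last_heartbeat_ts, now_ts, timeout_s):
    """True when the last heartbeat is older than the timeout (times in seconds)."""
    return now_ts - last_heartbeat_ts > timeout_s


timeout = dead_node_timeout_seconds()
print(f"Timeout: {timeout:.0f} s ({timeout / 60:.1f} minutes)")
print(is_dead(last_heartbeat_ts=0, now_ts=700, timeout_s=timeout))
```

With these assumed defaults the timeout is 630 seconds, so a DataNode that stops sending heartbeats is marked dead after roughly ten and a half minutes.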
12. What is speculative execution? What is its benefit?
13. How do jobs get scheduled on a Hadoop cluster?
Ans: 1. The client application submits a job to the ResourceManager.
2. The ResourceManager takes the job from the job queue and allocates it to an ApplicationMaster. It also manages and monitors resource allocations to each ApplicationMaster and container on the data nodes.
3. The ApplicationMaster divides the job into tasks and assigns them to the data nodes.
4. On each data node, a NodeManager manages the containers in which the tasks run.
5. The ApplicationMaster asks the ResourceManager to allocate more resources to particular containers, if necessary.
6. The ApplicationMaster keeps the ResourceManager informed about the status of the jobs allocated to it, and the ResourceManager keeps the client application informed.
14. Which MapReduce version is configured in your Hadoop cluster?
Ans: MapReduce v2 (YARN)
15. What is the difference between MapReduce version 1 and MapReduce version 2?
16. How will you identify a long-running job in a Hadoop cluster? How will you troubleshoot it?
Ans: This is an open-ended question; the interviewer is trying to gauge your hands-on experience in solving production issues. Use your day-to-day work experience to answer it. Here are some scenarios and responses to help you construct your answer. At a very high level, you will follow the steps below.
- Understand the symptom
- Analyze the situation
- Identify the problem areas
- Propose a solution
Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One reason could be that the reducer is spending a lot of time copying the map outputs. In this case we can try a couple of things.
1. If possible, add a combiner to reduce the amount of mapper output sent to the reducer.
2. Enable map output compression – this further reduces the size of the outputs transferred to the reducer.
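The effect of a combiner can be illustrated outside Hadoop with a toy word count. This is a plain-Python sketch, not Hadoop API code; all function names are hypothetical. It counts how many (key, value) pairs would cross the shuffle with and without local combining, and it relies on the reduce operation (summing) being associative and commutative, which is the condition under which a combiner is safe.

```python
from collections import Counter


def map_phase(lines):
    """Emit (word, 1) pairs; each input line stands in for one mapper."""
    return [[(word, 1) for word in line.split()] for line in lines]


def combine(pairs):
    """Sum the counts locally for a single mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def reduce_phase(pairs):
    """Sum the counts for every word across all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


lines = ["a b a a", "b b a", "c a c c"]
mapped = map_phase(lines)
shuffled_plain = [pair for mapper in mapped for pair in mapper]
shuffled_combined = [pair for mapper in mapped for pair in combine(mapper)]

print(len(shuffled_plain), "pairs shuffled without a combiner")
print(len(shuffled_combined), "pairs shuffled with a combiner")
print(reduce_phase(shuffled_plain) == reduce_phase(shuffled_combined))
```

The final counts are identical in both cases; only the volume of data sent to the reducer changes.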
Scenario 2 – A particular task is using a lot of memory, which is causing slowness or failure. I will look for ways to reduce the memory usage.
1. Make sure the joins are made in an optimal way with memory usage in mind. For example, in Pig joins the left-hand tables are sent to the reducer first and held in memory, and the rightmost table is streamed to the reducer. So make sure the rightmost table is the largest of the datasets in the join.
2. We can also increase the memory available to the map and reduce tasks by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
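When raising the container sizes, the task JVM heap usually has to be raised with them. The sketch below builds a consistent set of properties, assuming a rule of thumb of giving the JVM about 80% of the container (the heap_ratio value is an assumption, not a Hadoop requirement) and assuming the job accepts -D options on the command line. The helper names are hypothetical; mapreduce.map.java.opts and mapreduce.reduce.java.opts are the standard properties for task JVM options.

```python
def memory_settings(map_mb, reduce_mb, heap_ratio=0.8):
    """Return container and JVM heap properties for map and reduce tasks."""
    def heap_opt(container_mb):
        # Leave headroom in the container for non-heap JVM memory.
        return f"-Xmx{int(container_mb * heap_ratio)}m"

    return {
        "mapreduce.map.memory.mb": str(map_mb),
        "mapreduce.map.java.opts": heap_opt(map_mb),
        "mapreduce.reduce.memory.mb": str(reduce_mb),
        "mapreduce.reduce.java.opts": heap_opt(reduce_mb),
    }


def as_cli_flags(settings):
    """Render the properties as -Dkey=value arguments."""
    return [f"-D{key}={value}" for key, value in settings.items()]


flags = as_cli_flags(memory_settings(2048, 4096))
print(" ".join(flags))
```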
Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in Pig and Hive scripts.
1. If the join includes smaller tables, they can be sent to the distributed cache and loaded in memory on the map side, so the entire join is done on the map side and the shuffle and reduce phases are avoided altogether. This can improve performance tremendously. Look up USING REPLICATED in Pig and MAPJOIN or hive.auto.convert.join in Hive.
2. If the data is already sorted, you can use USING MERGE, which performs a map-only join.
3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or hive.optimize.bucketmapjoin.sortedmerge, depending on the characteristics of the data.
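The idea behind a replicated (map-side) join can be sketched in plain Python: the small table is loaded into an in-memory lookup, and the large table is streamed past it one row at a time. This illustrates the technique only, not how Pig or Hive implement it; the function name, the sample tables, and the column positions are hypothetical.

```python
def replicated_join(small_rows, large_rows, small_key=0, large_key=0):
    """Inner-join two row sets by holding the small one in memory."""
    # Build the in-memory lookup, as the distributed cache copy would be used.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[small_key], []).append(row)
    # Stream the large table; no sort or shuffle is needed.
    for row in large_rows:
        for match in lookup.get(row[large_key], []):
            yield row + match


countries = [("IN", "India"), ("US", "United States")]
orders = [(1, "IN"), (2, "US"), (3, "FR"), (4, "IN")]

joined = list(replicated_join(countries, orders, small_key=0, large_key=1))
for row in joined:
    print(row)
```

The approach only works when the small table fits comfortably in each task's memory, which is why knowing the data sizes matters.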
Scenario 4 – The shuffle process is the heart of a MapReduce program, and it can be tuned to improve performance.
1. If you see that many records are being spilled to disk (check Spilled Records in the counters of your MapReduce output), you can increase the memory available to the map task for sorting by raising io.sort.mb (named mapreduce.task.io.sort.mb in Hadoop 2). This reduces the number of spills written to disk, because more of the sorting of keys can be performed in memory.
2. On the reduce side, setting mapred.inmem.merge.threshold (mapreduce.reduce.merge.inmem.threshold in Hadoop 2) to 0 removes the file-count trigger for merging map outputs, so the merge is driven only by memory consumption.
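A rough estimate shows why a larger sort buffer reduces spilling. This is a back-of-the-envelope sketch, assuming the commonly cited defaults of a 100 MB sort buffer and a spill threshold of 0.80, and ignoring the record metadata that also occupies the buffer. The function name is hypothetical, and real spill counts will differ.

```python
import math


def estimated_spills(map_output_mb, sort_mb=100, spill_percent=0.80):
    """Approximate number of spills for one map task."""
    # A spill starts whenever the buffer fills to the spill threshold.
    usable_mb = sort_mb * spill_percent
    # Map output is written to disk at least once, so one spill is the minimum.
    return max(1, math.ceil(map_output_mb / usable_mb))


print(estimated_spills(450))                # assumed default buffer
print(estimated_spills(450, sort_mb=1024))  # larger buffer
```

A single spill per map task is the ideal case, because the output is then written once with no intermediate merge passes.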
17. How will you kill a Hadoop job?
Ans: hadoop job -kill <jobID> (on YARN, mapred job -kill <jobID> or yarn application -kill <applicationID>)
18. If you have to add a service or install a component in an existing Hadoop cluster, how will you do that?
Ans: By Hadoop job, you probably mean a MapReduce job. If your NameNode is down and you do not have a spare one (as in an HA setup), HDFS will not be working, and every component that depends on that HDFS namespace will be either stuck or crashed.
- JobTracker (YARN ResourceManager in Hadoop 2.x)
- I am not completely sure, but the job will probably be submitted and fail afterwards.
- You cannot submit a job to a stopped JobTracker.
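19. How will you restart the NameNode?
Ans: The easiest way is to run stop-all.sh to stop the running daemons and then run start-all.sh, which starts them again and so restarts the NameNode. To restart only the NameNode, run hadoop-daemon.sh stop namenode followed by hadoop-daemon.sh start namenode.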
20. How will you add a node to a Hadoop cluster? What are the steps, and which files do you need to edit?
Ans: To add new nodes to the HDFS cluster, add the hostnames to the slaves file and then start the DataNode and TaskTracker on the new node.
To remove or decommission nodes from the HDFS cluster, list the hostnames in the exclude file (the file named by dfs.hosts.exclude), run hdfs dfsadmin -refreshNodes, and remove the hostnames from the slaves file once decommissioning completes.
21. What is your contribution to Hive in your current project?
22. What do you do with Oozie in your current project?
23. What are the schedulers available in Hadoop?
Ans: FIFO Scheduler – This scheduler does not consider heterogeneity in the system; it orders jobs in a queue by their arrival times.
COSHH – This scheduler considers workload, cluster, and user heterogeneity when making scheduling decisions.
Fair Sharing – This scheduler defines a pool for each user. Each pool contains a number of map and reduce slots on a resource, and each user uses their own pool to run jobs.
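The difference between FIFO ordering and per-user pools can be shown with a toy model. This is only an illustration of the two policies as described above, not the actual Hadoop scheduler implementations; the function names, the job tuples, and the even split of slots are assumptions made for the example.

```python
def fifo_order(jobs):
    """Order jobs by arrival time; each job is (job_id, user, arrival_time)."""
    return [job_id for job_id, _user, _arrival in sorted(jobs, key=lambda j: j[2])]


def fair_share(total_slots, users):
    """Split slots evenly across user pools; leftovers go to the first pools."""
    pools = sorted(set(users))
    if not pools:
        return {}
    base, extra = divmod(total_slots, len(pools))
    return {user: base + (1 if i < extra else 0) for i, user in enumerate(pools)}


jobs = [("j1", "alice", 5), ("j2", "bob", 1), ("j3", "alice", 3)]
order = fifo_order(jobs)
shares = fair_share(10, ["alice", "bob", "alice", "carol"])
print(order)
print(shares)
```

Under FIFO, a user who submits late waits behind every earlier job, whereas under fair sharing each user's pool gets its own slice of the slots regardless of arrival order.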