Tuesday, July 26, 2016

Installing a Hortonworks Cluster on a Single Node


1. Install CentOS 7 (Minimal) on the machine.

2. Configure the network during installation.

3. Start CentOS.

4. Configure Java.

mkdir /usr/java
tar -zxvf jdk-8u71-linux-x64.tar.gz -C /usr/java

Set the variables in /etc/profile (vi /etc/profile):
export JAVA_HOME=/usr/java/jdk1.8.0_71
export PATH=$PATH:$JAVA_HOME/bin
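
After editing /etc/profile, reload it and confirm the JDK is picked up (a quick check, assuming the paths above):

source /etc/profile
echo $JAVA_HOME
java -version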

5. Set the hostname and the /etc/hosts file.
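
A minimal example; the IP 192.168.228.133 and the name node1 are taken from elsewhere in these notes, so substitute your own values:

hostnamectl set-hostname node1
echo "192.168.228.133   node1" >> /etc/hosts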

6. Mount the CentOS Everything ISO.

mkdir /mnt/DVD
mount -r -t iso9660 /dev/sr0 /mnt/DVD

7. Install and enable the httpd service.

yum install -y httpd
service httpd status
service httpd start

8. Create the repo files manually.

cd /etc/yum.repos.d/

vi Centos.repo

[Centos]
name=Centos
baseurl=file:///mnt/DVD/
enabled=1
gpgcheck=0

vi ambari.repo

[Ambari]
name=Ambari
baseurl=http://192.168.228.133/AMBARI-2.2.1.1/centos7/2.2.1.1-70
enabled=1
gpgcheck=0
---------------------
vi hdp.repo

[HDP_1]
name=HDP_1
baseurl=http://192.168.228.133/HDP/centos7/2.x/updates/2.4.0.0
enabled=1
gpgcheck=0
---------------------
vi hdp-util.repo


[HDP-UTILS]
name=HDP-UTILS
baseurl=http://192.168.228.133/HDP-UTILS-1.1.0.20/repos/centos7/
enabled=1
gpgcheck=0
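
To confirm yum can see the repos defined above (a quick sanity check):

yum clean all
yum repolist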


9. Stop and disable the firewall.

"service firewalld stop" On Bootup firewall starts again
systemctl enable httpd"systemctl disable firewalld" permanent disable

10. Extract the tarballs into the web server root.

tar -zxvf ambari-2.2.0.0-centos7.tar.gz -C /var/www/html/
tar -zxvf HDP-2.3.4.0-centos7-rpm.tar.gz -C /var/www/html/
tar -zxvf HDP-UTILS-1.1.0.20-centos7.tar.gz -C /var/www/html/

11. Set up passwordless SSH.

ssh-keygen
cd .ssh/
cat id_rsa.pub >> authorized_keys
ssh node1
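
If ssh node1 still prompts for a password, the usual cause is permissions on the .ssh directory; a common fix:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys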

12. Disable SELinux.

vi /etc/sysconfig/selinux
"SELINUX=disabled"

13. Restart the server.

14. Install the Ambari server.

yum install ambari-server

15. Install the PostgreSQL JDBC driver.

yum install postgresql-jdbc

16. Remove duplicate Java entries from alternatives.

alternatives --list
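
If the list shows a stale or duplicate java entry, it can be removed with alternatives --remove; the path below is only an illustration, use the one reported by alternatives --list:

alternatives --remove java /usr/lib/jvm/jre-1.7.0-openjdk/bin/java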


17. Set up Ambari with Java.

ambari-server setup -j /usr/java/jdk1.8.0_71

18. Set up the JDBC driver with the Ambari server.

ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar

19. Start the Ambari server.

ambari-server start
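
To verify the server came up, and to continue the cluster install from the web UI (8080 is Ambari's default port, and the default login is usually admin/admin):

ambari-server status
# then browse to http://<ambari-host>:8080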




Sunday, July 17, 2016

Hadoop Admin Interview Question 2

1. Can you describe your Hadoop journey, your current profile, and your roles and responsibilities?

2. What is NameNode heap memory and how can we configure the NameNode heap memory?
Ans: HADOOP_NAMENODE_OPTS="-Xmx500m" will set it to 500MB.  The "OPTS" here refers to JVM options.  -Xmx is a common JVM option to set the maximum heap.
NameNode heap size depends on many factors such as the number of files, the number of blocks, and the load on the system.
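
For example, the setting lives in hadoop-env.sh; a minimal sketch (the 4 GB value is only an illustration, size it for your cluster):

# in hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"

The NameNode must be restarted for the new heap size to take effect.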

3. How do we decide the heap memory limit for a Hadoop service?
Ans: http://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html
In this example, memory is estimated by considering the capacity of a cluster. Values are rounded. Both clusters physically store 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.

Cluster A: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=1
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 1 = 128 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 128 MB = 36,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster A needs 36 GB of maximum heap space.

Cluster B: 200 hosts of 24 TB each = 4800 TB.
Blocksize=128 MB, Replication=3
Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,000,000 blocks
At capacity, with the recommended allocation of 1 GB of memory per million blocks, Cluster B needs 12 GB of maximum heap space.

Both Cluster A and Cluster B store the same number of block files. In Cluster A, however, each block file is unique and represented by one block on the NameNode; in Cluster B, only one-third are unique and two-thirds are replicas.
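
The same arithmetic can be checked quickly from a shell; the exact quotients are about 37.5 and 12.5 million blocks, which the example above rounds to 36 and 12 million:

echo $(( 4800000000 / 128 ))   # blocks in Cluster A (replication 1)
echo $(( 4800000000 / 384 ))   # blocks in Cluster B (replication 3)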

4. How do we decide the heap memory limit for the NameNode? If we have to increase the NameNode heap memory, how will we increase it?
Ans: The heap memory cannot be increased at run time; the new -Xmx value has to be set (for example in HADOOP_NAMENODE_OPTS, as above) and the NameNode restarted.

5. What is the use of the Standby NameNode in a high-availability Hadoop cluster? In a high-availability setup, if the connectivity between the Active NameNode and the Standby NameNode is lost, what will be the impact on the Hadoop cluster?

6. Will the standby NameNode try to become active?

7. In a Hadoop cluster, a few machines have very low-quality hardware. What will be the impact on the jobs running on those machines?

8. What will be the impact on overall cluster performance?

9. What is the difference between a dead node and a blacklisted node?
Ans: When the JobTracker submits tasks to a TaskTracker and the tasks on that node have failed too many times, the JobTracker blacklists that TaskTracker.
A dead node is one that is configured for the cluster but is no longer reporting, so it does not show up in the cluster.

10. When and how does a Hadoop cluster mark a node as blacklisted?
Ans: When the JobTracker submits tasks to a TaskTracker and the tasks on that node have failed too many times, the JobTracker blacklists that TaskTracker.

11. How does the NameNode decide that a node is dead?
Ans: The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode for a certain amount of time, the DataNode is marked as dead.

12. What is speculative execution? What is the benefit of it?

13. How do jobs get scheduled on a Hadoop cluster?
Ans: 1. The client application submits a job to the ResourceManager.
2. The ResourceManager takes the job from the job queue and allocates it to an ApplicationMaster. It also manages and monitors resource allocations to each ApplicationMaster and container on the data nodes.
3. The ApplicationMaster divides the job into tasks and allocates them to the data nodes.
4. On each data node, a NodeManager manages the containers in which the tasks run.
5. The ApplicationMaster will ask the ResourceManager to allocate more resources to particular containers, if necessary.
6. The ApplicationMaster keeps the ResourceManager informed of the status of the jobs allocated to it, and the ResourceManager keeps the client application informed.

14. Which MapReduce version is configured in your Hadoop cluster?
Ans: MapReduce v2 (YARN).

15. What is the difference between MapReduce version one and MapReduce version two?

16. How will you identify a long-running job in a Hadoop cluster, and how will you troubleshoot it?
Ans: This is an open-ended question and the interviewer is trying to see the level of hands-on experience you have in solving production issues. Use your day-to-day work experience to answer this question. Here are some of the scenarios and responses to help you construct your answer. At a very high level you will follow the steps below.

  •  Understand the symptom
  •  Analyze the situation
  •  Identify the problem areas
  •  Propose solution


Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers are complete. One of the reasons could be that the reducer is spending a lot of time copying the map outputs. In this case we can try a couple of things.

1. If possible add a combiner to reduce the amount of output from the mapper to be sent to the reducer
2. Enable map output compression – this will further reduce the size of the outputs to be transferred to the reducer.

Scenario 2 – A particular task is using a lot of memory, which is causing the slowness or failure. In that case I will look for ways to reduce the memory usage.

1. Make sure the joins are made in an optimal way with memory usage in mind. For example, in Pig joins the LEFT-hand side tables are sent to the reducer first and held in memory, and the RIGHT-most table is streamed to the reducer. So make sure the RIGHT-most table is the largest of the datasets in the join.
2. We can also increase the memory available to the map and reduce tasks by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in PIG and HIVE scripts.

1. If you have smaller tables in join, they can be sent to distributed cache and loaded in memory on the Map side and the entire join can be done on the Map side thereby avoiding the shuffle and reduce phase altogether. This will tremendously improve performance. Look up USING REPLICATED in Pig and MAPJOIN or hive.auto.convert.join in Hive
2. If the data is already sorted you can use USING MERGE, which will do a map-only join.
3. If the data is bucketed in Hive, you may use hive.optimize.bucketmapjoin or
hive.optimize.bucketmapjoin.sortedmerge, depending on the characteristics of the data.

Scenario 4 – The Shuffle process is the heart of a MapReduce program and it can be tweaked for performance improvement.

1. If you see lots of records are being spilled to the disk (check for Spilled Records in the counters in your MapReduce output) you can increase the memory available for Map to perform the Shuffle by increasing the value in io.sort.mb. This will reduce the amount of Map Outputs written to the disk so the sorting of the keys can be performed in memory.
2. On the reduce side the merge operation (merging the output from several mappers) can be done in memory by setting mapred.inmem.merge.threshold to 0. These settings can also be passed per job, as in the sketch below.
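
These shuffle and memory settings can be supplied on the command line instead of cluster-wide; a sketch, assuming the job's driver uses ToolRunner so generic -D options are honored (the jar, class, and paths are placeholders):

hadoop jar my-job.jar com.example.MyJob \
  -D io.sort.mb=256 \
  -D mapreduce.map.memory.mb=2048 \
  -D mapreduce.reduce.memory.mb=4096 \
  input_dir output_dir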

17. How will you kill a Hadoop job?
Ans: hadoop job -kill <job_id>
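
On a YARN (MRv2) cluster the same job can also be killed through the ResourceManager; the IDs below are placeholders:

mapred job -kill job_1468912345678_0001
yarn application -kill application_1468912345678_0001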

18. If you have to add a service or install a component in an existing Hadoop cluster, how will you do that?
Ans: By Hadoop job, you probably mean a MapReduce job. If your NameNode is down and you don't have a spare one (in an HA setup), your HDFS will not be working and every component dependent on that HDFS namespace will be either stuck or crashed.
  1. JobTracker (YARN ResourceManager in Hadoop 2.x)
  2. I am not completely sure, but probably the job will be submitted and fail afterwards.
  3. You cannot submit a job to a stopped JobTracker.


19. How will you restart the NameNode?
Ans: The easiest way is to run stop-all.sh to stop all the daemons, and then restart them (including the NameNode) by running start-all.sh.
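
If only the NameNode needs to be restarted (rather than the whole cluster), the per-daemon script can be used instead; a sketch for Hadoop 1.x:

hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode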

20. How will you add a node to a Hadoop cluster? What are the steps, and which files do you need to edit?
Ans: To add new nodes to the HDFS cluster, the hostnames should be added to the slaves file and then the DataNode and TaskTracker should be started on the new node.
To remove or decommission nodes from the HDFS cluster, the hostnames should be removed from the slaves file (or added to the excludes file) and -refreshNodes should be executed.
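
A minimal sketch of the commissioning steps described above for Hadoop 1.x (node5 is a placeholder hostname, and the conf path assumes a standard 1.x layout):

# on the master: add the new host to the slaves file
echo "node5" >> $HADOOP_HOME/conf/slaves
# on the new node: start the daemons
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
# on the master: ask the NameNode to re-read its host lists
hadoop dfsadmin -refreshNodes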

21. What is your contribution to Hive in your current project?

22. What do you do through Oozie in your current project?

23. What are the schedulers available in Hadoop?
Ans: FIFO Scheduler – This scheduler does not consider the heterogeneity in the system but orders the jobs based on their arrival times in a queue.
COSHH – This scheduler considers the workload, the cluster, and user heterogeneity when making scheduling decisions.

Fair Sharing – This Hadoop scheduler defines a pool for each user. The pool contains a number of map and reduce slots on a resource. Each user can use their own pool to execute jobs.

Saturday, July 16, 2016

Hadoop Admin Interview Question 1


1: Can you describe your current roles and responsibilities or day-to-day activities?
2: Please describe the YARN architecture.
3: What is NameNode heap memory and how can we configure the heap memory?
4: How do you install a Hadoop cluster? Please describe in detail which services and components you install during Hadoop installation.
5: How do you enable a repository during installation and what details do you provide there?
6: What is the Hive metastore and how do you connect to it?
7: If the Hive metastore service is down, what will be the impact on the Hadoop cluster?
8: Do we install the Hive service on every node in the Hadoop cluster?
9: What is Beeline?
10: What is HiveServer2?
11: How do you connect to Hive through Beeline?
12: What is a Thrift client?
13: What are the JobTracker and the ResourceManager?
14: What is the use of the ZooKeeper service and why do we need it?
15: How do you troubleshoot if the NameNode is down, in Hadoop version 1 and also in Hadoop version 2?
16: How do you troubleshoot if some services are down in the Hadoop cluster?
17: How do you troubleshoot a slow-running job?
18: What are the benefits of using YARN?
19: Is it possible to run MRv1 and MRv2 on a single cluster?
20: What is the FIFO scheduler?
21: What is the Capacity scheduler?
22: What is the difference between the FIFO and Capacity schedulers?
23: How do you execute a job on a cluster using the FIFO scheduler?
24: How do you identify a long-running job in a large, busy cluster?
25: How do you kill a Hadoop job if the cluster is configured with the Capacity scheduler?
26: What is a Kerberos realm and how do you define it?
27: How do you define and create a Kerberos principal?
28: How do you add a new user to a Hadoop cluster?
29: How do you grant a user permissions on a particular directory in a Hadoop cluster?
30: How do we decide the heap memory limit for a Hadoop service?
31: How do we decide the heap memory limit for the NameNode?
32: How do you increase the NameNode heap memory?
33: What is a Standby NameNode and what is a high-availability Hadoop cluster?
34: How do you resolve a connectivity issue between the Active NameNode and the Standby NameNode? What will be the impact on the Hadoop cluster, and will the Standby NameNode try to become active?
35: A few DataNodes are running slow. What will be the impact on the jobs running on those DataNodes, and what will be the impact on overall cluster performance?
36: What is the difference between a dead node and a blacklisted node, and how does a node become blacklisted?
37: How does the NameNode decide that a node is dead?
38: What is speculative execution? What does it do?
39: How do you schedule jobs in a Hadoop cluster?
40: Which version of MapReduce are you using?
41: What is the difference between MapReduce version one and MapReduce version two?
42: How do you identify a long-running job and how do you troubleshoot it?
43: How do you kill a job?
44: How do you add a service or install a component in an existing Hadoop cluster?
45: How do you restart the NameNode?
46: How do you add or remove a DataNode in a Hadoop cluster? What are the steps and which files do you edit for it?
47: What is Hive and what work have you done on Hive?
48: What is Oozie and how do you use it?
49: What are the schedulers available in Hadoop?
50: When you submit a Spark job in Hadoop 2.x, how does Spark interact with YARN, and how are resources negotiated for Spark in YARN?
51: What is a SparkContext? What is it used for?
52: Why can a Spark job run only on Hadoop 2.x and not on 1.x?
53: What is the default YARN scheduler?
54: How do jobs get scheduled in YARN? Which component is responsible for it? How do containers do the resource allocation in YARN?
55: If you submit a Spark job in a Hadoop cluster, how do containers do the resource negotiation for the Spark job?
56: How do you troubleshoot if a DataNode is down? Which log files do you check?
57: How do you increase the storage capacity of a Hadoop cluster?
58: What happens after adding a new DataNode to a Hadoop cluster?
59: What is the balancer, and how do you schedule it?
60: You try to log in to a machine of your cluster and you get a timeout exception. What could be the issue, and what will be your steps to resolve it?
61: How do you start a process in Linux?
62: In which cases is speculative execution not beneficial?
63: When we run a MapReduce job, what are the processes involved on the mapper side before data goes to the reducer?

Thursday, July 14, 2016

Hadoop Configuration Files


1. hadoop-env.sh

This file specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop).
As the Hadoop framework is written in Java and uses the Java Runtime Environment, one of the important environment variables for the Hadoop daemons is $JAVA_HOME in hadoop-env.sh. This variable points the Hadoop daemons to the Java installation on the system.

This file is also used to set other parts of the Hadoop daemon execution environment, such as the heap size (HADOOP_HEAPSIZE), the Hadoop home (HADOOP_HOME), the log file location (HADOOP_LOG_DIR), etc.
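
For illustration, typical lines in hadoop-env.sh look like the following (the paths and heap size are examples only):

export JAVA_HOME=/usr/java/jdk1.8.0_71
export HADOOP_HEAPSIZE=1024          # daemon heap size, in MB
export HADOOP_LOG_DIR=/var/log/hadoop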


Note: To keep the cluster setup easy to understand, we have configured only the parameters necessary to start a cluster.

2. core-site.xml

This file tells the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.


The fs.default.name value specifies the hostname and port on which the NameNode daemon runs and listens. It also tells the NameNode which IP and port it should bind to. The commonly used port is 8020, and you can also specify an IP address rather than a hostname.

3. hdfs-site.xml

This file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes.

You can also configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when the file is created. The default is used if replication is not specified at create time.

The value “true” for property ‘dfs.permissions’ enables permission checking in HDFS and the value “false” turns off the permission checking. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.


4. mapred-site.xml

This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers. The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker to the TaskTrackers and MapReduce clients.


You can replicate all four of the files explained above to all the DataNodes and the Secondary NameNode. These files can then be adjusted for any node-specific configuration, e.g. in case of a different JAVA_HOME on one of the DataNodes.

5. masters

This file tells the Hadoop daemons where the Secondary NameNode runs. The ‘masters’ file on the master server contains the hostname of the Secondary NameNode server.


6. slaves

The ‘slaves’ file at the master node contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers.


The ‘slaves’ file on a slave server contains the IP address of that slave node. Notice that the ‘slaves’ file at a slave node contains only its own IP address and not those of any other DataNodes in the cluster.
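
A minimal sketch of both files on the master node, written from the Hadoop conf directory (the hostnames are placeholders for your own nodes):

echo "node2.hadoop.com" > masters      # Secondary NameNode host
echo "node1.hadoop.com" > slaves       # DataNode / TaskTracker hosts, one per line
echo "node2.hadoop.com" >> slaves
echo "node3.hadoop.com" >> slaves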

Installation of Hadoop 1.x


1. Configure the hostname
     vi /etc/hostname
2. Configure host name resolution (the hosts file)
     vi /etc/hosts
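
For example, the hosts entry could look like the following (the IP is a placeholder; node1.hadoop.com matches the hostname used in the configuration files below):

echo "192.168.1.10   node1.hadoop.com   node1" >> /etc/hosts
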
3. Create directories

mkdir /usr/java
mkdir /usr/hadoop
mkdir /usr/hadoop/data
mkdir /usr/hadoop/namenode
mkdir /usr/hadoop/tmp

4. Configure Java
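
This can be done the same way as in the Hortonworks notes above (the JDK tarball name and version are whatever you downloaded):

mkdir /usr/java
tar -zxvf jdk-8u71-linux-x64.tar.gz -C /usr/java
export JAVA_HOME=/usr/java/jdk1.8.0_71
export PATH=$PATH:$JAVA_HOME/bin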

5. Add group
groupadd hadoop
useradd -g hadoop hduser
passwd hduser

6. Extract the Hadoop tarball
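
A sketch, assuming the Apache Hadoop 1.2.1 tarball (the filename is an assumption; the target matches the HADOOP_HOME used in .bashrc below):

tar -zxvf hadoop-1.2.1.tar.gz -C /usr/hadoop/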

7. Give hduser ownership of the Hadoop folder

chown -R hduser:hadoop /usr/hadoop

8. Set up passwordless SSH

ssh-keygen
cd .ssh/
cat id_rsa.pub >> authorized_keys

9. Configure .bashrc file

export HADOOP_HOME=/usr/hadoop/hadoop-1.2.1/
export PATH=$PATH:$HADOOP_HOME/bin
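
Reload the file and confirm the hadoop command resolves (assuming the paths above):

source ~/.bashrc
hadoop version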


10. Configure the Hadoop configuration files.

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/hadoop/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/hadoop/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
</configuration>


core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1.hadoop.com:54310</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1.hadoop.com:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

slaves and masters files: add the appropriate hostnames to them.

11. Format the namenode

hadoop namenode -format -force

12. Run start-all.sh.
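
To confirm the daemons are running, jps can be used; on a single node the output should list NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker:

jps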


Kafka Architecture

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you t...