Tuesday, February 14, 2017

Decommissioning Slave Nodes

Hadoop provides the decommission feature to retire a set of existing slave nodes (DataNodes, NodeManagers, or HBase RegionServers) in order to prevent data loss. Slave nodes are frequently decommissioned for maintenance. As a Hadoop administrator, you will periodically decommission slave nodes, either to reduce the cluster size or to gracefully remove dying nodes.

Prerequisites
• Ensure that the following property is defined in your hdfs-site.xml file.
<property>
  <name>dfs.hosts.exclude</name>
  <value><HADOOP_CONF_DIR>/dfs.exclude</value>
  <final>true</final>
</property>

where <HADOOP_CONF_DIR> is the directory for storing the Hadoop configuration files.
For example, /etc/hadoop/conf.

• Ensure that the following property is defined in your yarn-site.xml file.

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value><HADOOP_CONF_DIR>/yarn.exclude</value>
  <final>true</final>
</property>


where <HADOOP_CONF_DIR> is the directory for storing the Hadoop configuration files.
For example, /etc/hadoop/conf.
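
Both exclude files can start out empty; a minimal setup sketch, assuming the conventional /etc/hadoop/conf directory:

touch /etc/hadoop/conf/dfs.exclude
touch /etc/hadoop/conf/yarn.exclude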

Decommission DataNodes or NodeManagers

Nodes normally run both a DataNode and a NodeManager, and both are typically commissioned or decommissioned together.

With the replication level set to three, HDFS is resilient to individual DataNode failures. However, there is a high chance of data loss if you terminate DataNodes without decommissioning them first: nodes must be decommissioned on a schedule that allows the blocks they hold to be re-replicated elsewhere.

If a NodeManager is shut down, on the other hand, the ResourceManager simply reschedules its tasks on other nodes in the cluster. However, decommissioning a NodeManager may still be required when you want it to stop accepting new tasks, or when its tasks take a long time to execute but you still want to be agile in your cluster management.


Decommission DataNodes

Use the following instructions to decommission DataNodes in your cluster:

• On the NameNode host machine, edit the <HADOOP_CONF_DIR>/dfs.exclude file and add the list of DataNode hostnames (one per line), where <HADOOP_CONF_DIR> is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
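
For example, a dfs.exclude file that retires two nodes might look like this (the hostnames are purely illustrative):

dn7.example.com
dn8.example.com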

• Update the NameNode with the new set of excluded DataNodes. On the NameNode host machine, execute the following command:

su <HDFS_USER>
hdfs dfsadmin -refreshNodes


where <HDFS_USER> is the user owning the HDFS services. For example, hdfs.

• Open the NameNode web UI (http://<NameNode_FQDN>:50070) and navigate to the DataNodes page. Check whether the state has changed to Decommission In Progress for the DataNodes being decommissioned.

• When all the DataNodes report their state as Decommissioned on the DataNodes page, all of the blocks have been replicated and you can shut down the decommissioned nodes. (The corresponding decommissioned NodeManagers are listed on the Decommissioned Nodes page of the ResourceManager UI at http://<ResourceManager_FQDN>:8088/cluster/nodes/decommissioned.)
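
You can also verify the decommission state from the command line; the dfsadmin report prints a Decommission Status line for each DataNode:

su <HDFS_USER>
hdfs dfsadmin -report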

• If your cluster utilizes a dfs.include file, remove the decommissioned nodes from the <HADOOP_CONF_DIR>/dfs.include file on the NameNode host machine, then execute the following command:

su <HDFS_USER>
hdfs dfsadmin -refreshNodes


Note: If no dfs.include file is specified, all DataNodes are considered to be included in the cluster (unless excluded in the dfs.exclude file). The dfs.hosts and dfs.hosts.exclude properties in hdfs-site.xml are used to specify the dfs.include and dfs.exclude files.

Decommission NodeManagers

Use the following instructions to decommission NodeManagers in your cluster:

• On the ResourceManager host machine, edit the <HADOOP_CONF_DIR>/yarn.exclude file and add the list of NodeManager hostnames (one per line), where <HADOOP_CONF_DIR> is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

• If your cluster utilizes a yarn.include file, remove the decommissioned nodes from
the <HADOOP_CONF_DIR>/yarn.include file on the ResourceManager host machine.

Note: If no yarn.include file is specified, all NodeManagers are considered to be included in the cluster (unless excluded in the yarn.exclude file). The yarn.resourcemanager.nodes.include-path and yarn.resourcemanager.nodes.exclude-path properties in yarn-site.xml are used to specify the yarn.include and yarn.exclude files.

• Update the ResourceManager with the new set of NodeManagers. On the ResourceManager host machine, execute the following command:

su <YARN_USER>
yarn rmadmin -refreshNodes

where <YARN_USER> is the user owning the YARN services. For example, yarn.
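
To confirm the result from the command line, list all nodes and their states; decommissioned NodeManagers are reported in the DECOMMISSIONED state:

su <YARN_USER>
yarn node -list -all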

Decommission HBase RegionServers

Use the following instruction to decommission HBase RegionServers in your cluster. On the RegionServer host that you want to decommission, execute:

su <HBASE_USER>
/usr/hdp/current/hbase-client/bin/hbase-daemon.sh stop regionserver


where <HBASE_USER> is the user who owns the HBase services. For example, hbase.

The RegionServer closes all of its regions and then shuts down.
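
To avoid a burst of region reassignments, HBase also ships a graceful_stop.sh script that moves the regions off the server before stopping it; a sketch, assuming the same HDP path layout:

su <HBASE_USER>
/usr/hdp/current/hbase-client/bin/graceful_stop.sh <RegionServer_FQDN>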

Monday, February 13, 2017

Rack Awareness in Hadoop

Rack awareness is knowledge of the cluster topology, or more specifically, of how the DataNodes are distributed across the racks of a Hadoop cluster. The importance of this knowledge rests on the assumption that DataNodes colocated in the same rack have more bandwidth and lower latency between them, whereas two DataNodes in separate racks communicate with comparatively less bandwidth and higher latency.

Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.

The main purposes of rack awareness are:
  • Increasing the availability of data blocks.
  • Better cluster performance.
Let us assume the cluster has 9 DataNodes with a replication factor of 3.
Let us also assume that there are 3 physical racks where these machines are placed:

    Rack1: DN1;DN2;DN3
    Rack2: DN4;DN5;DN6
    Rack3: DN7;DN8;DN9

The following diagram depicts an example block placement when HDFS and YARN are not rack aware:

  • What happens if Rack1 goes down? -> Potentially, the data in Block1 might be lost.
    • Not being rack aware, the entire cluster is treated as if it were placed in /default-rack.
The following diagram depicts an example block placement when HDFS and YARN are rack aware:
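
Rack awareness is enabled by pointing the net.topology.script.file.name property (in core-site.xml) at a script that maps each host to a rack. A minimal sketch for the nine hypothetical nodes above (Hadoop invokes the script with one or more hostnames or IPs and expects one rack path per line):

#!/bin/bash
# Hypothetical topology script: prints one rack per host argument.
for host in "$@"; do
  case "$host" in
    dn1*|dn2*|dn3*) echo "/rack1" ;;
    dn4*|dn5*|dn6*) echo "/rack2" ;;
    dn7*|dn8*|dn9*) echo "/rack3" ;;
    *)              echo "/default-rack" ;;
  esac
done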

Wednesday, February 8, 2017

Calculating the Capacity of a Node

Because YARN has now removed the hard-partitioned mapper and reducer slots of Hadoop Version 1, new capacity calculations are required. There are eight important parameters for calculating a node's capacity, specified in mapred-site.xml and yarn-site.xml:

In mapred-site.xml:

1. mapreduce.map.memory.mb
2. mapreduce.reduce.memory.mb

These are the hard limits enforced by Hadoop on each mapper or reducer task.

3. mapreduce.map.java.opts
4. mapreduce.reduce.java.opts

The JVM heap size (-Xmx) for the mapper or reducer task. Remember to leave room for the JVM permanent generation and the native libraries used. This value should always be lower than mapreduce.[map|reduce].memory.mb.

In yarn-site.xml:

• yarn.scheduler.minimum-allocation-mb
The smallest container that YARN will allow.

• yarn.scheduler.maximum-allocation-mb
The largest container that YARN will allow.

• yarn.nodemanager.resource.memory-mb
The amount of physical memory (RAM) for Containers on the compute node. It is
important that this is not equal to the total amount of RAM on the node, as other
Hadoop services also require RAM.

• yarn.nodemanager.vmem-pmem-ratio
The amount of virtual memory that each Container is allowed. This can be calculated with:

containerMemoryRequest*vmem-pmem-ratio

As an example, consider a configuration with the settings in the following table.

Example YARN MapReduce Settings

Property                               Value
-------------------------------------  ---------
mapreduce.map.memory.mb                1536
mapreduce.reduce.memory.mb             2560
mapreduce.map.java.opts                -Xmx1024m
mapreduce.reduce.java.opts             -Xmx2048m
yarn.scheduler.minimum-allocation-mb   512
yarn.scheduler.maximum-allocation-mb   4096
yarn.nodemanager.resource.memory-mb    36864
yarn.nodemanager.vmem-pmem-ratio       2.1

With these settings, each map and reduce task has a generous 512 MB of overhead for the Container, as evidenced by the difference between mapreduce.[map|reduce].memory.mb and mapreduce.[map|reduce].java.opts.

Next, YARN has been configured to allow a Container no smaller than 512 MB and no larger than 4 GB. The compute nodes have 36 GB of RAM available for Containers. With a virtual memory ratio of 2.1 (the default value), each map Container can use up to 1536 x 2.1 = 3225.6 MB of virtual memory, and each reduce Container can use up to 2560 x 2.1 = 5376 MB.

This means that the compute node configured for 36GB of Container space can support
up to 24 maps or 14 reducers, or any combination of mappers and reducers allowed by the
available resources on the node.
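
These numbers are easy to re-derive for other configurations; a quick arithmetic sketch in shell, using the values from the table above:

NODE_MB=36864   # yarn.nodemanager.resource.memory-mb
MAP_MB=1536     # mapreduce.map.memory.mb
REDUCE_MB=2560  # mapreduce.reduce.memory.mb
echo "max concurrent maps:    $(( NODE_MB / MAP_MB ))"         # 24
echo "max concurrent reduces: $(( NODE_MB / REDUCE_MB ))"      # 14
echo "map vmem limit (MB):    $(echo "$MAP_MB * 2.1" | bc)"    # 3225.6
echo "reduce vmem limit (MB): $(echo "$REDUCE_MB * 2.1" | bc)" # 5376.0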

Tuesday, February 7, 2017

Common Client-Side Issues


Job Fails to Start

Symptom: Exception When Job Submitted
Potential Root Cause: Mistake in the Job's User Code

Troubleshooting Steps:
• Examine the Node Manager/Resource Manager logs and the task logs to find the exact exception.

Resolution Steps:
• Examine the stack trace for the thrown exception.
• Examine the user code to see if you can spot the error.

Information to Collect:
• Resource Manager log.
• Node Manager log.
• The exception trace that the user has mentioned, and the task logs.
• If possible, at least a snippet of the Java code from the area where the exception was thrown.

Symptom: "No Class Def Found" Troubleshooting Steps:


or Similar Exception When
• Verify that the exception is ClassNotFound,
Trying to Start Job, Potential
NoSuchMethodError, or a similar exception.










Root Cause 1: Job's .jar File
Resolution Steps:



-- or Other .jar File -- Not on
• Find the .jar file that contains the missing class and
Classpath


add it to the classpath.






Information to Collect:






• The entire command used to submit the job.




• The stack-trace from the Node Manager logs.










Potential Root Cause 2: Main
Troubleshooting Steps:


Class or Method of the Job
• Examine the code for the main MRv2 class.
Code is not "Public Static"




















Resolution Steps:







• Set access modifiers to "public static"





• Recompile and re-test.






Information to Collect:






• The exact exception thrown by Hadoop.




• The job source code.
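
For the classpath root cause, a common fix is to put the dependency on the client classpath and ship it with the job; a sketch (the .jar names and paths are hypothetical, and -libjars requires the job driver to use ToolRunner/GenericOptionsParser):

export HADOOP_CLASSPATH=/path/to/extra-lib.jar
hadoop jar my-job.jar com.example.MyJob -libjars /path/to/extra-lib.jar <input> <output>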


Job Seems to Hang in Setup

Symptom: Job Seems to Hang and Node Manager Becomes Blacklisted
Potential Root Cause: Too Many Allowed Slots Configured for the System Memory on the Node

Troubleshooting Steps:
• Verify the amount of system memory.
• Calculate the required memory for each configured Container.
• Take into account any other processes running on the node.

Resolution Steps:
• Add all of the above. If the total is greater than the total available on the node, you will need to reduce the amount configured in the Container properties.

Symptom: Job Seems to Hang Without "Blacklisting"
Potential Root Cause: No Node Managers Currently Available

Troubleshooting Steps:
• Verify the number of available MRv2 tasks by looking at: <Resource Manager host>:8088/cluster/nodes

Resolution Steps:
• Wait until more Node Managers become available, then see if the job runs.

Information to Collect:
• None until the job actually fails to run; then troubleshoot based on the failure symptom.

Monday, February 6, 2017

Error: Could Only be Replicated to x Nodes, Instead of n - Server-Side Issue

Potential Root Cause: At Least One DataNode is Nonfunctional

Troubleshooting Steps:
• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause: One or More DataNodes Are Out of Space on Their Currently Available Disk Drives

Troubleshooting Steps:
• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Node Manager Denied Communication with Resource Manager - Server-Side Issue

Potential Root Cause: Hostname is in the Exclude File, or Doesn't Exist in the Include File

Troubleshooting Steps:
• Verify the contents of the files referenced in the yarn.resourcemanager.nodes.exclude-path property or the yarn.resourcemanager.nodes.include-path property.
• Verify that the host for this Node Manager is not being decommissioned.

Resolution Steps:
• If the hostname for the Node Manager is in the file, and it is not meant to be decommissioned, remove it.

Information to Collect:
• The files pointed to by the yarn.resourcemanager.nodes.exclude-path or yarn.resourcemanager.nodes.include-path properties.

Potential Root Cause: Node was Decommissioned and/or Reinserted into the Cluster

Troubleshooting Steps:
• This is a problem with HDFS. Refer to the HDFS Troubleshooting Guide.

Potential Root Cause: Resource Manager is Refusing the Connection

Troubleshooting Steps:
• Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.

Resolution Steps:
• Ensure that the DataNode is not in the "exclude" list.
• Ensure that the DataNode host is in the "include" list.

Information to Collect:
• NameNode slaves file.
• NameNode hosts.deny file (or the file specified as the blacklist in the HDFS configuration).
• NameNode hosts.allow file (or the file specified as the whitelist in the HDFS configuration).
• HDFS configuration.

Potential Root Cause: NameNode is Refusing the Connection

Troubleshooting Steps:
• Follow the steps (described previously) to ensure that the NameNode has started and is accepting requests.

Resolution Steps:
• Ensure that the DataNode is not in the "exclude" list.

Information to Collect:
• NameNode slaves file.
• NameNode hosts.deny file (or the file specified as the blacklist in the HDFS configuration).
• NameNode hosts.allow file (or the file specified as the whitelist in the HDFS configuration).
• HDFS configuration.

Node Java Process Exited Abnormally - Server-Side Issue

Potential Root Cause: Improper Shutdown

Troubleshooting Steps:
• Investigate the OS history and the Hadoop audit logs.
• Verify that no edit log or fsimage corruption occurred.

Resolution Steps:
• Investigate the cause, and take measures to prevent future occurrences.

Information to Collect:
• Hadoop audit logs.
• Linux command: last (login history).
• Linux user command history.

Potential Root Cause: Incorrect Memory Configuration

Troubleshooting Steps:
• Verify the values in the configuration files.
• Check the logs for stack traces -- out of heap space, or similar.

Resolution Steps:
• Fix the configuration values and restart the job/Resource Manager/Node Manager.

Information to Collect:
• Resource Manager log.
• Node Manager log.
• MapReduce v2 configuration files.

Resource Manager or Node Manager: Fails to Start or Crashes - Server-Side Issue

Symptoms may include:
• Process appears to start, but then disappears from the process list.
• Node Manager cannot bind to interface.
• Kernel panic, halt.

Potential Root Cause: Existing Process Bound to Port

Troubleshooting Steps:
• Examine the bound ports to ensure that no other process has already bound them.

Resolution Steps:
• Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.

Information to Collect:
• List of bound interfaces/ports and the owning processes.
• Resource Manager log.
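
A quick way to check for a port conflict (a sketch; 8088 is the default Resource Manager web UI port, substitute the port from your error message):

netstat -tlnp | grep 8088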

Potential Root Cause: Incorrect File Permissions

Troubleshooting Steps:
• Verify that all Hadoop file system permissions are set properly.
• Verify the Hadoop configurations.

Resolution Steps:
• Follow the procedures for handling failures due to file permissions (see Hortonworks KB Solutions/Articles).
• Fix any incorrect configuration.

Information to Collect:
• A dump of the file system permissions, ownership, and flags. Look up the yarn.nodemanager.local-dirs property in the yarn-site.xml file; if, for example, it has a value of "/hadoop/yarn/local", run from the command line:
ls -lR /hadoop/yarn/local
• Resource Manager log.
• Node Manager log.

Potential Root Cause: Incorrect Name-to-IP Resolution

Troubleshooting Steps:
• Verify that the name/IP resolution is correct for all nodes in the cluster.

Resolution Steps:
• Fix any incorrect configuration.

Information to Collect:
• Local hosts file for all hosts on the system (/etc/hosts).
• Resolver configuration (/etc/resolv.conf).
• Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX, where X is the number of the interface card).

Potential Root Cause: Java Heap Space Too Low

Troubleshooting Steps:
• Examine the heap space property in yarn-env.sh.
• Examine the settings in Ambari cluster management.

Resolution Steps:
• Adjust the heap space property until the Resource Manager resumes running.

Information to Collect:
• yarn-env.sh from the cluster.
• Screenshot of the Ambari cluster management mapred settings screen.
• Resource Manager log.
• Node Manager log.

Potential Root Cause: Permissions Not Set Correctly on Local File System

Troubleshooting Steps:
• Examine the permissions on the various directories on the local file system.
• Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).

Resolution Steps:
• Use the chmod command to change the permissions of the directories to 755 (see the sketch below).
• Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred).
• Relaunch the Hadoop daemons using the correct user.

Information to Collect:
• core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.
• Permissions listing for the directories listed in the above configuration files.
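
A sketch of the permissions fix, assuming the YARN local directory from the earlier example and the conventional yarn user and hadoop group (adjust to your cluster):

chown -R yarn:hadoop /hadoop/yarn/local
chmod -R 755 /hadoop/yarn/local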

Potential Root Cause: Insufficient Disk Space

Troubleshooting Steps:
• Verify that there is sufficient space on all system, log, and HDFS partitions.
• Run the df -k command on the Name/DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode or HDFS data.

Resolution Steps:
• Free up disk space on all nodes in the cluster.
-OR-
• Add additional capacity.

Information to Collect:
• Core dumps.
• Linux command: last (login history).
• Dump of file system information.
• Output of the df -k command.

Potential Root Cause: Reserved Disk Space is Set Higher than Free Space

Troubleshooting Steps:
• In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the available free space on the drive.

Resolution Steps:
• Configure an appropriate value, or increase the free space.

Information to Collect:
• HDFS configuration files.
Sunday, February 5, 2017

Map Reduce Troubleshooting Actions Checklist

1. Understand the issue. Check to see if the issue exists with all MRv2 jobs. Determine when a particular job or script last worked successfully. Understand the actual and expected behavior and formulate a problem statement.

2. Verify that all components related to MRv2 are running. You can use the ps and jps commands to check whether the processes for dependent components are running. Ensure that all ports are listening, are bound to a process, and accept connections (i.e., check for firewall issues).

3. Look at the job details in the Resource Manager UI.
a. Use the UI to navigate to the job attempt.
b. Look at the log file for the failed attempt.

4. Look at the Resource Manager and the Node Manager log files, either in the Resource Manager UI or on the specific nodes.

5. Use the yarn logs command to collect all of the logs of the Containers (see the example after this list).

6. Check the Job Configuration in the Resource Manager UI to make sure that all of the desired parameters were actually passed on to the job.

7. Run the MRv2 pi job provided with the HDP examples to see if that job succeeds (see the example after this list):
a. If it succeeds, check to see if there is a problem with the client or the data.
b. If it fails, there is probably some basic problem with the process or the configuration.

8. If the job is run through streaming or pipes, run a similar job to troubleshoot.

9. If the job is started by one of the other HDP components, look at the component-specific guide.

10. Look for the operating system information and verify that it is supported.

11. Search the Hortonworks Knowledge Base for a possible solution.

12. If the issue is still not resolved, log a case in the Hortonworks Support Portal:
a. Provide all of the information gathered in the preceding steps, along with the information in the "Checklist of Items to Collect" list in the following section.
b. Tar the configuration files and the log files and attach them to the case.
c. Inform Hortonworks whether it is a Production, Development, or POC environment.
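
For steps 5 and 7, minimal command sketches (the examples .jar path assumes an HDP layout, and the application ID shown is hypothetical):

yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 16 1000
yarn logs -applicationId application_1486683038740_0001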


Checklist of Items to Collect


1. Collect the most recent log files for all of the MRv2 daemons.

2. Get copies of the configuration files (hdfs-site.xml, yarn-site.xml, environment files, etc.).

3. Provide the number of DataNodes in the cluster, as well as the total number of nodes.

4. Use the yarn logs command to collect the log files of the Containers for all of the tasks.

5. How was HDP installed -- with Ambari, or manually with RPM?

6. Provide hardware specifications: CPU, memory, disk drives, number of network interfaces.
