Monday, February 6, 2017

Resource Manager or Node Manager: Fails to Start or Crashes - Server-Side Issue

Symptoms may include:
• Process appears to start, but then disappears from the process list.
• Node Manager cannot bind to interface.
• Kernel panic or halt.
Potential Root Cause: Existing Process Bound to Port
Troubleshooting Steps:
• Examine bound ports to ensure no other process has already bound (example commands below).
Resolution Steps:
• Resolve the port conflict before attempting to restart the Resource Manager/Node Manager.
Information to Collect:
• List of bound interfaces/ports and the owning process.
• Resource Manager log.
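For example, to check whether another process already owns one of the YARN ports, you can run something like the following on the affected node. The ports shown are common defaults for the Resource Manager web UI and client RPC address; substitute whatever is configured in your yarn-site.xml:
netstat -tulpn | grep -E ':8088|:8032'
lsof -i :8088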
Potential Root Cause: Incorrect File Permissions
Troubleshooting Steps:
• Verify that all Hadoop file system permissions are set properly.
• Verify the Hadoop configurations.
Resolution Steps:
• Follow the procedures for handling failure due to file permissions (see Hortonworks KB Solutions/Articles).
• Fix any incorrect configuration.
Information to Collect:
• Dump of file system permissions, ownership, and flags. Look up the value of the yarn.nodemanager.local-dirs property in the yarn-site.xml file; in this case it has a value of "/hadoop/yarn/local", so from the command line run:
ls -lR /hadoop/yarn/local
• Resource Manager log.
• Node Manager log.
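As a sketch of that collection step, assuming the usual HDP configuration directory /etc/hadoop/conf (adjust the path if your configuration lives elsewhere), you could pull the local-dirs value and then list its permissions:
grep -A1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
ls -lR /hadoop/yarn/local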
Potential Root Cause: Incorrect Name-to-IP Resolution
Troubleshooting Steps:
• Verify that name/IP resolution is correct for all nodes in the cluster (example checks below).
Resolution Steps:
• Fix any incorrect configuration.
Information to Collect:
• Local hosts file for all hosts on the system (/etc/hosts).
• Resolver configuration (/etc/resolv.conf).
• Network configuration (/etc/sysconfig/network-scripts/ifcfg-ethX, where X = number of the interface card).
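A quick way to spot-check resolution from any node is shown below; node1.example.com is a placeholder, so use your actual cluster hostnames and confirm the results agree with /etc/hosts and DNS:
hostname -f
getent hosts node1.example.com
nslookup node1.example.com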
Potential Root Cause: Java Heap Space Too Low
Troubleshooting Steps:
• Examine the heap space property in yarn-env.sh (example below).
• Examine the settings in Ambari cluster management.
Resolution Steps:
• Adjust the heap space property until the Resource Manager resumes running.
Information to Collect:
• yarn-env.sh from the cluster.
• Screenshot of the Ambari cluster management mapred settings screen.
• Resource Manager log.
• Node Manager log.
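As an illustration, yarn-env.sh typically sets the daemon heap sizes (in MB) through variables like the ones below; the values are examples only and should be tuned for your cluster, or changed through Ambari on Ambari-managed clusters, since Ambari regenerates this file:
export YARN_RESOURCEMANAGER_HEAPSIZE=2048
export YARN_NODEMANAGER_HEAPSIZE=1024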
Potential Root Cause: Permissions Not Set Correctly on the Local File System
Troubleshooting Steps:
• Examine the permissions on the various directories on the local file system.
• Verify proper ownership (yarn/mapred for MapReduce directories and hdfs for HDFS directories).
Resolution Steps:
• Use the chmod command to change the permissions of the directories to 755.
• Use the chown command to assign the directories to the correct owner (hdfs or yarn/mapred), as in the example below.
• Relaunch the Hadoop daemons using the correct user.
Information to Collect:
• core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml
• Permissions listing for the directories listed in the above configuration files.
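For example, if the NodeManager local directory is /hadoop/yarn/local (check yarn.nodemanager.local-dirs for the real path) and the yarn user should own it, the fix would look roughly like this; the hadoop group shown is a typical but not universal choice:
chown -R yarn:hadoop /hadoop/yarn/local
chmod -R 755 /hadoop/yarn/local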
Potential Root Cause: Insufficient Disk Space
Troubleshooting Steps:
• Verify that there is sufficient space on all system, log, and HDFS partitions.
• Run the df -k command on the Name/DataNodes to verify that there is sufficient capacity on the disk volumes used for storing NameNode or HDFS data (example below).
Resolution Steps:
• Free up disk space on all nodes in the cluster.
-OR-
• Add additional capacity.
Information to Collect:
• Core dumps.
• Linux command: last (login history).
• Dump of file system information.
• Output of the df -k command.
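For instance, the checks above might be run against the DataNode data directory and the YARN log directory; the paths below are common HDP defaults and should be replaced with the directories actually configured on your nodes:
df -k /hadoop/hdfs/data
du -sk /var/log/hadoop-yarn/*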
Potential Root Cause: Reserved Disk Space Is Set Higher than Free Space
Troubleshooting Steps:
• In hdfs-site.xml, check that the value of the dfs.datanode.du.reserved property is less than the available free space on the drive.
Resolution Steps:
• Configure an appropriate value (example below), or increase free space.
Information to Collect:
• HDFS configuration files.
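As a sketch, the property is set in hdfs-site.xml and is expressed in bytes per volume; the 1 GB value below is only an example, and the right number depends on how much non-HDFS space the node needs:
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>  <!-- example: reserve 1 GB per volume for non-HDFS use -->
</property>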
