Thursday, February 22, 2018

Different ways to start Hadoop daemon processes

There are different ways to start Hadoop daemon processes, and it matters which one you use. Newbies usually know how to start the Hadoop processes, but they often don't know the differences among these methods.

So basically, Hadoop processes can be started or stopped in three ways:

1- start-all.sh and stop-all.sh
2- start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh
3- hadoop-daemon.sh start namenode/datanode and hadoop-daemon.sh stop namenode/datanode

Differences


1- start-all.sh and stop-all.sh: Used to start and stop hadoop daemons all at once. Issuing it on the master machine will start/stop the daemons on all the nodes of a cluster.

2- start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh: Same as above, but start/stop the HDFS and YARN daemons separately from the master machine on all the nodes. It is now advisable to use these commands instead of start-all.sh & stop-all.sh, which are deprecated.

3- hadoop-daemon.sh start namenode/datanode and hadoop-daemon.sh stop namenode/datanode: Used to start or stop an individual daemon on an individual machine manually. You need to log in to that particular node and issue these commands there.

Use case: Suppose you have added a new DataNode to your cluster and you need to start the DataNode daemon only on that machine:

$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
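For comparison, here is a minimal sketch of the per-service approach (option 2), run on the master node. It assumes a standard Hadoop 2.x layout with the scripts under $HADOOP_HOME/sbin.

# Start the HDFS daemons (NameNode, SecondaryNameNode, DataNodes) on all nodes
$HADOOP_HOME/sbin/start-dfs.sh

# Start the YARN daemons (ResourceManager, NodeManagers) on all nodes
$HADOOP_HOME/sbin/start-yarn.sh

# Verify which daemons are running on the local node
jps

# Stop everything again, YARN first and then HDFS
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh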

Thursday, February 15, 2018

Troubleshooting Actions Checklist for Hadoop Job

1. Understand the issue. Check to see if the issue exists with all MRv2 jobs. Determine when a particular job or script last worked successfully. Understand the actual and expected behavior and formulate a problem statement.

2. Verify that all components related to MRv2 are running. You can use the ps and jps commands to check whether the processes for the dependent components are running (example commands are shown after this checklist). Ensure that all ports are listening, are bound to a process, and accept connections (i.e., no firewall issues).

3. Look at the job details in the Resource Manager UI.
a. Use the UI to navigate to the job attempt.
b. Look at the log file for the failed attempt.

4. Look at the Resource Manager and the Node Manager log files in the Resource Manager UI, or on the specified nodes.

5. Use the yarn logs command to collect all of the logs of the Containers.

6. Check the Job Configuration in the Resource Manager UI to make sure that all of the desired parameters were actually passed on to the job.

7. Run the MRv2 pi job provided with the HDP examples to see if that job succeeds:
• If it succeeds, check to see if there is a problem with the client or the data.
• If it fails, there is probably some basic problem with the process or the configuration.

8. If the job is run through streaming or pipes, run a similar job to troubleshoot.

9. If the job is started by one of the other HDP components, look at the component-specific guide.

10. Look for the operating system information and verify that it is supported.

11. Search the Hortonworks Knowledge Base for a possible solution.

12. If the issue is still not resolved, log a case in the Hortonworks Support Portal:
a. Provide all of the information gathered in the preceding steps, along with the information in the “Checklist of Items to Collect” list in the following section.
b. Tar the configuration files and the log files and attach them to the case.
c. Inform Hortonworks if it is a Production, Development, or POC environment.
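A few of the steps above map directly to commands. The snippets below are only illustrative sketches; the application ID is a placeholder, and the path to the examples jar is the usual HDP location, which may differ on your cluster.

# Step 2: check which Hadoop daemons are running on a node
jps

# Step 2: check that a daemon port is listening (the ResourceManager web UI port 8088 shown as an example)
netstat -tlnp | grep 8088

# Step 5: collect all container logs for an application
# (replace the application ID with the one shown in the Resource Manager UI)
yarn logs -applicationId application_1234567890123_0001 > app_logs.txt

# Step 7: run the example pi job shipped with HDP
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 1000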

Checklist of Items to Collect:


1. Collect the most recent log files for all of the MRv2 daemons.

2. Get copies of the cluster configuration files, such as core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml (see the sample packaging commands after this list).

3. Provide the number of Data Nodes in the cluster, as well as the total number of nodes.

4. Use the yarn logs command to collect the log files of the Containers for all of the tasks.

5. How was HDP installed -- with Ambari, or manually with RPM?

6. Provide hardware specifications: CPU, memory, disk drives, number of network interfaces.
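When packaging these items for a support case, something like the following works on a typical HDP node. The configuration and log directories shown are the usual HDP defaults and may differ on your installation.

# Tar the Hadoop configuration directory (HDP default location)
tar czf hadoop-conf.tar.gz /etc/hadoop/conf

# Tar the HDFS, YARN, and MapReduce daemon logs (HDP default locations)
tar czf hadoop-logs.tar.gz /var/log/hadoop /var/log/hadoop-yarn /var/log/hadoop-mapreduce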

Saturday, February 10, 2018

Authorization and Authentication In Hadoop

Authentication

Authentication is the process of verifying the identity of a user by obtaining some sort of credentials and checking that those credentials are valid. If the credentials are valid, the authorization process starts. The authentication process always precedes the authorization process.

If Hadoop is configured with all of its defaults, Hadoop doesn’t do any authentication of users. This is an important realization to make, because it can have serious implications in a corporate data center. Let’s look at an example of this.

Let’s say Joe User has access to a Hadoop cluster. The cluster does not have any Hadoop security features enabled, which means that there are no attempts made to verify the identities of users who interact with the cluster. The cluster’s superuser is hdfs, and Joe doesn’t have the password for the hdfs user on any of the cluster servers. However, Joe happens to have a client machine which has a set of configurations that will allow Joe to access the Hadoop cluster, and Joe is very disgruntled. He runs these commands:

sudo useradd hdfs
sudo -u hdfs hadoop fs -rmr /

The cluster goes off and does some work, and comes back and says “Ok, hdfs, I deleted everything!”.

So what happened here? Well, in an insecure cluster, the NameNode and the JobTracker don’t require any authentication. If you make a request, and say you’re hdfs or mapred, the NN/JT will both say “ok, I believe that,” and allow you to do whatever the hdfs or mapred users have the ability to do.

Hadoop has the ability to require authentication, in the form of Kerberos principals. Kerberos is an authentication protocol which uses “tickets” to allow nodes to identify themselves. If you need a more in depth introduction to Kerberos, I strongly recommend checking out the Wikipedia page.

Hadoop can use the Kerberos protocol to ensure that when someone makes a request, they really are who they say they are. This mechanism is used throughout the cluster. In a secure Hadoop configuration, all of the Hadoop daemons use Kerberos to perform mutual authentication, which means that when two daemons talk to each other, they each make sure that the other daemon is who it says it is. Additionally, this allows the NameNode and JobTracker to ensure that any HDFS or MR requests are being executed with the appropriate authorization level.
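On a Kerberos-enabled cluster this looks roughly like the following from a client machine. This is a minimal sketch; the principal joe@EXAMPLE.COM and the realm are just placeholders for illustration.

# Obtain a Kerberos ticket for the user
kinit joe@EXAMPLE.COM

# Confirm that the ticket is in the credential cache
klist

# HDFS commands now run as the authenticated principal;
# without a valid ticket they fail with an authentication error
hadoop fs -ls /user/joe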

Authorization 

Authorization is the process of allowing an authenticated user to access resources by checking whether the user has the necessary access rights. Authorization helps you control access by granting or denying specific permissions to an authenticated user.

Authorization is a much different beast than authentication. Authorization tells us what any given user can or cannot do within a Hadoop cluster, after the user has been successfully authenticated. In HDFS this is primarily governed by file permissions.

HDFS file permissions are very similar to BSD file permissions. If you’ve ever run ls -l in a directory, you’ve probably seen a record like this:

drwxr-xr-x  2 natty hadoop  4096 2012-03-01 11:18 foo
-rw-r--r--  1 natty hadoop    87 2012-02-13 12:48 bar

On the far left, there is a string of letters. The first letter determines whether a file is a directory or not, and then there are three sets of three letters each. Those sets denote owner, group, and other user permissions, and the “rwx” are read, write, and execute permissions, respectively. The “natty hadoop” portion says that the files are owned by natty, and belong to the group hadoop. As an aside, a stated intention is for HDFS semantics to be “Unix-like when possible.” The result is that certain HDFS operations follow BSD semantics, and others are closer to Unix semantics.

The real question here is: what is a user or group in Hadoop? The answer is: they’re strings of characters. Nothing more. Hadoop will very happily let you run a command like

hadoop fs -chown fake_user:fake_group /test-dir

The downside to doing this is that if that user and group really don’t exist, no one will be able to access that file except the superusers, which, by default, includes hdfs, mapred, and other members of the hadoop supergroup.
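The commands for inspecting and adjusting HDFS permissions mirror their Unix counterparts. The path below is just an example directory:

# Show the directory entry itself (owner, group, and permission bits)
hadoop fs -ls -d /test-dir

# Change the permission bits (here: owner rwx, group r-x, other none)
hadoop fs -chmod 750 /test-dir

# Change only the group
hadoop fs -chgrp hadoop /test-dir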

In the context of MapReduce, the users and groups are used to determine who is allowed to submit or modify jobs. In MapReduce, jobs are submitted via queues controlled by the scheduler. Administrators can define who is allowed to submit jobs to particular queues via MapReduce ACLs. These ACLs can also be defined on a job-by-job basis. Similar to the HDFS permissions, if the specified users or groups don’t exist, the queues will be unusable, except by superusers, who are always authorized to submit or modify jobs.
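As a rough sketch, job-level ACLs can be supplied at submission time. The property names mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job are standard MRv2 properties, but the jar, class, and paths below are placeholders, the driver is assumed to use ToolRunner so that the -D options are parsed, and the ACLs only take effect when mapreduce.cluster.acls.enabled is true on the cluster.

# Submit a job whose view ACL allows user natty and group hadoop,
# and whose modify ACL allows only user natty (format: "users groups")
hadoop jar my-job.jar com.example.MyJob \
    -D mapreduce.job.queuename=default \
    -D mapreduce.job.acl-view-job="natty hadoop" \
    -D mapreduce.job.acl-modify-job="natty " \
    input_dir output_dir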

The next question to ask is: how do the NameNode and JobTracker figure out which groups a user belongs to?

When a user runs a hadoop command, the NameNode or JobTracker gets some information about the user running that command. Most importantly, it knows the username of the user. The daemons then use that username to determine what groups the user belongs to. This is done through the use of a pluggable interface, which has the ability to take a username and map it to a set of groups that the user belongs to. In a default installation, the user-group mapping implementation forks off a subprocess that runs id -Gn [username]. That provides a list of groups like this:

natty@vorpal:~/cloudera $ id -Gn natty
natty adm lpadmin netdev admin sambashare hadoop hdfs mapred

The Hadoop daemons then use this list of groups, along with the username, to determine if the user has appropriate permissions to access the file being requested. There are also other implementations that come packaged with Hadoop, including one that allows the system to be configured to get user-group mappings from an LDAP or Active Directory system. This is useful if the groups necessary for setting up permissions are resident in an LDAP system, but not in Unix on the cluster hosts.
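The implementation in use is selected by the hadoop.security.group.mapping property, and you can read it back from the local configuration with getconf. The class names in the comments below are just common examples:

# Print the configured group mapping implementation (as seen by the local client configuration)
hdfs getconf -confKey hadoop.security.group.mapping

# Common values include:
#   org.apache.hadoop.security.ShellBasedUnixGroupsMapping  (shells out to id, as described above)
#   org.apache.hadoop.security.LdapGroupsMapping            (looks groups up in LDAP/Active Directory)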

Something to be aware of is that the set of groups that the NameNode and JobTracker are aware of may be different from the set of groups that a user belongs to on a client machine. All authorization is done at the NameNode/JobTracker level, so the users and groups on the DataNodes and TaskTrackers don’t affect authorization, although they may be necessary if Kerberos authentication is enabled. Additionally, it is very important that the NameNode and the JobTracker both be aware of the same groups for any given user, or there may be undefined results when executing jobs. If there’s ever any doubt about what groups a user belongs to, hadoop dfsgroups and hadoop mrgroups may be used to find out what groups a user belongs to, according to the NameNode and JobTracker, respectively.
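For what it’s worth, on newer Hadoop 2.x releases the HDFS-side check is also exposed as hdfs groups, which asks the NameNode which groups it has resolved for a user:

# Ask the NameNode which groups it has mapped for the user natty
hdfs groups natty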
