Thursday, March 15, 2018

Apache Ranger

HDFS is a core part of any Hadoop deployment, and to ensure that data in the Hadoop platform is protected, security needs to be baked into the HDFS layer. HDFS is protected with Kerberos for authentication, and with POSIX-style permissions/HDFS ACLs or Apache Ranger for authorization.

Apache Ranger is a centralized security administration solution for Hadoop that enables administrators to create and enforce security policies for HDFS and other Hadoop platform components.

How do Ranger policies work for HDFS?

To ensure security in HDP environments, we recommend that all of our customers implement Kerberos, Apache Knox, and Apache Ranger.

Apache Ranger offers a federated authorization model for HDFS. The Ranger plugin for HDFS checks for Ranger policies, and if a matching policy exists, access is granted to the user. If no policy exists in Ranger, Ranger falls back to the native permission model in HDFS (POSIX permissions or HDFS ACLs). This federated model applies to the HDFS and YARN services in Ranger.



For other services such as Hive or HBase, Ranger operates as the sole authorizer, which means only Ranger policies are in effect. The fallback model is configured using a property under Ambari → Ranger → HDFS config → Advanced ranger-hdfs-security.
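In most HDP releases this fallback is controlled by the xasecure.add-hadoop-authorization property (the underlying file is ranger-hdfs-security.xml); verify the exact property name in your Ranger version before changing it. A minimal sketch of the setting:

<property>
  <name>xasecure.add-hadoop-authorization</name>
  <!-- true = fall back to HDFS POSIX permissions/ACLs when no Ranger policy matches -->
  <value>true</value>
</property>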

The federated authorization model enables customers to safely implement Ranger in an existing cluster without affecting jobs that rely on POSIX permissions. We recommend enabling this option as the default model for all deployments.

Ranger’s user interface makes it easy for administrators to find out which permission (Ranger policy or native HDFS) provided access to a user. Simply navigate to Ranger → Audit and look at the Access Enforcer column of the audit data. If the value in the Access Enforcer column is “ranger-acl”, a Ranger policy provided the access; if the value is “hadoop-acl”, access was granted by a native HDFS ACL or POSIX permission.


BEST PRACTICES FOR HDFS AUTHORIZATION
Having a federated authorization model may create a challenge for security administrators looking to plan a security model for HDFS.

After Apache Ranger and Hadoop have been installed, we recommend that administrators implement the following steps:

Change the HDFS umask to 077
Identify directories that can be managed by Ranger policies
Identify directories that need to be managed by HDFS native permissions
Enable a Ranger policy to audit all records
Here are the steps again in detail.

1. Change the HDFS umask to 077 from 022. This prevents any new files or folders from being accessed by anyone other than the owner.
Administrators can change this property via Ambari:
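Under the hood, this Ambari setting maps to the fs.permissions.umask-mode property in the Hadoop configuration; as a rough sketch, the change looks like this in the config file (the exact config group in Ambari can vary by version):

<property>
  <name>fs.permissions.umask-mode</name>
  <!-- 077 strips group and other permissions from newly created files and directories -->
  <value>077</value>
</property>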


The umask default value in HDFS is configured to 022, which grants all users read permission on newly created HDFS folders and files. You can check this by running the following command on a recently installed Hadoop cluster:

$ hdfs dfs -ls /apps
Found 3 items
drwxrwxrwx   - falcon hdfs          0 2015-11-30 08:02 /apps/falcon
drwxr-xr-x   - hdfs   hdfs          0 2015-11-30 07:56 /apps/hbase
drwxr-xr-x   - hdfs   hdfs          0 2015-11-30 08:01 /apps/hive

2. Identify the directories that can be managed by Ranger policies.

We recommend that permissions for application data folders (/apps/hive, /apps/hbase), as well as any custom data folders, be managed through Apache Ranger. The native HDFS permissions for these directories need to be restrictive; this can be done by changing the permissions in HDFS using chmod.

Example:

$ hdfs dfs -chmod -R 000 /apps/hive
$ hdfs dfs -chown -R hdfs:hdfs /apps/hive
$ hdfs dfs -ls /apps/hive
Found 1 items
d---------   - hdfs hdfs          0 2015-11-30 08:01 /apps/hive/warehouse

Then navigate to the Ranger admin UI and give explicit permissions to users as needed. For example:
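As a sketch, the same policy could also be created through Ranger's public REST API instead of the UI; the service name, user, host, and credentials below are placeholders, and the JSON fields should be verified against your Ranger version:

$ cat hive-warehouse-policy.json
{
  "service": "cluster_hadoop",
  "name": "hive-warehouse-access",
  "isAuditEnabled": true,
  "resources": { "path": { "values": ["/apps/hive/warehouse"], "isRecursive": true } },
  "policyItems": [
    { "users": ["hive"],
      "accesses": [ { "type": "read",    "isAllowed": true },
                    { "type": "write",   "isAllowed": true },
                    { "type": "execute", "isAllowed": true } ] }
  ]
}
$ curl -u admin:<password> -H "Content-Type: application/json" \
       -X POST http://<ranger-host>:6080/service/public/v2/api/policy \
       -d @hive-warehouse-policy.json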


Administrators should follow the same process for other data folders as well. You can validate whether your changes are in effect by doing the following (see the example after this list):

Connect to HiveServer2 using beeline
Create a table
create table employee( id int, name String, ssn String);
Go to Ranger and check the HDFS access audit. The Access Enforcer should be ‘ranger-acl’
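As a minimal walkthrough (the HiveServer2 host, port, and Kerberos principal are placeholders to adapt to your cluster):

$ beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@<REALM>"
0: jdbc:hive2://...> create table employee( id int, name String, ssn String);

Then open Ranger → Audit, filter on the /apps/hive/warehouse path, and confirm that the Access Enforcer column shows ranger-acl for the resulting HDFS operations.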

3. Identify the directories that should be managed by HDFS permissions. It is recommended to let HDFS manage the permissions for the /tmp and /user folders. These are used by applications and jobs that create user-level directories.

Here, you should also set the initial permissions on the folders under /user to “700”, similar to the example below:

$ hdfs dfs -ls /user
Found 4 items
drwxrwx---   - ambari-qa hdfs          0 2015-11-30 07:56 /user/ambari-qa
drwxr-xr-x   - hcat      hdfs          0 2015-11-30 08:01 /user/hcat
drwxr-xr-x   - hive      hdfs          0 2015-11-30 08:01 /user/hive
drwxrwxr-x   - oozie     hdfs          0 2015-11-30 08:02 /user/oozie

$ hdfs dfs -chmod -R 700 /user/*
$ hdfs dfs -ls /user
Found 4 items
drwx------   - ambari-qa hdfs          0 2015-11-30 07:56 /user/ambari-qa
drwx------   - hcat      hdfs          0 2015-11-30 08:01 /user/hcat
drwx------   - hive      hdfs          0 2015-11-30 08:01 /user/hive
drwx------   - oozie     hdfs          0 2015-11-30 08:02 /user/oozie

4. Ensure auditing for all HDFS data.
Auditing in Apache Ranger is controlled through policies. When Apache Ranger is installed through Ambari, a default policy is created for all files and directories in HDFS with the audit option enabled. This policy is also used by the Ambari smoke test user “ambari-qa” to verify the HDFS service through Ambari. If administrators disable this default policy, they need to create a similar policy to enable auditing across all files and folders.
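For reference, the default policy is roughly equivalent to the sketch below, shown in the shape of a Ranger policy export; the policy name, users, and exact fields vary by version, so treat this only as an illustration:

{
  "service": "cluster_hadoop",
  "name": "all - path",
  "isAuditEnabled": true,
  "resources": { "path": { "values": ["/"], "isRecursive": true } },
  "policyItems": [
    { "users": ["ambari-qa"],
      "accesses": [ { "type": "read",    "isAllowed": true },
                    { "type": "write",   "isAllowed": true },
                    { "type": "execute", "isAllowed": true } ] }
  ]
}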

Summary:

Securing HDFS files through permissions is a starting point for securing Hadoop. Ranger provides a centralized interface for managing security policies for HDFS. We recommend that security administrators use a combination of HDFS native permissions and Ranger policies to provide comprehensive coverage for all potential use cases. Using the best practices outlined in this blog, administrators can simplify the access control policies for administrative and user directories and files in HDFS.

Thursday, March 1, 2018

Hadoop Admin Interview Question Answer -3

Q 1. In the Hadoop ecosystem, we have HDFS, ZooKeeper, YARN/MapReduce2, Hive, Spark, and Oozie. In what sequence should the services be started, from first to last?
Ans: ZooKeeper, HDFS, YARN/MapReduce2, Hive, Spark...

Q 2. What services do you use for authentication and authorization?
Ans: We use Kerberos for authentication and ACLs for authorization.
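As a quick illustration of HDFS ACLs for authorization, an extra user can be granted access to a directory without changing its owner or group (the user and path are just examples):

$ hdfs dfs -setfacl -m user:analyst1:r-x /data/reports
$ hdfs dfs -getfacl /data/reports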

Q 3. What is the size of your cluster and what are the services you use?
Ans: The cluster has 10 hosts:
6 DataNodes, 2 Edge Nodes, 2 NameNodes
6 DataNode hosts with 12 TB of disk each
Block size = 64 MB, replication factor = 3
Raw capacity: 12 TB * 6 hosts = 72 TB
Cluster capacity in MB: 72 * 1,000,000 MB = 72,000,000 MB
Raw storage needed per block: 64 MB per block * 3 replicas = 192 MB per block
Total number of blocks the cluster can hold: 72,000,000 / 192 = 375,000 blocks
Actual data = 13 TB, which with 3 replicas consumes about 39 TB of raw storage, well within the recommended ~70% utilization of the 72 TB raw capacity (about 50 TB).
Roughly 20-35 GB of data arrives per day, and we keep the last 12 months of data.
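To check the actual capacity and usage on a live cluster instead of estimating, you can run:

$ hdfs dfs -df -h /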

Note: Kindly correct me if I am wrong; this is only a rough sketch.

Q 4. What is the architecture of Hive?
Ans: https://selecthadoop.blogspot.in/search/label/Hive

Q 5. What are producers, consumers, and brokers in Kafka?

Q 6. Execution of a Hadoop job.
Ans: 1. The client application submits a job to the ResourceManager.
2. The ResourceManager takes the job from the job queue and allocates it to an ApplicationMaster. It also manages and monitors resource allocations to each ApplicationMaster and container on the data nodes.
3. The ApplicationMaster divides the job into tasks and allocates them to the data nodes.
4. On each data node, a NodeManager manages the containers in which the tasks run.
5. The ApplicationMaster asks the ResourceManager to allocate more resources to particular containers, if necessary.
6. The ApplicationMaster keeps the ResourceManager informed about the status of the jobs allocated to it, and the ResourceManager keeps the client application informed.
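To observe this flow in practice, you can submit one of the bundled example jobs and then track it through YARN; the jar path below is from an HDP install and differs between distributions:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar pi 10 100
$ yarn application -list
$ yarn application -status <application-id>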

Q 7. What are the components of YARN?

Q 8. What are your roles and responsibilities?
Ans: https://selecthadoop.blogspot.in/search/label/Daily%20Activities%20of%20Hadoop%20Admin

Q 9. What happens to the active NameNode when a standby NameNode becomes active?
Ans: Hi Hadoopers, kindly help.

Q 10. What are the views in Ambari? Does the Files view browse the local directory and the Hadoop directory?
Ans: Ambari provides views (UIs) for executing Hive queries and Pig scripts and for transferring files from the local file system to HDFS and vice versa. Yes, the Files view browses the local directory and the Hadoop directory.

Q 11. How does Hadoop save a 100 MB file if the block size is 64 MB?
Ans: Hadoop saves the 100 MB file in 2 blocks: the first block is 64 MB and the second block is only 36 MB.
Hadoop stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
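You can see the block breakdown of a specific file with fsck (the path is just an example):

$ hdfs fsck /data/sample_100mb.dat -files -blocks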

Q 12. If we have 4 DataNodes and the replication factor is 3, how do we decommission 2 DataNodes from the cluster?
Ans:

Q 13. What are the steps to upgrade the Hadoop cluster? What changes need to be made?
Ans:

Q 14: What happens when we do not give a snapshot name?
Ans: The snapshot name is an optional argument. When it is omitted, a default name is generated from a timestamp with the format "'s'yyyyMMdd-HHmmss.SSS", e.g. "s20130412-151029.033".
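For example, assuming the directory /data has already been made snapshottable, creating a snapshot without a name produces the timestamp-based default (the path is a placeholder):

$ hdfs dfsadmin -allowSnapshot /data
$ hdfs dfs -createSnapshot /data
$ hdfs dfs -ls /data/.snapshot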

Q 15: In a Kerberized Hadoop cluster, what are the troubleshooting steps when a user is unable to log in to the cluster?
Ans:

Q 16: Why do we use HDFS for applications having large data sets and not when there are lots of small files?
Ans: HDFS is more suitable for a large amount of data stored in a few large files than for the same amount of data spread across many small files. This is because the NameNode keeps the metadata for every file and block in memory, so a large number of small files fills NameNode memory with an unnecessary amount of metadata. When the same data is stored in a single large file, the NameNode needs far less metadata. Hence, for optimized performance, HDFS favors large data sets over many small files.

Q 17: The Web UI shows that half of the DataNodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?
Ans: Decommissioning means the NameNode is copying the replicas held on those DataNodes over to the remaining DataNodes. It is not yet safe to remove them: if the administrator removes the DataNodes before decommissioning has finished, data can be lost, especially when many nodes are removed at once.
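Before physically removing the nodes, you can track the decommissioning progress from the command line; the dfsadmin report includes a decommission status line for each DataNode:

$ hdfs dfsadmin -report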
