Kerberos is an authentication protocol that has become the standard for implementing authentication in Hadoop clusters.
Hadoop, by default, does not perform any authentication, which can have severe consequences in corporate data centers. To overcome this limitation, Kerberos, which provides a secure way to authenticate users, was introduced into the Hadoop ecosystem.
Kerberos is a network authentication protocol developed at MIT that uses “tickets” to allow nodes to prove their identity.
Hadoop uses the Kerberos protocol to ensure that whoever makes a request is who they claim to be.
In secure mode, all Hadoop nodes use Kerberos for mutual authentication: when two nodes talk to each other, each verifies that the other node is who it says it is.
Kerberos uses secret-key cryptography to provide authentication for client-server applications.
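Secure mode is switched on through Hadoop's configuration. As a minimal sketch, the two standard properties below, set in core-site.xml, enable Kerberos authentication and service-level authorization:

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>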
Kerberos in Hadoop
A client performs three steps when using Hadoop with Kerberos:
Authentication: The client first authenticates itself to the Authentication Server, which issues a timestamped Ticket-Granting Ticket (TGT) to the client.
Authorization: The client then uses the TGT to request a service ticket from the Ticket-Granting Server.
Service Request: On receiving the service ticket, the client directly interacts with the Hadoop cluster daemons such as NameNode and ResourceManager.
The Authentication Server and the Ticket-Granting Server together form the Key Distribution Center (KDC) of Kerberos.
The client performs the authorization and service request steps on the user’s behalf.
The authentication step is carried out by the user through the kinit command, which prompts for a password.
We don’t need to enter a password every time we run a job because the Ticket-Granting Ticket lasts for 10 hours by default and is renewable for up to a week.
If we don’t want to be prompted for a password at all, we can create a Kerberos keytab file using the ktutil command.
The keytab file stores the password and is supplied to kinit with the -t option.
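As a minimal sketch of these commands (the principal alice@EXAMPLE.COM, the encryption type, and the keytab name are hypothetical placeholders):

kinit alice@EXAMPLE.COM        # prompts for the password and obtains a TGT
klist                          # shows the cached TGT and its expiry time

# Building a keytab so that no password prompt is needed later;
# the next three commands run at the ktutil: prompt:
ktutil
addent -password -p alice@EXAMPLE.COM -k 1 -e aes256-cts-hmac-sha1-96
wkt alice.keytab
quit

# Authenticating non-interactively with the keytab:
kinit -k -t alice.keytab alice@EXAMPLE.COM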
2. Transparent Encryption in HDFS
For data protection, Hadoop HDFS implements transparent encryption. Once it is configured, data read from and written to special HDFS directories is encrypted and decrypted transparently, without requiring any changes to user application code.
This encryption is end-to-end, which means that only the client encrypts or decrypts the data. HDFS never stores or has access to unencrypted data or unencrypted data encryption keys, which satisfies both at-rest and in-transit encryption requirements.
At-rest encryption refers to the encryption of data when data is on persistent media such as a disk.
In-transit encryption means encryption of data when data is traveling over the network.
HDFS encryption enables existing Hadoop applications to run transparently on encrypted data.
This HDFS-level encryption also prevents filesystem-level and OS-level attacks.
Architecture Design
Encryption Zone (EZ): A special directory whose contents are transparently encrypted on write and transparently decrypted on read.
Encryption Zone Key (EZK): Every encryption zone has a single EZK, specified when the zone is created.
Data Encryption Key (DEK): Every file in an EZ has its own unique DEK, which is used to encrypt and decrypt the file’s data. HDFS never handles DEKs directly.
Encrypted Data Encryption Key (EDEK): The encrypted form of a DEK, which is what HDFS actually handles. The client asks the KMS to decrypt the EDEK and then uses the resulting DEK to read or write data.
Key Management Server (KMS): The KMS is responsible for providing access to stored EZKs, generating new EDEKs for storage on the NameNode, and decrypting EDEKs for use by HDFS clients.
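As a minimal sketch (the key name my_ez_key and the path /secure are hypothetical placeholders), an administrator first creates an EZK in the KMS and then marks an empty directory as an encryption zone:

hadoop key create my_ez_key        # create an EZK in the KMS
hdfs dfs -mkdir /secure            # the zone directory must be empty
hdfs crypto -createZone -keyName my_ez_key -path /secure
hdfs crypto -listZones             # verify the new encryption zone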
The transparent encryption in HDFS works in the following manner:
When a new file is created in an EZ, the NameNode asks the Key Management Server (KMS) to generate a new EDEK, that is, a DEK encrypted with the zone’s EZK.
This EDEK is stored on the NameNode as part of the file’s metadata.
When a file within an encryption zone is read, the NameNode provides the client with the file’s EDEK, along with the EZK version that was used to encrypt it.
The client then asks the KMS to decrypt the EDEK. The KMS first checks whether the client has permission to access that encryption zone key version; if so, it decrypts the EDEK and returns the DEK, which the client uses to decrypt the file’s contents.
All these steps take place automatically through interactions between the Hadoop HDFS client, the NameNode, and the KMS.
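Once the zone is set up, clients use it like any other HDFS directory; continuing with the hypothetical /secure zone from above:

hdfs dfs -put report.csv /secure/    # contents are encrypted transparently on write
hdfs dfs -cat /secure/report.csv     # contents are decrypted transparently on read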
3. HDFS File and Directory Permissions
To authorize a user, Hadoop HDFS checks file and directory permissions after the user has been authenticated.
The HDFS permission model is very similar to the POSIX model. Every file and directory in HDFS has an owner and a group.
The files or directories have different permissions for the owner, group members, and all other users.
For files, r is the permission to read the file and w is the permission to write or append to it (the x permission is ignored for files).
For directories, r is the permission to list the content of the directory, w is the permission to create or delete files/directories, and x is the permission to access a child of the directory.
To prevent anyone other than the superuser, the directory owner, or the file owner from deleting or moving files within a directory, we can set the sticky bit on that directory.
The owner of a file/directory is the user identity of the client process that created it, and its group is inherited from the parent directory.
Also, every client process that accesses HDFS has a two-part identity: a user name and a group list.
HDFS performs the permission check for a file or directory accessed by a client as follows:
If the user name of the client process matches the owner of the file or directory, HDFS tests the owner permissions;
Otherwise, if the group of the file/directory matches any member of the client process’s group list, HDFS tests the group permissions;
Otherwise, HDFS tests the other permissions of the file/directory.
If the permissions check fails, then the client operation fails.
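As a minimal sketch (the paths, the user alice, and the group analytics are hypothetical placeholders), these permissions are managed with the POSIX-style commands of the hdfs dfs shell:

hdfs dfs -ls /data                      # shows owner, group, and rwx permission bits
hdfs dfs -chown alice:analytics /data   # change the owner and group
hdfs dfs -chmod 750 /data               # rwx for the owner, r-x for the group, none for others
hdfs dfs -chmod +t /shared              # set the sticky bit on a shared directory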
Tools for Hadoop Security
The Hadoop ecosystem contains several tools that support Hadoop security. The two major Apache open-source projects among them are Knox and Ranger.
1. Knox
Knox is a REST API-based perimeter security gateway that performs authentication, monitoring, auditing, authorization management, and policy enforcement for Hadoop clusters. It generally authenticates user credentials against LDAP or Active Directory, and it allows only successfully authenticated users to access the Hadoop cluster.
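As a minimal sketch (the gateway host knox-host, the default topology, and the credentials are hypothetical placeholders), a client reaches WebHDFS through the Knox gateway instead of contacting the NameNode directly:

# List the HDFS root directory through Knox; -k skips TLS verification in a test setup
curl -k -u alice:password 'https://knox-host:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'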
2. Ranger
It is an authorization system that grants or denies access to Hadoop cluster resources, such as HDFS files and Hive tables, based on predefined policies. A user request is assumed to be already authenticated by the time it reaches Ranger. Ranger provides separate authorization functionality for different Hadoop components such as YARN, Hive, and HBase.
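Ranger policies are usually managed in its admin UI, but they can also be created through its public REST API. The sketch below is illustrative only, assuming Ranger's v2 public API; the host, credentials, service name, path, and user are hypothetical placeholders:

# Create a policy that grants user alice read access to /data in HDFS
curl -u admin:admin -X POST -H 'Content-Type: application/json' \
  'http://ranger-host:6080/service/public/v2/api/policy' \
  -d '{
    "service": "hdfs_service",
    "name": "data-read-policy",
    "resources": { "path": { "values": ["/data"], "isRecursive": true } },
    "policyItems": [ {
      "users": ["alice"],
      "accesses": [ { "type": "read", "isAllowed": true } ]
    } ]
  }'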