Open sourcing Kube2Hadoop: Secure access to HDFS from Kubernetes

Co-authors: Cong Gu, Abin Shahab, Chen Qiang, Keqiu Hu

Editor’s note: This post was updated on June 10, 2020

LinkedIn AI has traditionally been Hadoop/YARN based, and we operate one of the world’s largest Hadoop data lakes, with over 4,500 users and 500PB of data. In the last few years, Kubernetes has also become very popular at LinkedIn for Artificial Intelligence (AI) workloads. Adoption at the company started as a proof of concept for Jupyter notebooks, and it has now become a key piece of our model training and model serving infrastructure.

By default, there is a gap between the security model of Kubernetes and Hadoop. Specifically, Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography, to ensure that any clients accessing the cluster are who they claim to be. To avoid frequent authentication checks against a Kerberos server, delegation tokens, a lightweight two-party authentication method, were introduced to complement Kerberos authentication. By default, a Hadoop delegation token has a lifespan of one day and can be renewed for up to seven days. Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs. Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.
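
To make the one-day and seven-day defaults concrete, the sketch below works out how many renewals a job of a given length would need. It is a simplified model and not Hadoop's implementation: the function names are hypothetical, and it assumes each renewal happens exactly at the moment the token would otherwise expire.

```python
from datetime import datetime, timedelta

# Defaults quoted in the text: each renewal keeps the token valid for one day,
# and the token cannot live past seven days from the time it was issued.
RENEW_INTERVAL = timedelta(days=1)
MAX_LIFETIME = timedelta(days=7)


def token_expiry(issued_at, last_renewed_at):
    """Return when the token stops being valid, given its last renewal time."""
    hard_limit = issued_at + MAX_LIFETIME
    return min(last_renewed_at + RENEW_INTERVAL, hard_limit)


def renewals_needed(issued_at, job_end):
    """Count the renewals needed to keep one token valid until job_end.

    Returns None when the job outlives the maximum lifetime, in which case
    a fresh token would have to be fetched instead.
    """
    hard_limit = issued_at + MAX_LIFETIME
    if job_end > hard_limit:
        return None
    count = 0
    expiry = issued_at + RENEW_INTERVAL
    while expiry < job_end:
        count += 1
        expiry = min(expiry + RENEW_INTERVAL, hard_limit)
    return count
```

Under this model, a job that runs for three and a half days needs three renewals, while a job that runs longer than seven days cannot be covered by a single token at all.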

To allow Kubernetes workloads to securely access HDFS, we built Kube2Hadoop, a scalable and secure integration with HDFS Kerberos. This enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods with access control through a user account or a headless account. Headless accounts are often used to represent a virtual team whose members share the same data across their projects. The data acquired can then be used in their model exploration and training with Kubeflow components such as the tf-operator and mpi-operator. In this blog, we will describe the design and authentication model of Kube2Hadoop.

Open source

Since the introduction of Hadoop to the open source community, HDFS has been a widely adopted distributed file system in the industry because of its scalability and robustness. With the growing popularity of running model training on Kubernetes, it is natural for many people to want to leverage the massive amount of data that already exists in HDFS. We think that Kube2Hadoop will benefit both the Kubernetes and Hadoop communities. And today, we are also pleased to announce that we are open sourcing this solution! You can find the source code in our GitHub repository.

How does Kube2Hadoop work?

To ensure secure HDFS access for AI workloads running on Kubernetes at LinkedIn, we built a Kubernetes-native solution: Kube2Hadoop. It consists of three parts: 

  1. Hadoop Token Service, for fetching delegation tokens, deployed as a Kubernetes Deployment;
  2. Kube2Hadoop Init Container in each worker pod as a client for sending requests to fetch a delegation token from Hadoop Token Service;
  3. IDDecorator (see further below) for writing an authenticated user-ID deployed as a Kubernetes Admission Controller.
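
To illustrate the third component, here is a minimal sketch of how a mutating admission controller could stamp the authenticated username onto an incoming object. It is not the IDDecorator source code. The annotation key and function names are hypothetical; the request and response fields (`uid`, `userInfo.username`, `patchType`, base64-encoded `patch`) follow the Kubernetes `AdmissionReview` format.

```python
import base64
import json

# Hypothetical annotation key; the key used by the real IDDecorator may differ.
USER_ANNOTATION = "example.com/authenticated-user"


def escape_json_pointer(segment):
    """Escape one path segment per RFC 6901 ('~' -> '~0', '/' -> '~1')."""
    return segment.replace("~", "~0").replace("/", "~1")


def decorate_with_user(admission_review):
    """Build an AdmissionReview response that records who submitted the object."""
    request = admission_review["request"]
    # The API server fills in userInfo after authenticating the caller,
    # so this value does not come from the user-supplied manifest.
    username = request["userInfo"]["username"]
    metadata = request["object"].get("metadata", {})

    if "annotations" in metadata:
        # JSON Patch "add" on an existing member replaces it, so a value
        # the user tried to set under this key is overwritten.
        patch = [{
            "op": "add",
            "path": "/metadata/annotations/" + escape_json_pointer(USER_ANNOTATION),
            "value": username,
        }]
    else:
        # The annotations map does not exist yet, so create it in one step.
        patch = [{
            "op": "add",
            "path": "/metadata/annotations",
            "value": {USER_ANNOTATION: username},
        }]

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

The design point in this sketch is that the username is taken from the identity the API server has already verified, and any annotation of the same name in the submitted manifest is replaced rather than trusted.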

The following diagram shows an overview of what a typical workflow would look like for the user:

  1. User performs a login to a Hadoop Gateway machine using their own credentials.
  2. User receives a certificate from a client authentication service.
  3. Using the obtained certificate, the user submits a job on the gateway to Kubernetes (K8s) cluster.
  4. The K8s API Server authenticates the user with the certificate and launches containers as requested. The Kube2Hadoop init containers are attached to each of the worker containers that require HDFS access. The init container then sends a request to the Hadoop Token Service for the delegation token.
  5. The Token Service, which acts as a Hadoop superuser (contains a superuser keytab), proxies as the user to fetch the delegation token.
  6. The returned token is mounted locally in the container.
  7. Once the training starts for the workers, they can seamlessly access HDFS using the fetched token.
  8. The Hadoop Token Service puts a watch on the status of each job to cancel the token when the job finishes and renew the token for long-running jobs.
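
The token lifecycle in step 8 can be sketched as a small decision function. This is only an illustration under stated assumptions, not the Hadoop Token Service's actual logic: the `TrackedToken` type, the `next_action` function, and the one-hour renewal margin are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical safety margin: renew once the token is this close to expiring.
RENEW_MARGIN = timedelta(hours=1)


@dataclass
class TrackedToken:
    job_name: str
    expires_at: datetime  # current expiry, extended by each renewal
    max_date: datetime    # hard limit past which the token cannot be renewed


def next_action(token, job_finished, now):
    """Decide what to do with a token: 'cancel', 'renew', or 'wait'."""
    if job_finished:
        # The job no longer needs HDFS access, so revoke the token early.
        return "cancel"
    close_to_expiry = token.expires_at - now <= RENEW_MARGIN
    can_extend = token.expires_at < token.max_date
    if close_to_expiry and can_extend:
        return "renew"
    return "wait"
```

A watcher built on this function would call it each time a job's status changes or a timer fires, and then issue the corresponding cancel or renew request to Hadoop.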
