How LinkedIn customizes Apache Kafka for 7 trillion messages per day

LinkedIn’s development workflow

The most important question to answer here is whether to choose the Upstream First route or LinkedIn First route (shown as “Commit to upstream first?” in the flowchart). Based on the urgency of the patch, the author should carefully assess the tradeoffs of both approaches. Typically, patches addressing production issues are committed as hotfixes first, unless they can be committed to upstream quickly (like within a week) and are small enough to be cherry-picked immediately. Feature patches for approved KIPs should go to the upstream branch first.

Patch examples

In this section, we present some of our representative patches, either made to upstream or ones that remain as LinkedIn-internal hotfixes. For patches discussed in the sections below, we plan to attempt upstreaming these patches if this has not already occurred.

Scalability improvements

At LinkedIn, some larger clusters have more than 140 brokers and host one million replicas in a single cluster. With those large clusters, we experienced issues related to slow controllers and controller failure caused by memory pressure. Such issues have a serious impact on production and may cause cascading controller failure, one after another. We introduced several hotfix patches to mitigate those issues—for example, reducing controller memory footprint by reusing UpdateMetadataRequest objects and avoiding excessive logging.

As we increased the number of brokers in a cluster, we also realized that slow startup and shutdown of a broker can cause significant deployment delays for large clusters. This is because we can only take down one broker at a time for deployment to maintain the availability of the Kafka cluster. To address this deployment issue, we added several hotfix patches to reduce startup and shutdown time of a broker (e.g., a patch to improve shutdown time by reducing lock contention). 

Operational improvements

These types of patches are developed to resolve operational issues that arise with Kafka deployments. For example, SREs frequently remove bad brokers (e.g., brokers with a slow/bad disk) from and add new brokers to clusters. During broker removal, we want to maintain the same level of data redundancy to avoid the risk of data loss. To achieve this goal, SREs need to move replicas out of the broker that is going to be removed, prior to the actual removal. However, moving all replicas out of a broker turns out to be very difficult, because new topics are constantly created and may assign replicas on that broker. To address this problem, we introduced the maintenance mode for brokers. When a broker becomes a maintenance broker, it does not get assigned new topic partitions/replicas anymore. This feature enables us to easily move all replicas from a broker to another, and then cleanly take down a broker. 

New features and direct contributions to upstream

With the upstream-first approach mentioned above, we contribute directly to upstream and later bring patches back into LinkedIn when a new release branch including those patches is created. Some of the recent major contributions from LinkedIn to upstream include:

  • KIP-219: Improve quota communication
  • KIP-380: Detect outdated control requests and bounced brokers using broker generation
  • KIP-291: Separating controller connections and requests from the data plane
  • KIP-354: Add a Maximum Log Compaction Lag

We also have added several new features that do not already exist in Apache Kafka, including: 

Creating a new release branch

So far, we have presented examples of patches or features that are included in the LinkedIn Kafka release branches. You may now wonder how LinkedIn creates a new release branch. We start with branching off from an Apache Kafka release branch (e.g., 2.3.0 branch to create LinkedIn Kafka 2.3.0.x branch). After that, we move hotfix patches from the previous LinkedIn release branch (e.g., 2.0.0.x branch) that are yet to be committed upstream, to the new LinkedIn Kafka branch. The diagram below depicts this process:

Source link