Evolution of Couchbase at LinkedIn


Transition to a dedicated team model

In 2017, a dedicated team was finally funded, and I jumped at the chance to work on Couchbase full-time. We were officially called the Caching as a Service team, or the CaaS team, but most people just referred to us as the Couchbase SRE team.

We had three main charters:

  • Centralize management. We would own and operate centralized Couchbase clusters and offer Couchbase as a service for any team to use. We also purchased Enterprise Edition, so that we could run a properly supported version of Couchbase in production and also get access to the Couchbase support team. We would migrate all existing team-owned Community Edition clusters onto our platform.

  • Enhance security. We would integrate with LinkedIn’s existing security libraries and use certificate-based authentication for access to Couchbase buckets. We actually worked with Couchbase on this feature, and certificate-based (X.509) authentication was added in Couchbase Server 5.0. (A client-side connection sketch follows this list.)

  • Improve cost to serve. We would decrease our hardware footprint by packing buckets more tightly into multitenant clusters and by using SSDs (instead of additional nodes) to further reduce cost.
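To make the security charter concrete, here’s a minimal sketch of what certificate-based authentication looks like from a client’s point of view. It uses the vanilla open-source Couchbase Python SDK (4.x-style API) rather than our internal wrapper, and the hostname, file paths, and bucket name are placeholders:

```python
# A minimal sketch of X.509 client-certificate auth with the open-source
# Couchbase Python SDK (4.x-style API). Hostname, file paths, bucket name,
# and document are all placeholders; this is not LinkedIn's internal wrapper.
from datetime import timedelta

from couchbase.auth import CertificateAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# The client cert/key pair identifies the application to the cluster;
# the trust store holds the CA that signed the cluster's certificates.
authenticator = CertificateAuthenticator(
    cert_path="/etc/ssl/app/client.pem",     # placeholder
    key_path="/etc/ssl/app/client.key",      # placeholder
    trust_store_path="/etc/ssl/app/ca.pem",  # placeholder
)

# "couchbases://" (note the trailing s) forces TLS, which cert auth requires.
cluster = Cluster("couchbases://cb.example.com", ClusterOptions(authenticator))
cluster.wait_until_ready(timedelta(seconds=10))

collection = cluster.bucket("profile-cache").default_collection()  # hypothetical bucket
collection.upsert("member:42", {"firstName": "Ada"})
print(collection.get("member:42").content_as[dict])
```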

The team hit the ground running in April 2017, and since then, Ben Weir, Usha Kuppuswamy, Todd Hendricks, Subhas Sinha, and I, led by Hardik Kheskani, have advanced Couchbase at LinkedIn significantly. Notably:

  • We’ve migrated more than 50 different use cases from legacy Community Edition Couchbase clusters to our CaaS platform. We’ve migrated more than 10 different use cases from Memcached (yes, we still had a few stragglers) to our platform. And during all of this, we also onboarded more than 15 new use cases that had not used Couchbase before.

  • We’ve integrated Couchbase more deeply with our standard infrastructure, properly utilizing our in-house topology, deployment, and configuration management systems. In the past, Couchbase was handled as a special snowflake using tooling like Salt and Range.

  • We’ve completely automated upgrades. For the longest time, the majority of our Couchbase clusters were stuck on Couchbase Server 2.2.0 Community because we lacked the automation needed to upgrade a cluster safely. We’ve invested in our tooling so that we can specify in our configs the version of Couchbase we want, and the tooling will gracefully fail each node out of the cluster, upgrade Couchbase Server, and rebalance the node back into the cluster, completely without human intervention (see the sketch after this list).

  • We have widespread use of certificate-based authentication. This not only required coordination with the folks over at Couchbase to properly support certificate-based authentication on both the server and the respective SDKs, but it also required work on our side to integrate with our in-house certificate management system.

  • We also launched a LinkedIn wrapper around the Python SDK. As more and more teams build Python apps that use Couchbase, we thought it was time for a wrapper library to exist in Python as well. It provides niceties like client-side metrics, integration with our configuration management system, client-side encryption, and compatibility with the flags used by our Java wrapper (a minimal example of the pattern follows below).
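To illustrate the automated upgrade flow described above, here’s a simplified sketch that drives the failover/upgrade/rebalance loop through Couchbase Server’s public REST API. This is not our production tooling: the credentials and hostnames are placeholders, upgrade_package() stands in for site-specific package management, and the real automation layers in health checks, retries, and safety gates:

```python
# A simplified sketch of the failover -> upgrade -> rebalance loop, driven
# through Couchbase Server's public REST API. Credentials and hostnames are
# placeholders; upgrade_package() stands in for site-specific packaging.
import time

import requests

ADMIN = ("Administrator", "password")  # placeholder credentials
PORT = 8091


def wait_until_idle(host):
    """Poll until no rebalance (or graceful failover) is in progress."""
    while True:
        r = requests.get(f"http://{host}:{PORT}/pools/default/rebalanceProgress",
                         auth=ADMIN)
        if r.json().get("status") == "none":
            return
        time.sleep(5)


def upgrade_package(node, version):
    """Placeholder: stop the service, install `version`, start it again."""
    raise NotImplementedError("site-specific package management goes here")


def rolling_upgrade(orchestrator, nodes, version):
    known_nodes = ",".join(f"ns_1@{n}" for n in nodes)
    base = f"http://{orchestrator}:{PORT}"
    for node in nodes:
        otp = f"ns_1@{node}"
        # 1. Gracefully fail the node out so it drains without data loss.
        requests.post(f"{base}/controller/startGracefulFailover",
                      auth=ADMIN, data={"otpNode": otp}).raise_for_status()
        wait_until_idle(orchestrator)
        # 2. Upgrade the Couchbase Server package on the failed-over node.
        upgrade_package(node, version)
        # 3. Mark the node for full recovery and rebalance it back in.
        requests.post(f"{base}/controller/setRecoveryType", auth=ADMIN,
                      data={"otpNode": otp, "recoveryType": "full"}).raise_for_status()
        requests.post(f"{base}/controller/rebalance", auth=ADMIN,
                      data={"knownNodes": known_nodes, "ejectedNodes": ""}).raise_for_status()
        wait_until_idle(orchestrator)
```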
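And to give a feel for the shape of the Python wrapper, here’s an illustrative, deliberately minimal example of decorating an SDK collection with client-side latency metrics, with comments marking where the encryption hooks would sit. This shows the general pattern, not our internal library:

```python
# An illustrative wrapper (not LinkedIn's actual library) showing the general
# shape: SDK operations are decorated with client-side latency metrics, and
# comments mark where encryption hooks would plug in.
import time
from collections import defaultdict


class InstrumentedCollection:
    """Wraps a Couchbase SDK collection and records per-operation latencies."""

    def __init__(self, collection):
        self._collection = collection
        self.latencies_ms = defaultdict(list)  # stand-in for a real metrics sink

    def _timed(self, op_name, fn, *args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            self.latencies_ms[op_name].append((time.monotonic() - start) * 1000.0)

    def get(self, key, **kwargs):
        # A decryption hook on the returned value would sit here, mirroring
        # the flag conventions shared with the Java wrapper.
        return self._timed("get", self._collection.get, key, **kwargs)

    def upsert(self, key, value, **kwargs):
        # An encryption hook would transform `value` before it is written.
        return self._timed("upsert", self._collection.upsert, key, value, **kwargs)
```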

Challenges

One of our biggest challenges with spinning up the dedicated team was that we suddenly had a flood of requests from people wanting to get onto our platform ASAP for a large variety of use cases. This was challenging for a couple of reasons:

  • We were already heads down trying to migrate existing Community Edition buckets to our platform.

  • People basically read “Couchbase is being offered as a service!” and came to us wanting N1QL or XDCR or Views. The problem was that Couchbase was brought into LinkedIn for caching, key-value use cases; other internal technologies already covered other kinds of workloads. We’ve been running Couchbase as a cache for years, and we know caching is where it excels. These other features of Couchbase are great and we want to support them one day, but for now, we need to focus on what we are good at while we finish becoming the centralized owners of all Couchbase clusters at LinkedIn.

We know that a large part of this is a communication issue, and we’ve been working hard to deliver a consistent and firm message across the company about exactly which services our team provides.

Looking Forward

The evolution of Couchbase at LinkedIn makes a lot of sense when you look at where it has been over the years. Each step of the process was necessary, and LinkedIn as a company learned a lot about using Couchbase at scale and in high-performance scenarios. We still have many things we want to do with Couchbase here, so there’s no shortage of work. Some aspects we want to focus on are:

  • Improving our multi-tenant solution and reaching better resource utilization.

  • Working towards making our platform completely self-service. We already have an internal self-service tool called Nuage for provisioning data stores, and we are planning to integrate Couchbase with it so that clients can focus on getting their buckets provisioned while we focus on maintaining and automating the infrastructure.

  • Ultimately, making the cache invisible to our clients by partnering with source-of-truth platforms to build tighter integrations (i.e., out-of-the-box invisible caching; a rough sketch of the read-through pattern follows this list).
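For a taste of what that invisible, read-through caching could look like, here’s a rough sketch. fetch_from_source_of_truth() and the TTL are hypothetical placeholders, not a real integration:

```python
# A rough sketch of the read-through pattern under stated assumptions:
# fetch_from_source_of_truth() is a hypothetical call into the source-of-truth
# platform, and the TTL is an arbitrary placeholder.
from datetime import timedelta

from couchbase.exceptions import DocumentNotFoundException


def fetch_from_source_of_truth(key):
    raise NotImplementedError("hypothetical source-of-truth call")


def read_through(collection, key, ttl=timedelta(minutes=10)):
    try:
        # Fast path: serve from the Couchbase cache.
        return collection.get(key).content_as[dict]
    except DocumentNotFoundException:
        # Miss: read from the source of truth, then repopulate the cache
        # so subsequent reads are served from Couchbase.
        value = fetch_from_source_of_truth(key)
        collection.upsert(key, value, expiry=ttl)
        return value
```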

At the end of the day, Couchbase is a caching solution that ended up working for us, and we’re excited to work with Couchbase on a mutually beneficial product roadmap.


