Tian-Ying Chang | Pinterest tech lead, Storage & Caching
HBase is the backbone of several critical services at Pinterest. It serves more than 10 petabytes data at 10M+ QPS. The majority of the traffic is from real-time online requests that require low latency, so it’s critical that clusters are high performing with high availability. In order to get the most out of HBase and ensure the best experience for Pinners, we upgraded our branch from 0.94.26 to version 1.2. However, we found the process of upgrading would take at least several hours. Since we can’t shut down Pinterest each time we upgrade an HBase cluster, we built a system to support the upgrade and rollback with zero downtime.
Before the upgrade, we were running a private branch based on HBase 0.94.26, but this version was officially deprecated in early 2015. HBase 1.2 has more than 5,000 completed JIRA cases over 0.94.26 which includes important bug fixes and new features that improve robustness, scaling, performance, MTTR (mean time to recovery) and more. If we upgraded to version 1.2, we’d realize major long-term benefits.
However, upgrading from version 0.94.x to 0.96.x or higher isn’t supported due to many incompatible changes including both APIs and wire format. This means 0.94.x clusters must be shut down completely, including slave clusters if replication is enabled, and clients have to be shut down and upgraded at the same time as a result of incompatible API changes. The whole process can take several hours or longer. If rollback is required, the clusters and the clients will have to be completely shut down again, and so we built a system that supports the upgrade and rollback with no downtime.
The ultimate goal is that the clients are simply switched from connecting to 0.94 cluster to 1.2 at the moment of upgrade, with all the preparations done beforehand. The connections to 0.94 is enabled to be read-only. Once it’s verified that data are completely synced between 0.94 and 1.2, the client is instructed to connect to 1.2 dynamically via configuration changes. The whole process should be transparent to services that read from and write to cluster. Rollback should follow the same process, but in the opposite direction by switching client connections from 1.2 back to 0.94 cluster.
This requires a bi-directional heterogeneous replication between 0.94 and 1.2 clusters, where the source data are encoded in a format that the sink can recognize and decode into a format native to its version. This allows the data to be synced between the two clusters in real time. Next, a dual-client auto-detects the version of connected cluster and correspondingly encodes the requests and decodes the responses in formats that are native to the version.
We adopted a Thrift-based replication patch from Yahoo! and Flurry, which adds the option to set replication protocol between a specific master-slave pair as Thrift. Then, Thrift-based replication RPC is used instead of the incompatible native replication RPC between 0.94 and higher versions. The original patch is for 0.94.26 and 0.98.10. We made a few changes to make it work in our private 1.2 branch.
It’s essential to verify data correctness between replicated clusters. We built a tool (called checkr) that scans every row in the master, verifies a matched row exists in the slave cluster, checks all the cells in the two rows having same data and vice versa. The process is pretty straightforward if there are no writes to the clusters during verification, otherwise the data may differ since HBase replication is asynchronous. A typical case is that data changes in the master may not be applied to the slave when checkr reads it, due to replication lag. Thus, checkr has the latest copy of data from the master but out of date copy from the slave. To solve this problem, checkr will only compare HBase cells whose data are not recently modified, such as within the last N seconds, where N is configurable. We ran checkr several times at different time points. If no data inconsistency is found in all the runs, replication lag is less than N seconds and no cells were updated in every N seconds, we are confident the data are consistent between the clusters via replication.
Asynchbase is a fully asynchronous and highly performant HBase client. Most importantly, it’s compatible with HBase from versions 0.92 to 0.94 and 0.96 to 1.0. We had been using Asyncbase and keeps a private branch with in-house APIs and other changes. It was fully tested to verify that:
- It works with both 0.94 and 1.2 clusters, including our private features like batch get, small scan, etc.
- There are no issues when switching the backend between 0.94 and 1.2
- There is no performance degradation.
Note that Asynchbase is not a simple drop-in for HTable client. Besides the differences in API, the asynchronous nature of every call means that you either attach a callback to process the response when it’s received, or simply wait for it.
HBase performance, especially low latency, is super important to Pinterest since we serve real-time online requests for several critical services. We must ensure 1.2 cluster’s performance is at least as good as 0.94. It’s ideal that a 0.94 cluster and its test 1.2 cluster have the same data and receive the same requests so that we can compare their performance apply-to-apple.
We set up a dark write/read environment where the applications send the same requests to the test 1.2 cluster but return immediately without processing the responses (in addition to sending requests that go to 0.94 production cluster and processing the responses). We started the test with our largest cluster, both in its data size and its QPS. The assumption was if we see no performance degradation, other smaller clusters would also be fine.
A considerable amount of time was invested in tuning Garbage Collection (GC) for low latency. CMS still gives better latency than G1 in our environments. Here’s a list of relevant JVM options for region servers which run on AWS i2.2xlarge instances.
HBase features and configurations also got tried and tuned. For example, bucket cache enabled by default resulted in worse read latency, because bucket cache in 1.x involves extra copy and memory garbage. We had to disable it and can’t fully utilize the 60GB memory in i2.2xlarge instances (but it still gives us better read latency than 0.94). We’re currently looking at backport offheap read path from 2.0 which could dramatically improve read latency.
We saw better write latency in 1.2–at first. After two new heavy use cases were added to the cluster, which dramatically increased write QPS, write latency in 1.2 became worse than 0.94. After some debugging, we found the latency increase came from WAL sync. The sync code was refactored 1.x. We found it doesn’t do batch well which causes much more frequent WAL sync operations. We lowered the value of
hbase.regionserver.hlog.syncer.count from five (default) to one. This increases the batch size, reduces the WAL sync operations and gives better write latency than 0.94.
Monitoring and tools
HBase 1.2 introduces many new metrics. Some metrics in 0.94 are renamed in 1.2, so a new monitoring dashboard and alerting system was built for 1.2. The previous system for 0.94 will phase out once all the clusters are upgraded.
Admin APIs undergo big changes between 0.94.26 and 1.2. Our internal tools, including region balancing and partial compaction, were also upgraded to work in 1.2. There are additional changes in file structure between 0.94.26 and 1.2. Data backup and recovery tools must be upgraded correspondingly and comprehensively tested.
The upgrade went as planned and the client was pointed to 1.2 cluster, like a failover. It was completely transparent to the services built on top of HBase. We measured the latencies of the service’s APIs, and the performance gain was huge, improving by 124–800 percent. Ultimately, the 1.2 cluster is more robust and has fewer operational issues than 0.94.
Acknowledgements: Many thanks to Jeremy Carrol and Chenjin Liang who worked on this project, as well as Xun Liu for his valuable performance tuning advice.