Optimization strategies for Venice read path
Venice Client: D2 sticky routing with router cache
According to the operating experience of Venice, we noticed that several customers encountered hotkey problems, which are caused by a skewed access pattern, and this hotkey issue has made the quota allocation much harder, since the quota is per storage node. To alleviate this problem, we introduced the router caching layer.
Internally, Venice Client internally uses D2 as the load balancer and the built-in sticky routing feature inside D2 is useful for caching. D2 will try its best to send the same request to the same instance. With this routing strategy, each router will only serve a unique set of keys based on key hashing, so the cache inside each router should have minimal duplication, and the whole router cluster will work as a distributed cache layer.
Based on our experiment, the hotkeys hit the router cache all the time, and the latency of cache-hit requests is reduced by about 40% compared to cache-miss requests.
Venice Router: Pool of connection pools, router-to-storage sticky routing, and long-tail retry
To relay client requests to a storage node, Venice Router used to maintain a global connection pool to alleviate connection setup overhead. This connection pool was shared by all the requests coming to the same router, and connection checking in/out was synchronized by the same lock, which caused a lot of contention during high QPS. We didn’t figure out a good way to avoid this lock contention by using one single connection pool, so we tried another strategy: using a pool of connection pools. With this strategy, the lock contention behavior was mitigated, and both the throughput and latency were improved.
Venice is a distributed storage system, and it maintains multiple replicas for each partition. Without enforcing any scattering pattern, each replica will serve a full set of key space, which causes the memory inefficiency of the Venice Storage Node. By enabling the sticky routing feature in Router (similar idea as D2 sticky routing), each replica only needs to serve a small portion of key space, which has demonstrably improved the memory efficiency of the Venice Storage Node.
Whenever the application is implemented in Java, GC will be a problem to tune all the time. The Venice Storage Node also has GC issues during high-volume data ingestion, and GC tuning will be a long-term effort. GC hurts the end-to-end latency during GC pause. To mitigate this GC impact, we have adopted a long-tail retry strategy when GC in Storage Node happens. The basic idea is to retry the request to another Storage Node hosting the same partition when the waiting time of the original request exceeds a predefined threshold. With this feature, the P99 latency has been reduced greatly and becomes more predictable than before. Long GC pauses are not happening all the time, but just during large volumes of data ingestion, and the long-tail retry feature gets triggered during long GC pauses, which is why P99 latency looks much better now.
Venice Storage Node: BDB-JE shared cache
BDB-JE allows each database to setup the size of the cache, which could be used to store the index, and maybe data entries as well for fast lookup. Initially, Venice was assigning fixed-size caches to every store, which was tedious and vulnerable because the required cache size varied from time to time, and it is also very hard to assign the optimal cache, size since it depends on both the actual database size and access pattern. To avoid this complexity, Venice modified the behavior to only allocate a big chunk of cache, which is being shared across all the stores hosted in the same node. This way, the cache allocation is much simpler and can be utilized efficiently even when only a small number of stores receive high volumes of read requests.
GC pauses in Venice Storage Node
We haven’t gotten to the bottom of the GC issue in the Venice Storage Node, and it is worth exploring in the future, since it still causes some slightly high latency for P99.
Latency improvement for batch-get request
So far, most of the above optimizations only apply to single-get request, and Venice cannot make any latency guarantees about batch-get requests. We would like to optimize for batch-get latency in the near future.
Performance tuning is hard, and is a continuous process. As data volume and traffic increases, we may need to adjust those optimization strategies accordingly in the future. For now, Venice is very close to the original goals, and we will continue this effort to make it more stable and efficient.
Brainstorming/verifying these performance tuning strategies has been a team effort. Here I would like to thank my colleagues Charles Gao, Felix GV, Mathew Wise, Sidian Wu, and Yan Yan for their invaluable contributions and feedback. Special thanks to Jiangjie Qin, Dong Lin, and Sumant Tambe from the Kafka team for their input and suggestions. Thanks to the leadership team for supporting our efforts: Siddharth Singh, Ivo Dimitrov, and Swee Lim.