Building a dynamic and responsive Pinterest – Pinterest Engineering – Medium


Bo Liu | Engineering Manager, Serving Systems team

In 2015, the majority of content on Pinterest was pregenerated for users before they logged in. It was stored statically in HBase and served directly when they entered the service. (More details can be found in the blog post Building a smarter home feed.)

Although our earlier architecture helped us grow to 100 million monthly active users by 2015, it had several weak points that prevented us from building more dynamic and responsive products. For example, it was hard to experiment with different ideas and models on different components in the system. Since content was pregenerated, the features used to rank candidates could be weeks old, and we couldn’t leverage the most recent Pin/Board/User data, let alone real-time user actions. We also had to pregenerate and store content for every user, including those who never returned. Moreover, we were constantly running a large number of concurrent experiments and needed to pregenerate and store content for each experiment. The storage cost was huge.

Though a dynamic and responsive Pinterest is more attractive to Pinners, it imposes demanding requirements on our backend systems.

To solve the technical problems, we built nine different systems that collectively power today’s fully dynamic and responsive Pinterest products. Though these systems were built for Pinterest, they solve problems common to many web-scale consumer-facing content distribution applications. Here, we’ll discuss these systems and how they changed the backend architectures powering the major Pinterest products, including Following Feed, Interest Feed, and Picked For You (recommendations).

Challenges

First, the following graph, Owner-to-Board mappings, and Board-to-Pin mappings need to be stored and updated in real time. Though we have this information in our MySQL and HBase clusters, querying those data stores takes too long for online applications (more than one round trip per request, large fanout, and large data scans).

Another big challenge was the lack of a high-performance machine learning ranking system that could rank thousands of pieces of content per request with a P99 latency in the low tens of milliseconds. To provide feature data to this ML ranking system, stateful services with batch and real-time update support are needed. These stateful services must provide not only a KV data model, but also more complicated data models such as counts and lists. They must also answer queries with single-digit-millisecond P99 latency, given the latency requirement of the ML ranking system that depends on them.

Lastly, we need a candidate generation system to provide high-quality candidate sets in real-time for content recommendations.

Following feed

Fig. 1: Former Following Feed

Fig. 1 depicts following feed circa 2015. Whenever a Pin was saved, an asynchronous task was enqueued to Pinlater, an asynchronous job execution system. This task would first retrieve all (direct and indirect) followers for the Pin’s board from MySQL via the follower service, and then it would send the (follower list, Pin) to a smart feed worker. The smart feed worker would leverage Pinnability to score the Pin for every follower and then insert the list of (follower, score, Pin) into HBase. When a user came to Pinterest, we scanned HBase with the user id as the prefix to extract the Pins with the largest scores for this user. This implementation consumed unnecessarily large storage space and made it hard to leverage fresh signals and experiment with new ideas. Additionally, the long latency for fetching the follower list (steps 2 and 3) and ranking (step 5) prevented us from experimenting online.

Fig. 2: Current Following Feed

In the current version of the following feed shown in Fig. 2, components in green are newly built systems, while the database shape indicates stateful services (we use the same conventions throughout this post). The mapping from user to directly- and indirectly-followed boards is stored and updated in real time in Apiary, which supports low-latency queries suitable for online applications. Board-to-Pin mappings and some lightweight Pin data are stored and updated in real time in Polaris, which supports not only board-based Pin retrieval but also filtering with a passed-in bloom filter and lightweight scoring to select high-quality Pins.

When a request is sent to Feed Generator to fetch a user’s following feed, it concurrently fetches real-time signals for the user, boards followed by the user, and Pins seen by the user recently from RealPin, Apiary and Aperture, respectively.
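The concurrent fetch described above could be sketched roughly as follows. This is only an illustration in Python, not the production implementation: the three `fetch_*` callables are hypothetical stand-ins for the RealPin, Apiary and Aperture clients, whose real interfaces are not described in this post.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_feed_inputs(user_id, fetch_signals, fetch_boards, fetch_impressions):
    """Issue the three independent lookups in parallel and wait for all of them.

    The callables are hypothetical stand-ins for the RealPin (signals),
    Apiary (followed boards) and Aperture (impression history) clients.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        signals = pool.submit(fetch_signals, user_id)
        boards = pool.submit(fetch_boards, user_id)
        seen = pool.submit(fetch_impressions, user_id)
        # Total latency is roughly that of the slowest call, not the sum.
        return signals.result(), boards.result(), seen.result()


# Usage with in-memory stubs in place of real services.
signals, boards, seen = fetch_feed_inputs(
    "user-1",
    lambda uid: {"recent_clicks": 3},
    lambda uid: ["board-a", "board-b"],
    lambda uid: {"pin-1"},
)
```

Because the three lookups do not depend on each other, running them in parallel keeps the request latency close to that of the slowest backend.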

RealPin is a highly customizable object retrieval system providing a rich data model for object storage and data aggregation along the time dimension. We customized RealPin to track and serve real-time user signals. Aperture was first designed for content deduping; it stores all user events, including backend and frontend Pin impressions, and returns impression history of up to a few months for any user, all with single-digit-millisecond P99 latency. Aperture was later adapted to serve the ads user action counting use case as described in the blog post Building a real-time user action counting system for ads.

Feed Generator then sends Polaris the board list and a bloom filter built from the user’s impression history. After retrieving, filtering and applying lightweight scoring, Polaris returns a list of Pins to Feed Generator. Lastly, this list of Pins and the real-time user signals are sent to Scorpion for a second pass of full scoring. Scorpion is a unified online ML ranking platform powering the majority of Pinterest ML models in production. Counter service and Rockstore sit underneath Scorpion and provide count signals, user data, Pin data, and so on. Note that Scorpion aggressively caches static feature data in local memory; this static data makes up most of the feature data required by ML models. Scorpion is sharded to achieve a cache hit rate over 90%.
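To make the Polaris step more concrete, here is a minimal Python sketch of retrieval by board, filtering against a bloom filter of seen Pins, and a lightweight first-pass score. The `BloomFilter` class, the `select_pins` function, the filter size and the scoring function are all assumptions made for illustration; the post does not describe Polaris's actual data structures or interface.

```python
import hashlib


class BloomFilter:
    """Small illustrative bloom filter (sizes are arbitrary assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def select_pins(board_ids, board_to_pins, seen, light_score, limit):
    """Retrieve Pins for the boards, drop seen ones, keep the best `limit`."""
    candidates = {pin for b in board_ids for pin in board_to_pins.get(b, ())}
    fresh = [pin for pin in candidates if pin not in seen]
    # Highest lightweight score first; ties broken by Pin id for determinism.
    fresh.sort(key=lambda pin: (-light_score(pin), pin))
    return fresh[:limit]


# Usage: the user has already seen pin-1, so it is filtered out.
seen = BloomFilter()
seen.add("pin-1")
boards = {"board-a": ["pin-1", "pin-2"], "board-b": ["pin-3"]}
scores = {"pin-1": 0.9, "pin-2": 0.2, "pin-3": 0.7}
top = select_pins(["board-a", "board-b"], boards, seen, scores.get, limit=10)
```

A bloom filter never reports a seen Pin as unseen, so deduping stays correct; the cost is a small chance of dropping an unseen Pin, which is acceptable when there are thousands of candidates.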

Interest feed

Fig. 3: Former Interest Feed

Fig. 3 depicts the 2015 version of the interest feed architecture, which is similar to the following feed described in Fig. 1. The main differences were that content generation was triggered by daily jobs and that source data was stored in HBase instead of MySQL.

Fig. 4: Current Interest Feed

Picked for you (recommendations)

Fig. 5: Former Picked For You (recommendations)

The “Picked For You” architecture from 2015 is depicted in Fig. 5. Its content generation was also triggered by daily jobs. A periodic offline job generated a list of boards for each user as the seeds for Pin recommendation. These seed boards were batch uploaded to Terrapin for serving.

Fig. 6: Current Picked For You (recommendations)

Fig. 6 depicts the current “Picked For You” architecture. Pixie is a new service built for real-time board recommendation. It periodically loads into memory an offline-generated graph consisting of boards and Pins. When recommended boards are requested for a user, a random walk is simulated in the Pixie graph, using the Pins the user has engaged with as starting points. (More can be found in the blog post Introducing Pixie, an advanced graph-based recommendation system.) The rest of the systems are similar to those in the following feed.
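The idea of a random walk over a board/Pin graph can be illustrated with a short Python sketch. This is a simplified random walk with restarts written for this explanation, not Pixie's actual algorithm; the function names, the restart probability, the step count and the fixed seed are all assumptions.

```python
import random
from collections import Counter, defaultdict


def build_graph(edges):
    """Build both adjacency maps from (pin, board) pairs."""
    pin_to_boards, board_to_pins = defaultdict(list), defaultdict(list)
    for pin, board in edges:
        pin_to_boards[pin].append(board)
        board_to_pins[board].append(pin)
    return pin_to_boards, board_to_pins


def recommend_boards(pin_to_boards, board_to_pins, start_pins,
                     steps=2000, restart_prob=0.5, top_k=3, seed=0):
    """Count board visits on a walk that keeps restarting at engaged Pins."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    starts = [p for p in start_pins if pin_to_boards.get(p)]
    if not starts:
        return []
    visits = Counter()
    pin = rng.choice(starts)
    for _ in range(steps):
        board = rng.choice(pin_to_boards[pin])      # hop Pin -> board
        visits[board] += 1
        if rng.random() < restart_prob:
            pin = rng.choice(starts)                # restart near the user
        else:
            pin = rng.choice(board_to_pins[board])  # hop board -> Pin
    return [board for board, _ in visits.most_common(top_k)]


# Usage: boards in a disconnected part of the graph are never recommended.
edges = [("p1", "b1"), ("p2", "b1"), ("p2", "b2"), ("p3", "b3")]
pin_to_boards, board_to_pins = build_graph(edges)
recs = recommend_boards(pin_to_boards, board_to_pins, ["p1"])
```

Boards that are reachable through many short paths from the user's engaged Pins accumulate the most visits, which is why visit counts can serve as a relevance signal.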

Brief technical discussions

Throughout the process of building systems from scratch to enable fully dynamic and responsive Pinterest products, we needed to make design decisions with reasonable tradeoffs, solve technical problems, and optimize systems to meet the latency requirement for online applications.

Back in 2015, the majority of Pinterest’s backend systems were implemented in Java, and we hadn’t built any systems in C++. As we can see from the previous sections, the new systems must achieve low long-tail latency with big fanout (sometimes all shard fanout), and some systems are CPU intensive (e.g., Scorpion, Pixie and RealPin). We chose to adopt C++11, FBThrift, Folly and RocksDB to build these systems. It was slow at the beginning since we had to install all dependencies, build several basic facility libraries (like stats reporting, request routing, etc.), and set up our build & release environment. It eventually paid off as we used our new foundation to get more and more efficient and effective systems built across the company.

RocksDB is an embedded storage engine. To build sharded and replicated distributed systems on top of it, data replication was the first problem we needed to solve. We started with the write-to-all-replica approach and later moved to master-slave replication. Our systems run on AWS, which models its network as Regions, Availability Zones and Placement Groups. Notably, the bill for cross-AZ network traffic is significant. We built a prefix-based AZ-aware traffic routing library that minimizes cross-AZ traffic and supports all possible routing patterns (e.g., single shard, multiple shard and all shard fanout). The library also monitors TCP connection health and gracefully fails over requests among replicas. One thing to note is that we needed to leverage the TCP_USER_TIMEOUT socket option to fail fast when the OS on a remote peer crashes. It is not uncommon for a VM instance on AWS to become unreachable for various reasons without shutting down its TCP connections. If TCP_USER_TIMEOUT is not set, a typical TCP implementation could take over 10 minutes to report the issue to user space applications. (More details about data replication and traffic routing can be found in the Rocksplicator GitHub repo and the blog post Open-sourcing Rocksplicator, a real-time RocksDB data replicator.)
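As an illustration of the TCP_USER_TIMEOUT setting, the Python sketch below enables it on a socket. TCP_USER_TIMEOUT is a Linux-specific option whose value is in milliseconds; it bounds how long transmitted data may remain unacknowledged before the kernel closes the connection. The helper name `enable_fail_fast` and the 5000 ms value are assumptions for this example, not the settings used in the routing library.

```python
import socket


def enable_fail_fast(sock, user_timeout_ms=5000):
    """Ask the kernel to abort the connection if sent data stays
    unacknowledged for longer than `user_timeout_ms` milliseconds.

    Returns False where the option is unavailable (non-Linux platforms).
    The default timeout is an illustrative value only.
    """
    option = getattr(socket, "TCP_USER_TIMEOUT", None)
    if option is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, option, user_timeout_ms)
    except OSError:
        return False
    return True


# Usage: set the option before connecting to a replica.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as conn:
    applied = enable_fail_fast(conn)
```

With the option set, a write to a peer that has silently disappeared fails after the configured timeout, so the client can fail over to another replica instead of waiting for the default retransmission limit.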

Over 10% of Pinterest’s total AWS instances run our systems. To reduce the operational overhead and service downtime, we integrated Apache Helix (a cluster management framework open sourced by LinkedIn) with Rocksplicator. (More details can be found in the blog post Automated cluster management and recovery for Rocksplicator.)

We made numerous optimizations and tuning adjustments when implementing and productionizing these systems. For instance, we needed to tune the number of RocksDB compaction threads and set L0 and L1 to be the same size to reduce write amplification and improve write throughput.
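One way to reason about matching L0 and L1 sizes is the rule of thumb from the RocksDB tuning guide: steady-state L0 size is roughly `write_buffer_size * min_write_buffer_number_to_merge * level0_file_num_compaction_trigger`, and `max_bytes_for_level_base` sets the target size of L1. The Python sketch below just does that arithmetic. The option names are real RocksDB options, but the helper `level_sizing` and the numeric values are hypothetical and are not the settings used in production.

```python
def level_sizing(write_buffer_size, min_write_buffer_number_to_merge,
                 level0_file_num_compaction_trigger):
    """Return option values that make the L1 target match steady-state L0."""
    l0_bytes = (write_buffer_size
                * min_write_buffer_number_to_merge
                * level0_file_num_compaction_trigger)
    return {
        "write_buffer_size": write_buffer_size,
        "min_write_buffer_number_to_merge": min_write_buffer_number_to_merge,
        "level0_file_num_compaction_trigger": level0_file_num_compaction_trigger,
        # L1 target size chosen equal to the estimated L0 size.
        "max_bytes_for_level_base": l0_bytes,
    }


# Illustrative values: 64 MiB memtables, compaction triggered at 4 L0 files.
options = level_sizing(64 * 1024 * 1024, 1, 4)
```

When L1 is much larger than L0, each L0-to-L1 compaction has to rewrite a large amount of L1 data, which is one source of the write amplification mentioned above.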

One of the counter service clusters must support returning tens of thousands of counts for a single request with a P99 latency of less than 20 ms. To achieve that, we switched to the RocksDB plain table format and configured it to behave essentially like an in-memory hash table. We also had to switch to float16 to reduce data size and manually encode lists of returned counts into binary strings to save serialization and deserialization overhead. For big requests, counter service may also leverage multiple threads to process a single request.
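The float16 packing idea can be sketched in a few lines of Python using the standard `struct` module, whose `e` format code is IEEE 754 half precision. The little-endian layout and the function names here are assumptions; the post does not specify the actual wire format.

```python
import struct


def encode_counts(counts):
    """Pack counts as consecutive little-endian float16 values (2 bytes each).

    Values above 65504 do not fit in float16 and raise OverflowError, and
    integers above 2048 are rounded to the nearest representable value.
    """
    return struct.pack("<%de" % len(counts), *counts)


def decode_counts(blob):
    """Inverse of encode_counts: unpack a byte string into a list of floats."""
    return list(struct.unpack("<%de" % (len(blob) // 2), blob))


# Usage: five counts occupy 10 bytes instead of 40 bytes as float64.
blob = encode_counts([0, 1, 2.5, 1024, 2048])
counts = decode_counts(blob)
```

Sending one opaque byte string rather than a list of individually serialized numbers avoids per-element overhead in the RPC layer, at the cost of reduced precision for large counts.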

Though RealPin was designed as an object retrieval system, we customized it to run as an online scoring system for several months. However, we noticed it was a challenge to operate the system given it co-locates computation and storage. This issue became more serious as we started to use more types of feature data. Ultimately, we developed the new Scorpion system, which separates computation from storage and caches feature data on the computation nodes.

As Scorpion is CPU intensive and runs on big clusters, we needed to invest heavily in optimizing it. We carefully tuned the Scorpion threading model to achieve a good tradeoff between high concurrency and low context switch or synchronization overhead. The sweet spot is not fixed, as it depends on many factors, such as whether data is fetched from memory, local disk or RPC. We optimized the in-memory LRU caching module to achieve zero-copy; i.e., cached data is fed into ML models without any data copying. Batch scoring was implemented to allow GCC to better utilize SIMD instructions. Decision trees in GBDT models are compacted and carefully laid out in memory to achieve better CPU cache hit rates. Thundering herds on cache misses are avoided by object-level synchronization.
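The thundering-herd protection can be illustrated with a per-key lock: when many threads miss on the same key at once, only one of them loads the value and the rest wait for it. The Python sketch below is an assumption about how object-level synchronization might look in general; it is not Scorpion's C++ cache, and it leaves out LRU eviction and zero-copy entirely.

```python
import threading
import time


class SingleFlightCache:
    """Cache in which concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the two dicts above

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:  # only threads asking for this key wait here
            with self._guard:
                if key in self._values:  # another thread already loaded it
                    return self._values[key]
            value = self._loader(key)    # slow fetch runs without the guard
            with self._guard:
                self._values[key] = value
                self._key_locks.pop(key, None)
            return value


def demo(num_threads=8):
    """Hit one cold key from several threads; return (load calls, results)."""
    calls, results = [], []

    def slow_loader(key):
        time.sleep(0.05)  # stands in for a remote feature fetch
        calls.append(key)
        return key.upper()

    cache = SingleFlightCache(slow_loader)
    threads = [threading.Thread(target=lambda: results.append(cache.get("pin:1")))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return calls, results
```

Locking per key rather than globally means a slow load for one object does not block lookups for unrelated objects.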

Data stored in Aperture is bucketed along the time dimension. We use a frozen data format for old data, which is immutable and suitable for fast read access. More recent data is stored in a mutable format that is efficient for updates. The max_successive_merges RocksDB option was leveraged to limit the number of merge operands in the memtable for the same key. This setting is critical for Aperture to achieve low read latency, since it may otherwise need to read a RocksDB key with a large number of merge operands, which are expensive to process at read time. To save storage space, RocksDB was configured to use different compression policies and multiplier factors for different levels.
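The time-bucketing idea, with mutable recent buckets and immutable older ones, can be sketched as a toy in-memory structure in Python. Everything here (the class name, one-day buckets, sets of Pin ids) is an assumption for illustration; Aperture's real formats are built on RocksDB and are not described in detail in this post.

```python
from collections import defaultdict


class BucketedHistory:
    """Per-user impression history split into fixed-width time buckets."""

    def __init__(self, bucket_seconds=86400):
        self.bucket_seconds = bucket_seconds
        self.mutable = defaultdict(set)  # bucket index -> set, cheap to update
        self.frozen = {}                 # bucket index -> frozenset, read-only

    def record(self, pin_id, ts):
        bucket = ts // self.bucket_seconds
        if bucket in self.frozen:
            raise ValueError("bucket is already frozen")
        self.mutable[bucket].add(pin_id)

    def freeze_before(self, ts):
        """Convert every bucket older than the one containing `ts`."""
        cutoff = ts // self.bucket_seconds
        for bucket in [b for b in self.mutable if b < cutoff]:
            self.frozen[bucket] = frozenset(self.mutable.pop(bucket))

    def seen_since(self, ts):
        """Pins seen in any bucket at or after the bucket containing `ts`."""
        first = ts // self.bucket_seconds
        seen = set()
        for store in (self.frozen, self.mutable):
            for bucket, pins in store.items():
                if bucket >= first:
                    seen |= pins
        return seen


# Usage: three impressions on three different days, then freeze old days.
day = 86400
history = BucketedHistory()
history.record("pin-1", 0 * day + 10)
history.record("pin-2", 1 * day + 10)
history.record("pin-3", 2 * day + 10)
history.freeze_before(2 * day)
recent = history.seen_since(1 * day)
```

Bucketing by time also makes expiry cheap, since a whole bucket can be dropped once it falls outside the retention window.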

If these types of challenges interest you, join our team! And keep an eye on this blog for more posts and information on the open sourcing of some of our systems.

Acknowledgements: In addition to the Serving Systems team, we’d like to thank our client teams for providing useful feedback and helping build and adopt the systems: Home Feed, Related Pins, Ads, Interest, Applied Science, Visual Search, Spam, etc.


