Air Traffic Controller: Member-First Notifications at LinkedIn


Steps a notification goes through in ATC

 

Request decoration
One of the first steps in the ATC process is getting the metadata needed to make a notification decision. This metadata may include the member’s historical notification profile, the devices they have the LinkedIn app installed on, and their notification settings. ATC gets this information from multiple local stores (RocksDB) and from external services. ATC also logs the current notification request to keep building the member’s historical notification profile.
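As a rough illustration, here is a minimal sketch of what request decoration could look like, assuming hypothetical store and service interfaces; the field names and schema are not ATC’s actual ones.

```java
// Hypothetical sketch of request decoration: field, store, and service names
// are illustrative, not ATC's actual schema.
import java.util.List;

record NotificationRequest(long recipientId, String type, String contentUrn) {}
record NotificationHistory(List<String> recentNotificationUrns) {}
record MemberSettings(boolean pushEnabled, boolean emailEnabled) {}
record DecoratedRequest(NotificationRequest request,
                        NotificationHistory history,
                        List<String> devices,
                        MemberSettings settings) {}

interface LocalStore<V> { V get(long memberId); }                   // e.g., backed by RocksDB
interface SettingsService { MemberSettings fetch(long memberId); }  // remote call

class RequestDecorator {
    private final LocalStore<NotificationHistory> historyStore;
    private final LocalStore<List<String>> deviceStore;
    private final SettingsService settingsService;

    RequestDecorator(LocalStore<NotificationHistory> historyStore,
                     LocalStore<List<String>> deviceStore,
                     SettingsService settingsService) {
        this.historyStore = historyStore;
        this.deviceStore = deviceStore;
        this.settingsService = settingsService;
    }

    DecoratedRequest decorate(NotificationRequest request) {
        // Local lookups: historical notification profile and registered devices.
        NotificationHistory history = historyStore.get(request.recipientId());
        List<String> devices = deviceStore.get(request.recipientId());
        // Remote lookup: settings that are not available in a local store.
        MemberSettings settings = settingsService.fetch(request.recipientId());
        // The request itself would also be logged to extend the member's history.
        return new DecoratedRequest(request, history, devices, settings);
    }
}
```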

Channel selection
As mentioned previously, LinkedIn can notify members through multiple channels: email, SMS, desktop notification, in-app notification, and push notification. While most members use the LinkedIn app, LinkedIn also has a number of other apps in its mobile ecosystem (e.g., LinkedIn Lite, LinkedIn Sales Navigator, etc.). ATC ensures the notification gets sent through the best channel and app to create the best possible member experience. This decision is made based on various factors, including business logic, member settings, and relevance models predicting a member’s click and notification disable rates on the fly. For example, if a member has the app installed and has enabled push for “Updates about your network,” ATC will begin to send push notifications instead of emails for notifications about their network activity.
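A minimal sketch of such a channel decision might look like the following; the settings names, threshold, and scoring inputs are illustrative assumptions, not ATC’s actual rules.

```java
// Illustrative channel-selection logic; thresholds and settings names are assumptions.
enum Channel { PUSH, IN_APP, EMAIL, NONE }

record MemberSettings(boolean pushEnabledForNetworkUpdates, boolean emailEnabled) {}

class ChannelSelector {
    private final double minPushScore;

    ChannelSelector(double minPushScore) { this.minPushScore = minPushScore; }

    /**
     * Picks a channel from member settings, app availability, and a relevance
     * score predicting the member's likelihood to click on the notification.
     */
    Channel select(MemberSettings settings, boolean hasAppInstalled, double clickScore) {
        if (hasAppInstalled && settings.pushEnabledForNetworkUpdates() && clickScore >= minPushScore) {
            return Channel.PUSH;       // member opted in and the model predicts engagement
        }
        if (hasAppInstalled) {
            return Channel.IN_APP;     // visible on next app open, without interrupting
        }
        return settings.emailEnabled() ? Channel.EMAIL : Channel.NONE;
    }
}
```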

Aggregation
Receiving excessive notifications is annoying. No one wants to open their email and see hundreds of individual emails from LinkedIn. Based on a member’s historical notification profile, ATC can delay specific notifications, place them in a notification queue backed by RocksDB, and then send a group of notifications together as one notification at a later time. For example, if ATC schedules a number of member-to-member invitation reminders to be sent in a week, ATC will group these reminders together and send them in one email. Aggregation rules can be defined via business logic, relevance scores, and member settings (e.g., weekly digest emails). In addition, ATC can rank the aggregated notifications and surface the most relevant or important ones to the top by leveraging relevance signals, which ATC stores from offline push jobs.
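A sketch of this delay-then-flush pattern follows, with an illustrative store interface standing in for the RocksDB-backed queue.

```java
// Illustrative aggregation: queue per-member notifications, then flush them as a
// single digest ranked by relevance. Store and field names are assumptions.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record PendingNotification(String contentUrn, double relevanceScore, long scheduledAtMillis) {}
record Digest(long memberId, List<PendingNotification> items) {}

interface PendingStore {                          // conceptually a RocksDB-backed queue
    void append(long memberId, PendingNotification n);
    List<PendingNotification> drain(long memberId);
}

class Aggregator {
    private final PendingStore store;

    Aggregator(PendingStore store) { this.store = store; }

    /** Delay an individual notification instead of sending it immediately. */
    void enqueue(long memberId, PendingNotification n) {
        store.append(memberId, n);
    }

    /** At flush time (e.g., the weekly digest), group and rank the queued items. */
    Digest flush(long memberId) {
        List<PendingNotification> items = new ArrayList<>(store.drain(memberId));
        items.sort(Comparator.comparingDouble(PendingNotification::relevanceScore).reversed());
        return new Digest(memberId, items);       // most relevant items surface first
    }
}
```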

Delivery time optimization (DTO)
When requests are placed in the notification queue, we can optimize for the best time to deliver each notification. ATC leverages the member’s locale information and sends the notification at a time when the member is most likely to engage with it. For example, we don’t want to send a member a notification about who’s viewed their profile while they are sleeping, but we may want members to have the Daily Rundown first thing in the morning or during their commute to work.
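For example, here is a minimal DTO sketch that targets a preferred local hour, assuming the member’s time zone is known; the 8am window is an illustrative choice, not ATC’s policy.

```java
// Illustrative delivery-time optimization using the member's time zone.
// The preferred delivery hour (8am local) is an assumption for the sketch.
import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class DeliveryTimeOptimizer {
    private static final LocalTime PREFERRED_LOCAL_TIME = LocalTime.of(8, 0);

    /** Returns the next instant at or after {@code now} that falls at 8am in the member's zone. */
    Instant nextDeliveryTime(Instant now, ZoneId memberZone) {
        ZonedDateTime local = now.atZone(memberZone);
        ZonedDateTime candidate = local.with(PREFERRED_LOCAL_TIME);
        if (candidate.isBefore(local)) {
            candidate = candidate.plusDays(1);   // 8am already passed today; deliver tomorrow
        }
        return candidate.toInstant();
    }
}
```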

Reminder
At times, we may want to remind a member about a new job opportunity or an important connection request. If the member hasn’t clicked on the notification or the item in LinkedIn, ATC may send a reminder notification at a later time.
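A small sketch of such a reminder check, with a hypothetical click-tracking interface and an assumed reminder delay:

```java
// Illustrative reminder check: names and the click-tracking interface are assumptions.
import java.time.Duration;
import java.time.Instant;

interface ClickTracker {
    /** Whether the member has clicked the original notification or its content on-site. */
    boolean hasActedOn(long memberId, String contentUrn);
}

class ReminderPolicy {
    private final ClickTracker clicks;
    private final Duration delay;

    ReminderPolicy(ClickTracker clicks, Duration delay) {
        this.clicks = clicks;
        this.delay = delay;
    }

    /** A reminder is sent only if the member has not acted by the scheduled reminder time. */
    boolean shouldRemind(long memberId, String contentUrn, Instant sentAt, Instant now) {
        return now.isAfter(sentAt.plus(delay)) && !clicks.hasActedOn(memberId, contentUrn);
    }
}
```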

Filtering
Filtering is the guardrail for a member’s notification experience. To provide our members with the best possible experience when interacting with LinkedIn, we may want to prevent some notifications from being sent. A few reasons we might block a notification from being sent to a member include the following (a minimal filtering sketch follows the list):

  • Member has already interacted (e.g., liked, commented, etc.) with the content on-site.

  • A duplicate notification is being processed.

  • The content of the notification has expired.

  • The member would otherwise be overloaded with notifications; ATC rate-limits upstream applications to prevent them from accidentally spamming members.
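Here is a sketch combining these checks into a single filter; the store and rate-limiter interfaces are hypothetical, not ATC’s actual implementation.

```java
// Illustrative filter chain for the checks listed above; interfaces are assumptions.
import java.time.Instant;

record Notification(long recipientId, String requestId, String contentUrn,
                    Instant expiresAt, String sourceApp) {}

interface InteractionStore { boolean hasInteracted(long memberId, String contentUrn); }
interface DedupeStore      { boolean alreadyProcessed(String requestId); }
interface RateLimiter      { boolean overLimit(String sourceApp, long recipientId); }

class NotificationFilter {
    private final InteractionStore interactions;
    private final DedupeStore dedupe;
    private final RateLimiter rateLimiter;

    NotificationFilter(InteractionStore interactions, DedupeStore dedupe, RateLimiter rateLimiter) {
        this.interactions = interactions;
        this.dedupe = dedupe;
        this.rateLimiter = rateLimiter;
    }

    /** Returns true if the notification should be dropped rather than delivered. */
    boolean shouldDrop(Notification n, Instant now) {
        return interactions.hasInteracted(n.recipientId(), n.contentUrn()) // already seen on-site
            || dedupe.alreadyProcessed(n.requestId())                      // duplicate request
            || now.isAfter(n.expiresAt())                                  // content expired
            || rateLimiter.overLimit(n.sourceApp(), n.recipientId());      // upstream app too chatty
    }
}
```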

Relevance-based decision making
Initially, ATC made decisions for each of the previously mentioned features using business rules. However, to deliver more personalized notifications for our members, we embed machine learning models into multiple stages of processing. For example, for several types of notifications, ATC scores the notification based on the recipient member’s historical notifications and historical actions to predict the likelihood that this member will act on or disable the notification. With this score, ATC can choose in real time whether to drop the notification, send it only as an in-app notification, or send both an in-app and a push notification. Similarly, ATC uses relevance to optimize a member’s notification portfolio. Since ATC knows about all the notifications that could potentially be sent to a member, it can holistically optimize the set of notifications the member receives, selecting those that match the member’s interests and dropping less relevant ones.
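A sketch of how such a score could map onto an action; the thresholds and the combined score below are illustrative assumptions, not ATC’s actual models.

```java
// Illustrative relevance-based decision: ATC's real models predict click and
// disable likelihood; the combination and thresholds here are assumptions.
enum Decision { DROP, IN_APP_ONLY, IN_APP_AND_PUSH }

class RelevanceDecider {
    private final double dropBelow;
    private final double pushAbove;

    RelevanceDecider(double dropBelow, double pushAbove) {
        this.dropBelow = dropBelow;
        this.pushAbove = pushAbove;
    }

    /** Combines predicted click and disable probabilities and maps the result onto an action. */
    Decision decide(double pClick, double pDisable) {
        double score = pClick - pDisable;        // reward engagement, penalize disables
        if (score < dropBelow)  return Decision.DROP;
        if (score < pushAbove)  return Decision.IN_APP_ONLY;
        return Decision.IN_APP_AND_PUSH;
    }
}
```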

Challenges and design decisions

To best leverage notifications and provide LinkedIn’s more than 546 million members with the best possible member experience, ATC must be a scalable, redundant, and low-latency platform. Here are some of the challenges we faced when creating ATC and the design decisions we made.

Partitioning requests by recipient
ATC is a member-centric notifications system. The ATC platform is optimized to make personalized decisions for the recipient about each notification. In ATC, we accomplish this by partitioning the input streams consumed by ATC, which contain notification requests, tracking signals, member settings, etc., by member ID. Partitioning all requests and signals by recipient ensures that all notifications for a specific recipient are routed to the same host, and more specifically, to the same Samza task. Since all the data for the member is available on the same host, data lookups are fast. If the data either cannot be partitioned by the recipient (e.g., a spammer check on the sender) or ATC needs additional information about the sender of the notification (e.g., a member’s connection or a recruiter), ATC will make a small number of remote calls to the necessary external services.
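A minimal sketch of the producing side of this design, using the Kafka producer API: keying each request by recipient member ID is what lets the default partitioner route all of a member’s traffic to the same partition, and therefore the same Samza task. The topic name and serialization are illustrative.

```java
// Sketch: key notification requests by recipient member ID so Kafka's default
// partitioner sends every request for the same member to the same partition.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NotificationRequestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            long recipientMemberId = 12345L;
            String requestJson = "{\"type\":\"connection_invite\",\"recipient\":12345}";
            // The message key is the recipient member ID; this is what makes the
            // recipient-partitioned design work end to end.
            producer.send(new ProducerRecord<>("notification-requests",
                    Long.toString(recipientMemberId), requestJson));
        }
    }
}
```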

Storing external signals in a local store
ATC consumes member tracking events, member mobile device information, and relevance scores pushed from offline jobs, and puts this data into a RocksDB instance residing on the SSD of the processing host. The main reason for using this local RocksDB instance is to reduce the latency of retrieving this data and to increase throughput. Reads from RocksDB take only a couple of milliseconds to complete, while remote calls can take from 10ms to 100ms. Because of the partitioning by member ID, all data for one member is accessible from the same host with low latency.
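A sketch of such a local lookup using the open-source RocksDB Java client; ATC manages its stores through Samza’s state management, so the path, key layout, and class names here are purely illustrative.

```java
// Sketch of a local RocksDB store on the processing host's SSD, so reads avoid
// a network hop. Path and key encoding are assumptions for illustration.
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class LocalSignalStore implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final RocksDB db;

    public LocalSignalStore(String path) throws RocksDBException {
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    /** Write a signal (e.g., a tracking event or relevance score) keyed by member ID. */
    public void putSignal(long memberId, String signalJson) throws RocksDBException {
        db.put(key(memberId), signalJson.getBytes(StandardCharsets.UTF_8));
    }

    /** Local read: typically a couple of milliseconds, versus 10-100ms for a remote call. */
    public String getSignal(long memberId) throws RocksDBException {
        byte[] value = db.get(key(memberId));
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }

    private static byte[] key(long memberId) {
        return Long.toString(memberId).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() { db.close(); }
}
```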

Scalability, fault tolerance, and redundancy
ATC is built on Samza, which provides scalability, fault tolerance, durability, and resource management. Within a data center, ATC achieves redundancy by sending all local state to Kafka. If the job needs to be rescheduled on a new host, the local state is rebuilt by consuming the stored state from Kafka. ATC also has cross-data-center redundancy by running a copy of the job in additional data centers. If the job in one data center goes down, traffic can be shifted to another data center.

Traffic patterns, latency, and throughput
ATC processes many types of notifications. Each notification type may have drastically different traffic patterns and latency requirements. For example, a daily offline job like the Daily Rundown may produce a daily batch of a hundred million requests and can tolerate a latency of a couple of hours; however, member-to-member messaging has a steady QPS and requires the latency to be less than a second. For the Daily Rundown, we spread the requests over a range of time using aggregation and delivery time optimization. For messaging, we use high-priority Kafka topics to prioritize member messages.

One way to decrease latency and increase performance is to increase the throughput in ATC. ATC aggressively uses thread parallelism to handle remote calls via the Samza Async API. This helped bring down the 90th-percentile (P90) end-to-end latency for member-to-member messaging push notifications from about 12 seconds to about 1.5 seconds.
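A sketch of this pattern using Samza’s AsyncStreamTask, where the remote call completes the task callback when it finishes so many messages can be in flight at once; the push-service client is hypothetical, and error handling is trimmed for brevity.

```java
// Sketch of parallelizing remote calls with Samza's async task API; the push
// client interface is an assumption for illustration.
import java.util.concurrent.CompletableFuture;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.AsyncStreamTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.TaskCallback;
import org.apache.samza.task.TaskCoordinator;

public class PushNotificationTask implements AsyncStreamTask {
    /** Hypothetical non-blocking client for the downstream push service. */
    interface PushClient {
        CompletableFuture<Void> send(String memberId, String payload);
    }

    private final PushClient pushClient;

    public PushNotificationTask(PushClient pushClient) {
        this.pushClient = pushClient;
    }

    @Override
    public void processAsync(IncomingMessageEnvelope envelope, MessageCollector collector,
                             TaskCoordinator coordinator, TaskCallback callback) {
        String memberId = (String) envelope.getKey();
        String payload = (String) envelope.getMessage();
        // The remote call runs off the main event loop; completing the callback
        // tells Samza this message is done, so others can be processed in parallel.
        pushClient.send(memberId, payload)
                .whenComplete((ignored, error) -> {
                    if (error != null) {
                        callback.failure(error);
                    } else {
                        callback.complete();
                    }
                });
    }
}
```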

However, the tradeoff for parallelizing request processing is that requests might be processed out of order. Samza checkpoints advance sequentially; therefore, if a request has been processed but the ATC Samza job restarts before the checkpoint covering that request is committed, the request will be processed again. ATC handles duplicated requests by dropping any request it has already processed, and it tolerates requests that are processed slightly out of order.
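A minimal idempotence sketch, assuming a hypothetical processed-request store (in ATC this state would live in RocksDB):

```java
// Illustrative dedup check for at-least-once delivery; the store interface is an assumption.
interface ProcessedRequestStore {
    boolean contains(String requestId);
    void markProcessed(String requestId);
}

class Deduplicator {
    private final ProcessedRequestStore processed;

    Deduplicator(ProcessedRequestStore processed) {
        this.processed = processed;
    }

    /**
     * Returns true exactly once per request ID; replays that arrive after a
     * restart (or slightly out of order) are silently dropped.
     */
    boolean firstTimeSeen(String requestId) {
        if (processed.contains(requestId)) {
            return false;
        }
        processed.markProcessed(requestId);
        return true;
    }
}
```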

Range query
Every time a notification is processed in ATC, ATC needs to read from various local RocksDB stores. To increase performance, it needs to read multiple items from a single store at one time (e.g., all notifications delivered to a member over the previous X days). ATC uses range queries to take advantage of data locality on disk. Range queries also allow us to query when we don’t know the full key, for example, to find the total number of requests scheduled in a certain time period. One alternative would be to store everything as a list in a single entry. In some ways, this is better than a range query (e.g., read time, since you only read one key-value pair). However, it has disadvantages: updating the list requires a DB transaction, data retention is harder, and individual entries can grow very large.

RocksDB is based on a log-structured merge-tree, and keys are sorted by their serialized bytes. Therefore, we designed our keys to lead with the member’s ID, so that items belonging to one member are always located contiguously on disk. For items that need to be accessed by a scheduled time, the keys also include that timestamp, so that ATC can easily do a time-range query.
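A sketch of this key layout and the resulting range scan, using the RocksDB Java client directly; the exact encoding and store path are illustrative, not ATC’s actual format.

```java
// Sketch: keys lead with the member ID and end with the scheduled timestamp,
// both big-endian, so one member's items are contiguous and time-ordered on disk.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class ScheduledNotificationStore implements AutoCloseable {
    static { RocksDB.loadLibrary(); }

    private final RocksDB db;

    public ScheduledNotificationStore(String path) throws RocksDBException {
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    /** memberId then timestamp, big-endian, so byte order matches (positive) numeric order. */
    private static byte[] key(long memberId, long scheduledAtMillis) {
        return ByteBuffer.allocate(16).putLong(memberId).putLong(scheduledAtMillis).array();
    }

    public void schedule(long memberId, long scheduledAtMillis, byte[] payload) throws RocksDBException {
        db.put(key(memberId, scheduledAtMillis), payload);
    }

    /** Returns all payloads for one member scheduled in [fromMillis, toMillis). */
    public List<byte[]> scheduledBetween(long memberId, long fromMillis, long toMillis) {
        List<byte[]> results = new ArrayList<>();
        byte[] upperBound = key(memberId, toMillis);
        try (RocksIterator it = db.newIterator()) {
            // Seek to the first key at or after (memberId, fromMillis), then walk
            // forward until we leave the range; this is one sequential scan on disk.
            for (it.seek(key(memberId, fromMillis)); it.isValid(); it.next()) {
                if (Arrays.compareUnsigned(it.key(), upperBound) >= 0) {
                    break;
                }
                results.add(it.value());
            }
        }
        return results;
    }

    @Override
    public void close() { db.close(); }
}
```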


