Dalton is a software engineering intern on Coursera’s Growth Monetization team. He is from Toronto and studies Software Engineering at the University of Waterloo. He chose to do an internship at Coursera because he believes in their mission of creating a world in which anyone, anywhere, can transform their life by accessing the world’s best learning experiences. This article is about the work he did to improve Coursera’s subscription renewal system.
What Is a Specialization?
One of Coursera’s educational products is a specialization, in which a learner can complete a series of rigorous courses, culminating in a hands-on capstone project, to master a career-specific skill. A learner pays for access to the courses in a specialization and its credentials through a monthly subscription.
How Does Coursera Charge Learners for Subscriptions?
Coursera uses a third-party online payment service provider (PSP) to process subscription payments. When a learner purchases a specialization, they are charged the subscription fee at the start of each billing cycle. Once the payment has been made, the learner is granted access to the specialization for the duration of that billing cycle.
Renewing Subscriptions — the Polling Model
How does Coursera know when our PSP has auto-renewed a learner’s subscription? A recurring job polls the PSP’s subscriptions API for updates on the status of subscriptions. If a subscription’s status changes on the PSP, we update the subscription on Coursera to reflect the change. For example, if a subscription was successfully charged for the next billing cycle, we grant the learner access to the specialization for the duration of that billing cycle.
This implementation was great for Coursera as a smaller start-up. It was quick, easy, and got the job done. However, as Coursera grew we began to see some drawbacks.
Areas Of Improvement
Imagine a learner who wishes to cancel a subscription because they have decided that the specialization isn’t right for them. They log in to Coursera, go to their purchases, and cancel their subscription on the last day of the billing cycle. Little do they know that the PSP has already charged them for the next billing cycle! The learner can’t see this because the recurring job hasn’t run yet, so the charge is not yet reflected on our site. The learner is likely confused and unhappy.
While this situation is extremely rare, when it happens it requires us to detect and issue a refund, which is not an ideal experience for the learner. This is an important issue because we strive to provide the world’s best learning experience.
We already run the scheduled job multiple times daily to reduce the chance of this edge case happening. There is an upper limit to how much this helps: the job takes hours to run, so any further increase in frequency will see diminishing returns.
The job uses a heuristic to select a batch of candidate subscriptions that may require an update. Polling the PSP’s subscriptions API for subscriptions that turn out not to need an update is wasted time and resources. In addition, an API call to the PSP is made for every subscription in the batch. This is manageable now, but if Coursera had to process ten times the subscriptions it does today, we are not confident the renewal job could handle the load.
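To make the cost of polling concrete, here is a minimal sketch of one pass of such a job. It is not Coursera's actual code: the `Subscription` class, the status strings, and the `fetch_status` callable (standing in for a call to the PSP's subscriptions API) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subscription:
    subscription_id: str
    status: str  # hypothetical values such as "active" or "canceled"


def poll_batch(batch: List[Subscription],
               fetch_status: Callable[[str], str]) -> Dict[str, int]:
    """Poll the PSP once per subscription and apply any status change locally."""
    updated = 0
    for sub in batch:
        remote_status = fetch_status(sub.subscription_id)  # one API call each
        if remote_status != sub.status:
            sub.status = remote_status
            updated += 1
    # Calls that found no change are the wasted work described above.
    return {"api_calls": len(batch), "updated": updated,
            "wasted_calls": len(batch) - updated}


# Usage with an in-memory stand-in for the PSP's subscriptions API.
psp_statuses = {"sub-1": "active", "sub-2": "canceled", "sub-3": "active"}
batch = [Subscription(sid, "active") for sid in psp_statuses]
stats = poll_batch(batch, psp_statuses.__getitem__)
```

The number of API calls grows with the size of the batch regardless of how many subscriptions actually changed, which is why the approach gets more expensive as the platform grows.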
As our platform grows we want to ensure the subscription renewal system can grow with it. Because the polling model does not scale well, we needed to re-engineer the renewal system such that it can scale better and aim to eliminate edge cases that result in poor experiences for learners. In addition, changes to the system need to happen in a safe and auditable manner.
Webhooks as a Solution
Our PSP offers a service called webhooks, which we used to move off the polling system. With webhooks, we set up an endpoint that receives a POST notification whenever a subscription’s status changes. Instead of polling the PSP for status changes, we are notified of them as they occur, following the Hollywood Principle (“don’t call us, we’ll call you”).
Using webhooks eliminates the drawbacks in the polling system. The notifications arrive quickly once the status change occurs, significantly reducing the time between a status change on our PSP and the corresponding change on Coursera. We also avoid making API calls to the PSP.
The PSP posts a notification object to our endpoint. The endpoint bundles information from the notification object into a message and publishes it to a Kafka topic. The consumer is a subscriber to this topic and receives messages for consumption, essentially executing the same business logic that occurs in the scheduled job.
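The flow above can be sketched in a few lines. This is only an illustration under stated assumptions: a standard-library `queue.Queue` stands in for the Kafka topic, and the notification field names (`kind`, `subscription_id`) and the function names are hypothetical, not the PSP's or Coursera's real schema.

```python
import json
import queue
from typing import Dict

# Stand-in for the Kafka topic; a real system would use a Kafka producer/consumer.
topic: "queue.Queue[str]" = queue.Queue()


def handle_webhook(notification: Dict[str, str]) -> int:
    """Endpoint logic: bundle the fields we need into a message and publish it."""
    message = {
        "kind": notification["kind"],
        "subscription_id": notification["subscription_id"],
    }
    topic.put(json.dumps(message))
    return 200  # acknowledge the POST; business logic runs in the consumer


def consume_one(latest_event: Dict[str, str]) -> None:
    """Consumer logic: read one message and record it against the subscription."""
    message = json.loads(topic.get_nowait())
    latest_event[message["subscription_id"]] = message["kind"]


# Usage: one notification travels from the endpoint to the consumer.
latest_event: Dict[str, str] = {}
status_code = handle_webhook({"kind": "subscription_canceled",
                              "subscription_id": "sub-2",
                              "unused_field": "ignored"})
consume_one(latest_event)
```

Keeping the endpoint thin means it only validates and publishes, while the consumer carries the same business logic as the scheduled job.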
We decided to use Kafka because Coursera has great infrastructure built around its producer and consumer APIs, which made it extremely easy to configure a new message stream (a.k.a. a topic, for those familiar with Kafka) for subscription renewals. Additionally, Kafka keeps a record of messages that can be recovered in the event of failure. This is important for the payments system.
One should exercise extreme caution when making a high-impact, system-level change. An error in the subscription renewal system could have devastating consequences for Coursera and our learners. Luckily, I was able to leverage a handful of tools to minimize potential negative effects on our system.
An obvious technique is to consume only a fraction of the published messages and slowly increase the volume as we gain confidence. I manually validated the correctness of the consumer by analyzing our logs and database for a very small number of subscriptions, then gradually increased the fraction of messages consumed. The progression was 1%, 5%, 10%, 25%, 50%, and finally 100%.
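One common way to implement a staged rollout like this is deterministic, hash-based bucketing. The sketch below is an assumption about how it could be done, not a description of the mechanism actually used; the function name `in_rollout` and the choice of SHA-256 are illustrative.

```python
import hashlib

ROLLOUT_STAGES = [1, 5, 10, 25, 50, 100]  # percentages from the rollout above


def in_rollout(subscription_id: str, percent: int) -> bool:
    """Return True if this subscription falls inside the current rollout percentage.

    The subscription id is hashed into a stable bucket in [0, 100), so the
    decision is repeatable and a subscription admitted at one stage stays
    admitted at every later, larger stage.
    """
    digest = hashlib.sha256(subscription_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Usage: count how many of 1,000 sample ids are handled at each stage.
sample_ids = [f"sub-{i}" for i in range(1000)]
handled_per_stage = {
    pct: sum(in_rollout(sid, pct) for sid in sample_ids) for pct in ROLLOUT_STAGES
}
```

Because the bucket depends only on the id, raising the percentage only ever adds subscriptions to the webhook path, which keeps manual validation at the early stages meaningful for later ones.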
Encoded in the Kafka topic’s messages is the type of notification received, and the notification type determines the action performed on the subscription. For example, one notification type is Subscription Charged Successfully. When the consumer receives this notification it extends the learner’s ownership of the specialization associated with the subscription. Another type is Subscription Canceled. When the consumer receives this notification it revokes the learner’s ownership of the specialization.
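A minimal sketch of this kind of dispatch follows. The type strings, handler names, and the 30-day cycle length are assumptions for illustration only; unknown notification types are skipped rather than guessed at.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Optional

# Maps subscription id -> date access runs through (None means revoked).
Access = Dict[str, Optional[date]]

BILLING_CYCLE = timedelta(days=30)  # assumed length of a monthly cycle


def on_charged(access: Access, subscription_id: str, today: date) -> None:
    """Extend the learner's ownership for another billing cycle."""
    access[subscription_id] = today + BILLING_CYCLE


def on_canceled(access: Access, subscription_id: str, today: date) -> None:
    """Revoke the learner's ownership of the specialization."""
    access[subscription_id] = None


# Hypothetical type names; the PSP's real identifiers may differ.
HANDLERS: Dict[str, Callable[[Access, str, date], None]] = {
    "subscription_charged_successfully": on_charged,
    "subscription_canceled": on_canceled,
}


def dispatch(access: Access, kind: str, subscription_id: str, today: date) -> bool:
    """Run the handler for this notification type; return False if none exists."""
    handler = HANDLERS.get(kind)
    if handler is None:
        return False
    handler(access, subscription_id, today)
    return True
```

A table of handlers keyed by type makes it straightforward to enable one notification type at a time: a type with no registered handler is simply ignored.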
Our PSP conveniently allows us to configure the types of status updates for which we receive POST notifications. I used this tool to start with one type, enabling more notification types only once I was confident the system was performing the correct actions.
Polling As a Fallback Mechanism
The webhook system may fail to process a subscription for many reasons. An error could occur in the publisher, on Kafka, or in the consumer. A failure at one of these points means the subscription doesn’t update in Coursera’s database. The scheduled job still runs daily and “catches” subscriptions that failed to update via webhooks. Thus we have a hybrid system that processes most subscriptions via asynchronous messaging and the remaining ones via polling.
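As a rough illustration of the fallback, the scheduled job only needs to look at subscriptions that the webhook path appears to have missed. The selection rule below (still marked active locally, but with a paid-through date in the past) is a hypothetical heuristic, and the class and field names are assumptions rather than Coursera's real schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class LocalSubscription:
    subscription_id: str
    status: str          # hypothetical values such as "active" or "canceled"
    paid_through: date   # last day of the billing cycle we have recorded


def missed_by_webhooks(subs: List[LocalSubscription], today: date) -> List[str]:
    """Select subscriptions whose cycle has ended but that were never updated.

    If a webhook had been processed, the subscription would either have a
    later paid_through date (renewed) or no longer be active (canceled).
    """
    return [
        s.subscription_id
        for s in subs
        if s.status == "active" and s.paid_through < today
    ]
```

Only the ids returned here would need a polling call to the PSP, so the fallback stays small as long as the webhook path handles most updates.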
Logging and Metrics
Abundant logging in the components of the webhook system gave me confidence I could quickly identify the root cause of potential bugs and retain an audit trail of events.
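One way to keep such an audit trail easy to search is to emit one structured record per event from each component. The sketch below uses Python's standard `logging` and `json` modules; the logger name, field names, and helper functions are assumptions for illustration.

```python
import io
import json
import logging
from typing import TextIO


def make_audit_logger(stream: TextIO) -> logging.Logger:
    """Build a logger that writes one JSON record per line to the given stream."""
    logger = logging.getLogger("renewal.audit")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep audit records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def audit(logger: logging.Logger, component: str, event: str,
          subscription_id: str) -> None:
    """Record which component did what to which subscription."""
    record = {"component": component, "event": event,
              "subscription_id": subscription_id}
    logger.info(json.dumps(record, sort_keys=True))


# Usage: trace one notification through the publisher and the consumer.
buffer = io.StringIO()
audit_log = make_audit_logger(buffer)
audit(audit_log, "publisher", "message_published", "sub-2")
audit(audit_log, "consumer", "ownership_revoked", "sub-2")
```

With a subscription id on every record, following a single renewal through the endpoint, the topic, and the consumer becomes a simple filter over the logs.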
Takeaways
Use logging/event tracking extensively: When making a high-risk change, the more detailed your audit trails are the better. If something goes wrong (and you should always prepare for something to go wrong) you can use your detailed event tracking to find the root cause. If something doesn’t go wrong… you can give yourself a pat on the back and simply remove the excess event tracking code.
Constantly solicit feedback from other engineers: This point is especially important for an intern such as myself. Given how open-ended this project was it didn’t make sense to work for too long in isolation. My teammates provided me with excellent perspectives and approaches to problems whenever I pulled them in to discuss my work.
Double check your project estimates: Caught up in the excitement of project planning I may have been too ambitious with the project timeline. This takeaway is just a reminder to take a second look at your estimates and ask yourself whether they’re realistic. Are you giving yourself time for being blocked on other people’s work? What about the known unknowns? The unknown unknowns?
During my internship I worked on revamping Coursera’s subscription renewal system in order to improve efficiency and Coursera’s learning experience. It was an incredible opportunity to re-design architecture, manage a rollout, and make an impact at Coursera. Thank you to all of the kind and talented engineers at Coursera who helped me bring this to fruition!