At Square, the Developers team exposes APIs that allow third-party developers to build custom business-processing solutions. On October 18, we announced to developers using our APIs that webhooks would now support retries. If a webhook cannot be delivered, our system will now be able to retry multiple times until a successful delivery.
Square’s webhook delivery system is a service I’ll refer to as Webhooks. Improving Webhooks’ reliability was a project I had the opportunity to dive into the first week I started at Square as a new college graduate. From the beginning, we wanted our Webhooks upgrades to require no change on the part of our external developers, in stark contrast to the amount of research, planning and work that went into the project. In fact, the announcement we sent out assured developers that they “should not need to make any changes to support webhooks with retries.” My manager remarked that it was “the most invisible change ever” — and that is how it should be. The focus of our team is to enable developers to easily integrate with Square’s APIs and provide them with a reliable and simple-to-use experience. Perhaps it was my manager’s remark about our “invisible” project that inspired me to share some insight into not only our webhooks reliability project, but also some of our “visibility” practices here at Square.
In web development, webhooks are HTTP callbacks that are triggered by specific events. As soon as an event occurs, the developer’s application is notified of it and can handle it in real time. Without webhooks, developers may have to constantly poll an API endpoint in order to detect events. In this post, I want to cover how we designed Webhooks at Square and the changes we made to improve reliability. If you want more technical details about Webhooks at Square, check out our earlier blog post. The most engaging way to explain webhooks is to follow the lifecycle of a single webhook notification through Square’s infrastructure.
Let’s say that I own an online store called Lindy’s Laughing Llamas and that it uses Square’s APIs. The application that runs my online store receives webhook notifications from Square. When a llama-loving customer purchases a llama from Lindy’s Laughing Llamas, I receive a webhook notification informing me that someone just made a payment. Between the time of purchase and the time of notification, a lot happens to transform this payment “event” into the webhook notification that I receive. After the customer purchases a llama, the payment event is published to a feed, which many internal services at Square read. The Webhooks service owned by our team reads this payment event from the feed, transforms the event into a webhook notification, and delivers this notification to the endpoint specified by the Lindy’s Laughing Llamas application. However, during periods of high traffic, events could pile up in the feed, and subsequent notifications could be delayed.
Intuitively, we needed a way to separate reading events from the feed from delivering notifications. Although there are many approaches to solving this problem, the simplest approach is to create separate thread pools for reading events from the feed and for delivering notifications. Our solution was to shift Webhooks’ delivery mechanism to the cloud using Amazon Web Services (AWS). Our primary motivation for moving Webhooks to the cloud was to lower Square’s system complexity and costs, since we would no longer have to maintain Webhooks in our own data centers. We could also improve Webhooks’ reliability by using well-documented and commonly used cloud infrastructure.
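To make the idea of separating reading from delivery concrete, here is a minimal sketch of the thread-based approach described above, not of the AWS-based system we actually shipped. The function name run_pipeline, the event fields, and the deliver callback are all hypothetical and exist only for illustration.

```python
import queue
import threading


def run_pipeline(events, deliver, num_workers=4):
    """Read events on one thread; deliver notifications on a separate pool.

    `events` is any iterable of dicts with an "id" and a "kind" key, and
    `deliver` is a callable that receives one notification dict. Both are
    illustrative stand-ins, not Square's real feed or delivery code.
    """
    pending = queue.Queue()   # buffer between the reader and the workers
    delivered_ids = []
    lock = threading.Lock()

    def reader():
        # Reading never waits on a slow endpoint: it only enqueues.
        for event in events:
            notification = {"event_id": event["id"], "event_kind": event["kind"]}
            pending.put(notification)
        # One sentinel per worker tells each worker to stop.
        for _ in range(num_workers):
            pending.put(None)

    def worker():
        while True:
            notification = pending.get()
            if notification is None:
                break
            deliver(notification)
            with lock:
                delivered_ids.append(notification["event_id"])

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return delivered_ids
```

Because the reader only enqueues work, a slow or unresponsive endpoint ties up a delivery worker rather than the thread that reads the feed. The order of the returned IDs is not guaranteed, since workers finish independently.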
In our new Webhooks system, the lifecycle of a Lindy’s Laughing Llamas payment event becoming a webhook notification changes slightly. After a customer purchases a llama and the payment event is published to a feed, our Webhooks service reads it just as it did before. Once the event is transformed into a webhook notification, our service then sends the notification as a message to AWS. Our tools in AWS contain logic to deliver the message to the Lindy’s Laughing Llamas application server. If Lindy’s Laughing Llamas takes too long to respond to the message, or is unable to accept a message at the time of delivery, AWS will retry. With every subsequent delivery attempt, AWS increases the time between retries; this backoff strategy allows the message to keep being retried without overwhelming the server. Additionally, AWS sends metrics about all delivery attempts back to Square.
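The retry behavior can be sketched in a few lines. This is only an illustration of a backoff strategy, not the logic that runs in AWS: the doubling factor, the one-second starting delay, the attempt limit, and the names deliver_with_retries and send are all assumptions made for the example. The only thing stated above is that the wait grows with each attempt.

```python
import time


def deliver_with_retries(send, notification, max_attempts=5,
                         base_seconds=1.0, factor=2.0, sleep=time.sleep):
    """Try to deliver a notification, waiting longer after each failure.

    `send` is a hypothetical callable that returns True on success and
    returns False or raises on failure. Returns (delivered, attempts_made).
    """
    for attempt in range(max_attempts):
        try:
            if send(notification):
                return True, attempt + 1
        except Exception:
            # Treat any error (timeout, refused connection) as a failed attempt.
            pass
        if attempt < max_attempts - 1:
            # Assumed schedule: base, base * factor, base * factor**2, ...
            sleep(base_seconds * factor ** attempt)
    return False, max_attempts
```

Passing in the sleep function makes the schedule easy to inspect without actually waiting. With the defaults above, a server that fails twice and then recovers would see waits of 1 and 2 seconds before the third, successful attempt.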
The end result? A developer of the Lindy’s Laughing Llamas application can stop polling Square’s APIs, since webhook notifications will arrive in a timely manner. If the application server is busy, a developer doesn’t have to worry about missing notifications, because they will be retried. From the perspective of the developer, no changes were necessary to get the new timely notifications and retries.
Visibility and Impact
What surprised me at the conclusion of this project was the amount of visibility it had within Square, as well as other projects. On the day before the launch, we sent a company-wide product update email. I was impressed with the meticulousness with which my team combed through old emails, work items, and document comments for people outside of our team who contributed to our project. From code review to design advice, the contributions of these people were not forgotten. And within minutes, the email received “Reply All” responses, expressing congratulations and providing context for the impact of a better Webhooks on Square’s developer platform.
My team also had the opportunity to share our learning experience. The Developers team has a bi-weekly “lunch and learn” meeting where we get together to present about new technologies, frameworks, and the various services and projects we are working on. These lunch and learn meetings highlight Square’s emphasis on knowledge sharing within the organization. By creating awareness of Webhooks and the technologies our team adopted, the teams we work most closely with can use our learning experience to develop their own projects.
Oh, and Webhooks wasn’t merely visible — it was also audible. After a launch, it is Square tradition to ring the gong, and we let the office know loud and clear that we shipped something new to production.
Being able to share our knowledge and learning experience, both within Square and outside of Square, is meaningful, but it was just as important to hear feedback from our developers. One developer appreciated our back-off retry policy because it avoided “hammer[ing] [the] server when it’s already having a rough time.” In the week after we launched the improved Webhooks, we were able to successfully deliver over 31,000 notifications that had failed on their first delivery attempt. Being able to quickly make an impact on Webhooks — as a new hire still learning the ropes — inspires me to continue improving Webhooks and start on other projects that help not only our fictional Lindy’s Laughing Llamas, but real-world developers who want to use Square to process online and in-person payments seamlessly.