Over the year and a half that we’ve been using Iris, we’ve onboarded many new services and, consequently, have seen an explosion in the number of incidents it handles. With that growth comes the responsibility of guaranteeing reliable escalation for our users. Because so much of LinkedIn’s monitoring is routed through Iris, it holds a unique place in our monitoring and alerting stack, and a single small misstep can have a huge impact on the entire company. Reliability is therefore critical.
Given this position, it is perhaps surprising that Iris itself has experienced only one major outage in its lifetime at LinkedIn. Though no system is perfect, Iris is remarkably reliable, and its stability is in large part due to one of our core design principles: keeping Iris simple. Iris has very few moving parts: message delivery is abstracted away by Twilio and other messaging vendors, and alerting is controlled by outside triggers. This means that Iris only concerns itself with ensuring that incidents are acknowledged. Though it has additional quality-of-life features to make incident acknowledgement easier, at its heart, Iris is just a messenger. Limiting the scope of Iris to delivering messages reliably has allowed it to become a focused, elegant, resilient service that is a cornerstone of our alerting system today.
Culture of contribution
Another key influence on the internal development of Iris and Oncall at LinkedIn was contributions from other teams. Much as with an external open source project, the positive reputation of both Iris and Oncall led many teams to want to extend them for new use cases. This created a virtuous cycle: the projects became applicable to more users and, in turn, attracted more contributions.
Future plans and development
In designing and developing Iris, we decided to build our own escalation system partly because of cost, but also because of the advantages of tailoring the system to our own specific use cases. In addition, we found the incident escalation and on-call management domains to be largely unaddressed in the open source community, and we’re happy to fill in the gaps by presenting Iris and Oncall.
By providing Iris and Oncall as open source products, we can offer the community a production-ready escalation system that is free, open, and growing. We have lots of plans in store for these products, ranging from making Iris’ sender more reliable to improving the UX of Oncall’s automatic scheduler. Code and documentation for these products can be found at https://iris.claims and https://oncall.tools, and we welcome any potential users or contributors. You can also reach our team with general questions about either project by emailing email@example.com. Iris’ and Oncall’s stories are just beginning, and we’re excited to see our products evolve and grow into a world-class incident management system.
Iris and Oncall are the result of the hard work of Wen Cui, Saif Ebrahim, Joe Gillotti, Qingping Hou, Fellyn Silliman, Jessi Reel, and Daniel Wang. Thanks go to Richard Waid and the entire Monitoring Infrastructure team at LinkedIn, as well as the SRE organization as a whole.