This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.In the last posts we covered how we built a new API framework, and we clearly identified the gains in terms of performance and shared abstraction layer between languages and devices. But how did we make an entire engineering organization switch to the new framework?
How did we achieve the cultural transformation to API first? How do we avoid this being the new thing that everyone knows about, but no one has the time to try?
How did we “sell” it?
In our case, we had multiple strategies that worked together.
The first one was communication: It was simple to write new endpoints, and the gains were very clear: Both the performance gain through concurrent data fetch, and the possibility to share an endpoint with the apps, were huge selling points.
We partnered with the Product organisation to make clear that they needed to include a standard question in their project templates “Is this being built on APIv3? If not, why not?”, which enforced the company strategy to adopt Mobile First development.
We had pilot groups that tried out the new API framework for new product features, we partnered closely with the Activity Feed team to help them adopt APIv3. This resulted in early converts that were strong advocates for wider adoption. The benefits were clear to communicate but also so compelling that we didn’t need to do much to sell teams, they could see what the activity feed had accomplished.
We had evolving documentation about the architecture and how to use the framework, in the API handbook. We had a codelab, which engineers could use for self study. In the lab you learn to build endpoints by building a little library app.
We had workshops in which people did the code lab together with experienced API developers.
We had tooling to learn about the system, such as the distributed tracing tool CrossStitch, the fanout manager informing about the complexity of the fetch, and datafindr for general information about API endpoints and example calls and results.
Once it’s going, too fast to keep up
After the word was spreading, so many people were making new endpoints that it was almost impossible for us to keep up with it. Initially, we alerted on each new endpoint, but had to switch this off due to the rate at which new endpoints were being built.
The biggest motivator for adoption was that we communicated the gains. Everyone became interested in the performance gains and the cross-device-sharing through abstraction, which was a motivating incentive to switch. Also, there was a potential speedup through caching of endpoints.
White space gets filled with code
Immediately after adoption started, we saw some misunderstandings in the code that we did not foresee. Such as magic hiding in traits, inheritance between endpoints, while they contain just declarative static functions, or really complex code in bespoke endpoints, which should be library code. The minimalist code required to write an endpoint didn’t make explicit what is missing by accidental omission vs. deliberate design decisions.
This also lead to confusion about which building blocks of endpoints are mandatory, about how to opt into a service, naming conventions, the location in the file system, and the complete route of an endpoint.
This was a caused some cleanup work for the API team, but in fact it was a good thing, caused by fast and wide adoption of a new system, and is probably inevitable in that case.
Documentation is critical
We addressed the problems by improving our documentation, and creating an interactive endpoint generator that creates a file in the right place, with stub functions, and outputs the route.
Also, we added format checks for endpoints in Jenkins. It was helpful to have the API code lab as a narrative introduction to gain practical experience fast, while also developing the API handbook as a reference manual. Two goals, two documents.
Pockets of Patterns
When we internally announced the project in which we unified the format for how an API endpoint is written, we started getting emails about what developers wanted from the API. And also: Mails about patterns that had emerged within the API framework without our knowledge. People had found and implemented solutions for their specific use cases. Our discussions had opened up a space to evaluate those patterns, and share them with all API endpoint developers!
We’ve learned that we need to stay on top of new patterns as they emerge, and keep talking to developers about their needs. A subtle, but powerful paradigm shift:
We, as the API framework developers, listen to endpoint developers and address everyone’s needs by evolving the framework. Instead of having developers using our API framework as a service, we are serving them.
Also, we learned to trust our fellow developers even more! We underestimated their curiosity and willingness to try out new things, we underestimated the adoption rate, and we also underestimated how creative they would be with finding solutions for their specific use cases that we did not plan for. What an awesome surprise. 😀
Types too late + types too loose D:
Also, we underestimated everyone’s willingness to do the work of typing their endpoint result. In the beginning, we made specifying a resultType optional, because we feared making it mandatory would slow the adoption. And if a type was specified, we only sampled the type errors, to not make correct typing a hurdle during API endpoint development, but rather a “nice to have” hint for when things go wrong. Not a guarantee. In retrospect, we could have saved ourselves a lot of extra work if we had made the resultType mandatory, and if we had made the type errors more prominent from the beginning.
Etsy’s developers are generally happy about result types, they helped to implement a coarse grained gradual typing, and actively pushed us towards making the result types mandatory, to rely on them as a guarantee.
Work in progress
There remain four open problems we all touched upon:
- Active Cache Invalidation – it’s hard.
- Caching the API at different geographic locations (Edge)
- opening up v3 to 3rd party developers
A question that comes with the developer adoption question is: What are we thinking in terms of preparing for third parties? This is a valid question, and we did not fully answer it yet, because we switched to generated client code in the meantime.
Even though we started announcing the new API v3 on our developer mailing lists, all third party apps are still on version 2. The platform for v3 is ready to open up to 3rd parties, as soon as Etsy as a whole decides to make the switch.
What did we learn? How were we transformed?
This story is a case study of how the API first approach transformed the architecture, and also transformed how we work with and think about our API at Etsy. It covers a lot of ground and is influenced by our infrastructure and history. The story of architectural decisions and adoption is transferable to other systems though. So what did we learn from the decisions? Or from the decisions not being made yet? Were there surprises?
We learned that we can seamlessly grow a system from a hack day project to a live system. Over time, a domain specific language of endpoints evolved, and the system grew according to the endpoint developers needs.
Also, we learned a great deal about cURL or the according php extension! Not only does it allow to make parallel requests, but it also let’s us check on, control and modify the in-flight requests in a non-blocking way.
Another realization is that we should have thought about caching early on. We added it in a hurry and are still working on active cache invalidation. So far, we can only use timeouts, which limits the class of endpoints that can be cached. Also, we have to think some more about variations due to different locales, which might only vary for parts of the response.
A huge positive surprise was the HHVM experiment. By teeing traffic and trying out a completely new system, we solved our performance problems to this day.
The textbook approach says: design APIs contract first. As we have seen, we circumvent that part by using automated client generation. Components help us with abstraction, but even the bespoke layer is very malleable. This is an interesting trick.
Another surprising learning is that we should trust our developers not only to adopt and adapt to the new API framework, but also to go all the way and make typing of endpoints mandatory from the start.
Also, we should keep the conversation ongoing to find out how the API framework can serve developers and their needs, instead of just designing it as a rigid service. We even found solutions that they had created themselves, and the framework could officially adopt them for everyone’s benefit.
Did it work? How did it end? Did we succeed?
Let’s assess this with a quote from Etsy’s Andrew Morrison, from before all this work started:
“We desperately need to figure out a scheme for allowing concurrency or else we’re going to have performance problems forever.”
YUP! We solved this, and a few unexpected things along the way. Despite initial problems with the extra layer, we figured out unconventional solutions by experimenting and organically growing the system towards the developers needs.
What we are discussing now is how to shift some of the complexity back towards the client. Maybe GraphQL is an interesting approach? Right now, the clients don’t know how the queries will be executed. This makes sense if you have services and teams and clear cut boundaries and interfaces, and if you have a contract in form of an API. Our approach is currently not structured like that.
But could we compile an alternative, more knowledgeable PHP client, that lifts the composability from the HTTP layer into the API consumer code, in cases where we are our own consumer and create a website of the same tree structure? It’s clear in this case how many endpoints we are calling, and as long as it’s us, it’s safe to lift this control beyond the client, into the view.
This is the third post in a series of three about Etsy’s API, the abstract interface to our logic and data.