Our new system created a single centralized copy of each message in a conversation, regardless of the number of participants.
What might not be obvious from the above list is the critical role of custom business logic during the migration process. The existing system had a significant amount of code that was only required for a single use-case. We ended up creating nearly 60 separate converters for custom pieces of business logic that had been built into the old system—this took six months of untangling highly specialized code.
The end result was that messaging was a tangled mess. The code was owned by the messaging team, but the logic and reasoning behind it were owned by separate teams around LinkedIn. The logic then became more difficult to change because another team was responsible for all decisions around a particular feature. Developers started to be overly careful and eventually afraid of making changes. It began to take months to make what should have been a simple change, such as adding a new use case.
From a more technical standpoint, this complexity ended up causing problems as we attempted to scale across multiple datacenters. For example, the database contained a shard key that could not represent new types of participants in a thread. In addition, multiple places across the codebase had to handle older conversations with the same thread ID as a special case lookup. It also became very difficult to change even small things in the business logic, let alone something as critical as the primary key to a message or thread.
In response to these database scaling issues, an early important requirement for our new architecture was to split aspects of data persistence across multiple services so that each service could manage its own tables and scale independently. While beneficial in the long run, there were some tradeoffs. For example, it took more time and more developers to build these services, and it was no longer possible to handle most writes in a single database transaction. We had to embrace distributed databases and distributed development even more fully than we had before.
The approach we took handled technical, organizational, and historical issues. Beyond solving immediate technical issues, we wanted to address long-term problems in the production environment. Software evolves quickly and is molded to fit an organization, and this was readily evident in the Messaging code. Our engineering culture is extremely collaborative, meaning that people don’t work in silos, and the growing needs of our business necessitated special case products. Ultimately, we had to find a way to separate basic messaging functions from special-case rules determined by partner teams.
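One way to picture the separation described above is a core send path that knows nothing about partner features, with special-case rules registered from outside. The following is a minimal sketch under stated assumptions, not a description of the actual LinkedIn implementation: `MessagingCore`, `register_rule`, the `use_case` tag, and the recruiter rule are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Message:
    sender: str
    recipient: str
    body: str
    use_case: str = "default"  # hypothetical tag naming the partner feature


# A rule returns a rejection reason, or None to allow the message.
Rule = Callable[[Message], Optional[str]]


class MessagingCore:
    """Hypothetical core send path with no partner-specific logic inside it."""

    def __init__(self) -> None:
        self._rules: Dict[str, List[Rule]] = {}
        self.delivered: List[Message] = []

    def register_rule(self, use_case: str, rule: Rule) -> None:
        # Partner teams own their rules; the core only stores and calls them.
        self._rules.setdefault(use_case, []).append(rule)

    def send(self, message: Message) -> Optional[str]:
        # Run only the rules registered for this message's use case.
        for rule in self._rules.get(message.use_case, []):
            reason = rule(message)
            if reason is not None:
                return reason
        self.delivered.append(message)
        return None


# Hypothetical partner-owned rule, defined outside the core.
def recruiter_body_limit(message: Message) -> Optional[str]:
    if len(message.body) > 20:
        return "body too long for recruiter outreach"
    return None


core = MessagingCore()
core.register_rule("recruiter_outreach", recruiter_body_limit)
```

In this sketch, adding a new use case means registering another rule rather than editing `send`, which is the property the paragraph above is after.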
After we examined our own system and a few outside messaging products, we found ourselves at a crossroads: Do we iterate on what we have or do we create something completely new?
The answer was not clear-cut. We concluded that an iterative approach would likely leave compromises in our system that would still be present today, so we chose to create something new, focusing on the following benefits:
- Faster iteration on interfaces and stored data without production constraints
- Ease in enforcement for fundamental architecture changes, such as dependency chains and primary keys, from an organizational standpoint
To kick off work on our new system, we had to set the high-level vision and a consistent set of best practices. A large system redesign requires developers to be able to make as many independent decisions as possible without compromising quality, and we needed to ensure that work was done in parallel as much as possible.
To handle all of this, we followed these distinct steps:
- Write a high-level architecture document that laid out major entities and services
- Find leads for each of the major services and divide the work among their teams
- Decide on a set of design principles through a joint discussion with the entire team
- Empower leads to work independently and in parallel
An architecture will only last as long as the teammates maintain good design principles. Once the team strays, it’s easy for a future tech lead to come in and request a rewrite. Good architecture lasts significantly longer if the team working on it is aligned on philosophy. Going beyond the initial design, the team that owns the architecture in the long term must also be well-versed in these principles so that they can maintain a high standard. One of the common problems with large projects stems from short-term thinking when structuring the team. It’s critical to build the team for the long term if you want the resulting system to last—this is why we set up a joint design principles team meeting across stakeholders from the start (the results of which are included below).
Before outlining actual design principles, the messaging team was tasked with collectively answering two questions:
- Why should we have design principles?
- What does it mean to have design principles?
The resulting answers were then divided into four major categories.
1. Efficient research and discovery
- “Understandable architecture”
- “Easy to figure out what the service does.”
- “Services should do one or two things well.”
- “Don’t repeat things. Less is more.”
- “Maintain consistency by following similar conventions.”
- “Do what you say in regard to both systems and people.”
2. Distributed decision making
- “Help us make decisions.”
- “It’s the thing that avoids meetings!”
3. Faster development velocity
- “Allow us to easily make changes.”
- “Let us add more developers to the project.”
- “Deploy quickly.”
4. Easier operations and maintenance
- “Clearly defined service criteria and latency metrics.”
- “Let us scale the system as traffic grows.”
In sketching out categories of improvement, it was clear that tackling these four buckets would greatly improve every stage of the development lifecycle.
First, before making any changes, a developer must figure out what’s going on in the system. Good design principles create systems that do not require specialized knowledge across a large number of different components.
Second, once equipped with this context, a developer must decide what changes to make. If there is only one architect to be consulted for every decision, that one person becomes overwhelmed and creates a bottleneck. Good design principles allow for independence across the entire development team.
Third, as a product becomes more successful, people will request more features. Good design principles allow more developers to implement features without compromising overall agility.
Finally, with a successful feature launch comes more traffic. If a system cannot scale for all users, developers will have to spend time fixing it so that it can handle the load of the business. Good design principles allow systems to scale without compromising performance.
Design principles should rarely conflict. For the cases where they did, we had a simple set of priorities to guide the Messenger team. The priorities are listed in descending order:
- Correctness. The system must function as advertised and not make mistakes. This includes data corruption, data loss, delivery failures, and user interface issues.
- Build it right. The architecture must scale gracefully with traffic, features, and developers.
- Make it fast. The operations performed by the system must have low latency when online and high throughput when asynchronous.
Messaging design principles
Here is the set of design principles for Messaging, separated into four categories. These rules are set up to help achieve the goals outlined under “Motivation” and are subservient to the values in “Priorities” if they were ever to conflict with one another.
- Single source of truth. There is only one source of truth for data, code, or knowledge. Any assumptions about how another system behaves fall under knowledge. Avoid structures that make implicit assumptions about how other systems work, and rely only on the explicit contract.
- Specialization. Every service, component, and function does one thing and does it well. It exists because nothing else provides the same functionality and that functionality can be described in one or two sentences.
- Ownership. Every piece of code and every service has at least one clearly defined owner. Owners are the designated experts and have final say in what happens in their codebase.
- Limited scope. Each service, class, or function should reference things in terms that it understands. It is impossible to control clients, so avoid coupling any client knowledge to what is being built. Never design a system that forces a service to understand more than it can control.
- Accurate names. Name services, components, classes, and functions with a common vocabulary. This will help developers understand what something does at a glance. It also helps developers learn a new area of code more quickly. Name things for the underlying concepts they represent and do not hardcode things that may change.
- Documentation is always outdated. Assume that documentation is out of date as soon as it is written. Prefer self-documenting code with concise and accurate names. Use documentation to convey the author’s intent and the reason for why the code is written in a particular way. Convey knowledge that will be useful for the next developer when they try to understand the system that was built.
- Automate. Do not waste time repeating manual tasks.
- Murphy’s Law. “Anything that can go wrong, will go wrong.” Assume everything will fail and design with failure in mind. This is especially true with network calls on mobile clients, but applies more generally to every level of engineering.
- Concise interfaces. An interface should not have more than six parameters. More than six parameters means that the interface is trying to do too much all at once.
- Orthogonal parameters. Parameters to interfaces should not have invalid combinations that allow for contradictions. The user should not be able to make mistakes with a given set of input parameters. Each parameter should have a job that is independent of the other parameters.
- Immutability where possible. Treat objects as immutable where possible and strive to avoid in/out parameters.
- Isolation. Each service is the owner of its database and no other service accesses that database. This allows developers to quickly iterate without the need to coordinate between services. Downstream clients that depend on the database are inherently asynchronous and can be changed independently.
- Security on day one. Build in security from the beginning.
- Error reporting. All errors should be logged and reported. Logs should not contain personally identifiable information, but should otherwise be helpful and accurate.
- Monitoring. If it is not monitored, it is probably broken.
- Efficiency. Ensure that people do not wait for machines. Prefer consistent low latency for online systems and high throughput for offline or asynchronous systems.
- Service level agreements. Services should adhere to the interface contract and a well-defined latency limit. Clients should not have to verify that a service is producing the correct output.
- No hidden side effects. Do not write functions that contain side effects.
- No long functions. Functions longer than 50 lines are probably trying to do too much. Refactor them into smaller pieces that have well-defined scopes.
- Localized knowledge. Group related functions in classes and related business logic in services. This makes it easier for developers to discover and diagnose issues.
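The "orthogonal parameters" and "immutability where possible" principles above can be made concrete with a small example. This is a sketch under assumptions: the `Thread` type, `ReadState` enum, and `with_read_state` function are hypothetical and do not come from the Messaging codebase.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from enum import Enum

# A non-orthogonal signature such as
#     mark(thread, is_read: bool, is_unread: bool)
# allows the contradictory call mark(t, True, True).
# A single enum parameter makes that combination impossible to express.


class ReadState(Enum):
    READ = "read"
    UNREAD = "unread"


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class Thread:
    thread_id: str
    read_state: ReadState = ReadState.UNREAD
    archived: bool = False  # independent of read_state, so any combination is valid


def with_read_state(thread: Thread, state: ReadState) -> Thread:
    """Return a new Thread; the input is left untouched (no in/out parameter)."""
    return replace(thread, read_state=state)


original = Thread(thread_id="t-1")
updated = with_read_state(original, ReadState.READ)
```

Because `Thread` is frozen, an attempt such as `original.archived = True` raises `FrozenInstanceError`, so callers cannot accidentally mutate an object that other code still holds.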
Every large-scale rebuild initiative will have bumps along the way even with thorough planning. We ended up making several pivots that fell into the following categories:
- Migration strategy. Due to a data migration catch-22, we had to pivot our initial strategy to maintain and preserve the integrity of our member data. It was not possible to migrate only new messages because they depended on and modified thread-level information which was not yet certified as correct.
- Technical choices. We realized that we needed a way to guarantee asynchronous processing beyond just a fire-and-forget approach. Some long-running operations needed to always complete successfully to provide consistency across different views of the data.
- Team structure. Even though the traditional manager hierarchy did not change, the engineering ownership shifted. We pulled in engineers from our partner teams in a joint push to accomplish this task and ownership shifted from a manager team structure to one more based on individual engineering knowledge.
- Project organization. As the project evolved, we moved from specific service owners to larger technical areas, and finally reorganized into project tracks with individual leads. We were able to maintain a high craftsmanship bar throughout this process due to the design principles mentioned earlier. One note of which we are particularly proud: We saw 24 engineers become leads for each of their tracks and execute efficiently for six months.
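To illustrate the difference between fire-and-forget and guaranteed asynchronous processing mentioned under "Technical choices", here is a minimal in-memory sketch. It is an assumption-laden illustration, not the mechanism the team actually built: `RetryingQueue`, `drain`, and the task name are hypothetical, and a production system would persist the queue durably and require idempotent handlers.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class RetryingQueue:
    """Hypothetical queue that keeps a task until its handler succeeds."""

    def __init__(self, max_attempts: int = 5) -> None:
        self._pending: Deque[Tuple[str, int]] = deque()
        self.dead_letters: List[str] = []  # tasks that exhausted their attempts
        self._max_attempts = max_attempts

    def enqueue(self, task: str) -> None:
        self._pending.append((task, 0))

    def drain(self, handler: Callable[[str], None]) -> None:
        while self._pending:
            task, attempts = self._pending.popleft()
            try:
                handler(task)
            except Exception:
                # Fire-and-forget would drop the task here; we requeue it instead.
                attempts += 1
                if attempts >= self._max_attempts:
                    self.dead_letters.append(task)
                else:
                    self._pending.append((task, attempts))


calls = {"count": 0}
completed: List[str] = []


def flaky_handler(task: str) -> None:
    # Simulates two transient failures before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    completed.append(task)


queue = RetryingQueue()
queue.enqueue("update-thread-summary")
queue.drain(flaky_handler)
```

Here the task is attempted three times and ends up in `completed`. A task that kept failing would land in `dead_letters` for inspection rather than disappear silently.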
If not for the groundwork we had set early on, it’s doubtful that we could have pulled this off and smoothly cut over to full production traffic. Our design principles and team culture allowed us to scale with the size of the project as well as pivot and adjust to the problems at hand without finding ourselves stuck on specific decisions or specific people.
Overall, we were able to apply technical and organizational lessons to create a foundation for the next messaging platform. At times, the road was neither easy nor obvious, but our design principles acted as our north star. Investing early on in our goals and team organization not only helped us push through the more challenging aspects of the project, but also paid dividends in the end. If there is a single takeaway from our journey, it is that the project succeeds when the entire team is aligned to follow best practices and empowered to make good decisions.