Microservice Architecture at Medium – Medium Engineering

Microservice Strategies

Adopting the microservice architecture is not trivial. It could go awry and actually hurt engineering productivity. In this section, we will share seven strategies that helped us in the early stage of adoption:

  • Build new services with clear value
  • Monolithic persistent storage considered harmful
  • Decouple “building a service” and “running services”
  • Thorough and consistent observability
  • Not every new service needs to be built from scratch
  • Respect failures because they will happen
  • Avoid “microservice syndromes” from day one

Build New Services with Clear Value

One may think adopting a new server architecture means a long pause of product development and a massive rewrite of everything. This is the wrong approach. We should never build new services for the sake of building new services. Every time we build a new service or adopt a new technology, there must be clear product value and/or engineering value.

Product value should be represented by benefits we can deliver to our users. A new service is required to make it possible to deliver the values or make it faster to deliver the values compared to building it in the monolithic Node.js app. Engineering value should make the engineering team better and faster.

If building a new service does not have either product value or engineering value, we leave it in the monolithic app. It is totally fine if in ten years Medium still has a monolithic Node.js app that supports some surfaces. Starting with a monolithic app actually helps us model the microservices strategically.

Monolithic Persistent Storage Considered Harmful

A big part of modeling microservices is to model their persistent data storage (e.g., databases). Sharing persistent data storage across services often appears to be the easiest way to integrate microservices together, however, it is actually detrimental and we should avoid it at all cost. Here is why.

First of all, persistent data storage is about implementation details. Sharing data storage across services exposes the implementation details of one service to the entire system. If that service changes the format of the data, or adds caching layers, or switches to different types of databases, many other services have to be changed accordingly as well. This violates the principle of loose coupling.

Secondly, persistent data storage is not service behaviors, i.e., how to modify, interpret and use the data. If we share data storage across services, it means other services also have to replicate service behaviors. This violates the principle of high cohesion — behaviors in a given domain are leaked to multiple services. If we modify one behavior, we will have to modify all of these services together.

In microservice architecture, only one service should be responsible for a specific type of data. All the other services should either request the data through the API of the responsible service or keep a read-only non-canonical (maybe materialized) copy of the data.

This may sound abstract, so here is a concrete example. Say we are building a new recommendation service and it needs some data from the canonical post table, currently in AWS DynamoDB. We could make the post data available for the new recommendation service in one of two ways.

In the monolithic storage model, the recommendation service has direct access to the same persistent storage that the monolithic app does. This is a bad idea because:

  • Caching can be tricky. If the recommendation service shares the same cache as the monolithic app, we will have to duplicate the cache implementation details in the recommendation service as well; if the recommendation service uses its own cache, we won’t know when to invalidate its cache when the monolithic app updates the post data.
  • If the monolithic app decides to change to use RDS instead of DynamoDB to store post data, we will have to reimplement the logic in the recommendation service and all other services that access the post data as well.
  • The monolithic app has complex logic to interpret the post data, e.g., how to decide if a post should not be viewable to a given user. We have to reimplement those logics in the recommendation service. Once the monolithic app changes or adds new logics, we need to make the same changes everywhere as well.
  • The recommendation service is stuck with DynamoDB even if it is the wrong option for its own data access pattern.

In the decoupled storage model, the recommendation service does not have direct access to the post data, neither do any other new services. The implementation details of post data are retained in only one service. There are different ways of achieving this.

Ideally, there should be a Post Service that owns the post data and other services can only access post data through the Post Service’s APIs. However, it could be an expensive upfront investment to build new services for all core data models.

There are a couple of more pragmatic ways when staffing is limited. They could be actually better ways depending on the data access pattern. In Option B, the monolithic app lets the recommendation services know when relevant post data is updated. Usually, this doesn’t have to happen immediately, so we can offload it to the queuing system. In Option C, an ETL pipeline generates a read-only copy of the post data for the recommendation service, plus potentially other data that is useful for recommendations as well. In both options, the recommendation service owns its data completely, so it has the flexibility to cache the data or use whatever database technologies that fit the best.

Decouple “Building a Service” and “Running Services”

If building microservices is hard, running services is often even harder. It slows the engineering teams down when running services is coupled with building each service and teams have to keep reinventing the ways of doing it. We want to let each service focus on its own work and not worry about the complex matter of how to run services, including networking, communication protocols, deployment, observability, etc. The service management should be completely decoupled from each individual service’s implementation.

The strategy of decoupling “building a service” and “running services” is to make running-services tasks service-technology-agnostic and opinionated, so that app engineers can fully focus on each service’s own business logic.

Thanks to the recent technology advancements in containerization, container-orchestration, service mesh, application performance monitoring, etc, the decoupling of “running service” becomes more achievable than ever.

Networking. Networking (e.g., service discovery, routing, load balancing, traffic routing, etc) is a critical part of running services. The traditional approach is to provide libraries for every platform/language. It works but is not ideal because applications still need a non-trivial amount of work to integrate and maintain the libraries. More often than not, applications still need to implement some of the logic separately. The modern solution is to run services in a Service Mesh. At Medium, we use Istio and Envoy as sidecar proxy. Application engineers who build services don’t need to worry about the networking at all.

Communication Protocol. No matter which tech stacks or languages you choose to build microservices, it is extremely important to start with a mature RPC solution that is efficient, typed, cross-platform and requires the minimum amount of development overhead. RPC solutions that support backward-compatibility also make it safer to deploy services even with dependencies among them. At Medium, we chose gRPC.

A common alternative is REST+JSON over HTTP, which has been the blessed solution for server communication for a long time. However, although that stack is great for the browsers to talk to servers, it is inefficient for server-to-server communication, especially when we need to send a large number of requests. Without automatically generated stubs and boilerplate code, we will have to manually implement the server/client code. Reliable RPC implementation is more than just wrapping a network client. In addition, REST is “opinionated”, but it can be difficult to always get everyone to agree on every detail, e.g., is this call really REST, or just an RPC? Is this thing a resource or is it an operation? etc.

Deployment. Having a consistent way to build, test, package, deploy and manage services is very important. All of Medium’s microservices run in containers. Currently, our orchestration system is a mix of AWS ECS and Kubernetes, but moving towards Kubernetes only.

We built our own system to build, test, package and deploy services, called BBFD. It strikes a balance between working consistently across services and giving individual service the flexibility of adopting different technology stack. The way it works is it lets each service provide the basic information, e.g., the port to listen to, the commands to build/test/start the service, etc., and BBFD will take care of the rest.

Thorough and Consistent Observability

Observability includes the processes, conventions, and tooling that let us understand how the system is working and triage issues when it isn’t working. Observability includes logging, performance tracking, metrics, dashboards, alerting, and is super critical for the microservice architecture to succeed.

When we move from one single service to a distributed system with many services, two things can happen:

  1. We lose observability because it becomes harder to do or easier to be overlooked.
  2. Different teams reinvent the wheel and we end up with fragmented observability, which is essentially low observability because it is hard to use fragmented data to connect the dots or triage any issues.

It is very important to have good and consistent observability from the beginning, so our DevOps team came up with a strategy for consistent observability and built tools in support of achieving that. Every service gets detailed DataDog dashboards, alerts, and log search automatically, which are also consistent across all services. We also heavily use LightStep to understand the performance of the systems.

Not Every New Service Needs to be Built from Scratch

In microservice architecture, each service does one thing and does it really well. Notice that it has nothing to do with how to build a service. If you migrate from a monolithic service, keep in mind that a microservice doesn’t always have to be built from scratch if you can peel it off from the monolithic app.

Here we take a pragmatic approach. Whether we should build a service from scratch depends on two factors: (1) how well Node.js is suited for the task and (2) how much it costs to reimplement in a different tech stack.

If Node.js is a good technical option and the existing implementation is in a good shape, we peel the code off from the monolithic app and create a microservice with it. Even with the same implementation, we will still get all the benefits of microservice architecture.

Our monolithic Node.js monolithic app was architected in a way that make it relatively easy for us to build separate services with the existing implementation. We will discuss how to properly architect a monolithic later in this post.

Respect Failures Because They Will Happen

In a distributed environment, more things can fail, and they will. Failures of mission-critical services, when not handled well, could be catastrophic. We should always think about how to test failures and gracefully handle failures.

  • First and foremost, we should expect everything will fail at some point.
  • For RPC calls, put extra effort to handle failure cases.
  • Make sure we have good observability (mentioned above) to failures when they happen.
  • Always test failures when bringing a new service online. It should be part of the new service check-list.
  • Build auto-recovery if possible.

Avoid Microservice Syndromes from Day One

Microservice is not a panacea — it solves some problems, but creates some others, which we call “microservice syndromes”. If we don’t think about them from day one, things can get messy fast and it costs more if we take care of them later. Here are some of the common symptoms.

  • Poorly modeled microservices cause more harm than good, especially when you have more than a couple of them.
  • Allow too many different choices of languages/technology, which increase the operational cost and fragment the engineering organization.
  • Couple running services with building services, which dramatically increases the complexity of each service and slow the team down.
  • Overlook data modeling and end up with microservices with monolithic data storage.
  • Lack of observability, which makes it difficult to triage performance issues or failures.
  • When facing a problem, teams tend to create a new service instead of fixing the existing one even though the latter may be a better option.
  • Even though the services are loosely coupled, lack of a holistic picture of the whole system could be problematic.

Source link