Ashish Singh | Pinterest Engineer, Data Engineering
Pinterest is a data-driven company, and many of its critical business decisions are based on insights from data. These insights are powered by the Big Data Platform team, which enables others within the company to process petabytes of data to find answers to their questions.
Analysis of data is a critical function at Pinterest, not just to answer business questions but also to debug engineering issues, prioritize features, identify the most common issues faced by users, and see usage trends. As such, these analytics capabilities are needed by engineers and non-engineers alike. SQL and its variants have proven to provide a level playing field for employees to express their computational or analytical needs effectively. SQL also provides a great abstraction between user code or queries and the underlying compute infrastructure, enabling the infrastructure to evolve without affecting users.
To meet employees’ critical need for interactive querying, we’ve worked with Presto, an open-source distributed SQL query engine, over the years. Operating Presto at Pinterest’s scale has involved resolving quite a few challenges. In this post, we share our journey.
Figure 1 below gives an overview of the Presto deployment at Pinterest. Our infrastructure is built on top of Amazon Web Services (AWS) EC2, and we leverage AWS S3 for storing our data. This separates the compute and storage layers and allows multiple compute clusters to share the S3 data. We have multiple Presto clusters that serve different use cases. These clusters can be long-lived or short-lived. The two major ones are the ad-hoc and scheduled clusters: the former serves ad-hoc queries and the latter serves scheduled queries. Keeping ad-hoc queries separate from scheduled queries enables us to provide better SLAs for scheduled queries, and also makes resource demand on the scheduled cluster more predictable.
Until 2016, Pinterest’s analytical needs were served by a more conventional data warehouse that didn’t scale with Pinterest’s data size; it was then replaced by Presto. Running Presto at Pinterest’s scale came with its own challenges. In the early days of onboarding Presto, we frequently saw issues including Presto coordinator crashes and clusters getting stuck with close to zero worker parallelism. Later in this post, we explain the reasons for these issues and discuss how we solved them.
We have hundreds of petabytes of data and tens of thousands of Hive tables. Our Presto clusters consist of a fleet of 450 r4.8xlarge EC2 instances. Together, the Presto clusters have over 100 TB of memory and 14K vCPU cores. Within Pinterest, we have more than 1,000 monthly active users (out of a total of 1,600+ Pinterest employees) using Presto, who run about 400K queries on these clusters per month.
Presto is well known for its ability to query various systems; however, only the Hive connector is currently used at Pinterest. Hive and Presto share the same Hive Metastore Service. It’s very common for our users to use Hive for writing data and Presto for read-only analysis. In addition, we recently started allowing Presto to create tables and insert data, primarily for the following reasons.
- Capability to run big queries: We limit queries on Presto by their runtime and the amount of data they process. Write support provides an alternative way to run big queries by breaking them into smaller queries. Each small query can read from the previous queries’ output and write to an intermediate table, which is then consumed by the next query. This is a better approach to dealing with big queries, as it provides easy debuggability, modularity, sharing and checkpointing. If one sub-query fails, only that sub-query and the subsequent sub-queries need to be re-run rather than the entire big query, which saves time, resources and money.
- Supporting workflows: Impressed by Presto’s processing speed, users have been asking for support for defining workflows on top of Presto. With only read capability, Presto could either have served only at the end of a flow, providing the final output, or its output would have had to be brought into the memory of the workflow system and then passed to the next job or execution. Both of these approaches would have been very limiting. With write support, Presto can easily be used within a flow.
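The chained, checkpointed sub-query pattern described above can be sketched roughly as follows. This is an illustrative sketch only, not Pinterest’s tooling: `run_pipeline`, the `run_query` callable, and the table names are all hypothetical, and each step is assumed to materialize its output with a `CREATE TABLE ... AS` statement.

```python
from typing import Callable, List, Set, Tuple

# Each step is (output_table, select_sql). Names and SQL are hypothetical.
Step = Tuple[str, str]


def run_pipeline(steps: List[Step],
                 run_query: Callable[[str], None],
                 completed: Set[str]) -> List[str]:
    """Run sub-queries in order, skipping steps whose output already exists.

    `completed` acts as the checkpoint: it is updated in place, so calling
    this function again after a failure resumes at the failed step.
    Returns the tables written during this call.
    """
    written = []
    for table, select_sql in steps:
        if table in completed:
            continue  # output from an earlier run is reused
        # An exception from run_query propagates and leaves `completed` intact.
        run_query(f"CREATE TABLE {table} AS {select_sql}")
        completed.add(table)
        written.append(table)
    return written
```

If the second of three steps fails, a retry with the same `completed` set re-runs only the second and third steps, which is the saving the list above describes.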
Each Presto cluster at Pinterest has workers on a mix of dedicated AWS EC2 instances and Kubernetes pods. The Presto deployment at Pinterest should look very similar to any large-scale Presto deployment. There are a couple of in-house pieces, namely Presto Controller and Presto Gateway, that we talk about in the next subsections.
Presto Controller is a service built in-house that is critical to our Presto deployments. The following are some of the major functions served by the controller today.
- Health check
- Slow worker detection
- Heavy query detection
- Rolling restarts of Presto clusters
- Scaling clusters
Presto Gateway is a service that sits between clients and Presto clusters. It is essentially a smart HTTP proxy server. We got a head start on this by using Lyft’s [Presto-Gateway](https://github.com/lyft/presto-gateway). Since then, we’ve added a lot of functionality on top of it, and we plan on contributing it back to Lyft’s version. This service makes clients agnostic of specific Presto clusters and enables the following uses. Some of these features are in active development, and we are slowly moving all of our clients from talking to specific clusters to talking to Presto Gateway.
- Rules-based routing of queries
- Resource usage limits and current usage visibility for users
- Overall Presto clusters’ health visibility
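As a rough illustration of rules-based routing, the sketch below picks a target cluster with a first-match-wins rule list. It is not the gateway’s actual code or configuration format: `QueryInfo`, `Rule`, `route`, and the `source` values are hypothetical, and only the ad-hoc and scheduled cluster roles come from this post.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QueryInfo:
    user: str
    source: str  # hypothetical: the client tool that submitted the query


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[QueryInfo], bool]
    cluster: str


def route(query: QueryInfo, rules: List[Rule], default_cluster: str) -> str:
    """Return the cluster of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(query):
            return rule.cluster
    return default_cluster


# Example rule set: workflow-submitted queries go to the scheduled cluster.
RULES = [Rule("scheduled-queries", lambda q: q.source == "workflow", "scheduled")]
```

With these rules, `route(QueryInfo("alice", "jupyter"), RULES, "adhoc")` falls through to the default ad-hoc cluster, so clients never need to know cluster names.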
Each query submitted to a Presto cluster is logged to a Kafka topic via Singer. Singer is a logging agent built at Pinterest, and we talked about it in a previous post. Each query is logged when it is submitted and when it finishes. When a Presto cluster crashes, we have query-submitted events without corresponding query-finished events. These events enable us to capture the effect of cluster crashes over time. JMX and host OS metrics are logged to OpenTSDB via tcollector, which runs on all Pinterest hosts. Using metrics from OpenTSDB, real-time Presto dashboards are published on Statsboard (the metrics monitoring UI at Pinterest). This has been handy for debugging service issues. Statsboard also has an alerting mechanism that is tied to PagerDuty.
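Pairing submitted and finished events to find queries lost in a crash could look roughly like the sketch below. The event shape `(query_id, kind, timestamp)` is an assumption made for illustration and is not the actual log schema. Queries that are still running also lack a finished event, so a caller would typically pass only events from before a known crash.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (query_id, kind, timestamp); kind is "submitted" or "finished". Hypothetical shape.
Event = Tuple[str, str, float]


def unfinished_queries(events: Iterable[Event]) -> List[str]:
    """Return ids of queries with a submitted event but no finished event,
    ordered by submission time."""
    submitted: Dict[str, float] = {}
    finished: Set[str] = set()
    for query_id, kind, timestamp in events:
        if kind == "submitted":
            submitted[query_id] = timestamp
        elif kind == "finished":
            finished.add(query_id)
    orphans = [q for q in submitted if q not in finished]
    return sorted(orphans, key=lambda q: submitted[q])
```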
There are a few options for users to interact with Presto. The most common ones are DataHub (an in-house web UI tool), Jupyter, and Tableau. In addition, there are quite a few custom tools that are powered by Presto.
To scale Presto usage at Pinterest, we carefully decided which pain points to prioritize. We used data collected from Presto clusters and Presto query logs to derive informative metrics. Below are a few of the questions those metrics answer.
- Which tables are slow while reading?
- Which queries when run together can crash or stall the cluster?
- Which users/teams are running long queries?
- What is a good threshold for a config?
- P90 and P99 query runtimes?
- Query success rates?
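Two of the metrics above, percentile runtimes and success rate, can be derived from query logs along these lines. This is a minimal sketch that assumes runtimes in seconds and a final state string per query; it uses the nearest-rank percentile definition, which is one of several common definitions and not necessarily the one used at Pinterest.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the values are less than or equal to it."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def success_rate(states: Sequence[str], ok_state: str = "FINISHED") -> float:
    """Fraction of queries whose final state equals ok_state."""
    if not states:
        raise ValueError("success rate of empty sequence")
    return sum(1 for s in states if s == ok_state) / len(states)
```

For example, `percentile(runtimes, 90)` and `percentile(runtimes, 99)` give the P90 and P99 runtimes for a list of per-query runtimes.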
Deeply nested and huge thrift schemas
The coordinator in a Presto cluster is critical to the entire cluster’s operation. As such, it’s also a single point of failure. Until mid last year, our Presto version was based on open source Presto version 0.182. Since then, many bug fixes and improvements have been made to the coordinator to help it better cope with its critical responsibilities. However, even with these improvements, we found that our Presto clusters’ coordinators would get stuck or even crash with out-of-memory (OOM) errors.
One of the most common reasons for crashes was very large and very deeply nested Thrift schemas, which are very common among our Hive tables. For instance, one popular and commonly used large Thrift schema has over 12 million primitives and a depth of 41 levels. This schema, when serialized to a string, takes over 282 MB. We have close to 500 Hive tables with over 100K primitives in their schemas.
In Presto, it’s the responsibility of the coordinator to fetch the schemas of tables from the Hive Metastore for the Hive catalog and then serialize each schema into every task request it sends to workers. This design keeps the Hive Metastore service from getting bombarded with hundreds of simultaneous requests from Presto workers. However, it has an adverse effect on coordinator memory and network usage when schemas are very large.
Fortunately, our problem with large and deeply nested schemas is limited to tables using Thrift schemas. In our deployments, a Thrift schema Java archive (jar) file is created, put on the classpath of the coordinator and of each worker in a Presto cluster, and loaded at service start time. A new jar with updated schemas is created and reloaded during the daily service restart. This enabled us to remove Thrift schemas from task requests completely: instead, only a Thrift class name is passed as part of the request, which has made the Presto coordinator far more stable in our deployments.
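The idea of sending a class name instead of the serialized schema can be illustrated with the toy sketch below. It is written in Python for brevity and is not Presto code: the `LOCAL_SCHEMAS` dictionary stands in for classes loaded from the Thrift schema jar at startup, and the class name, request fields, and function names are all hypothetical.

```python
import json
from typing import Dict

# Stand-in for schemas loaded from the Thrift schema jar at service start.
# The class name and fields are hypothetical.
LOCAL_SCHEMAS: Dict[str, dict] = {
    "com.example.thrift.PinEvent": {
        "fields": {"id": "i64", "user": {"name": "string", "country": "string"}},
    },
}


def task_request_with_schema(table: str, thrift_class: str,
                             schemas: Dict[str, dict]) -> str:
    """Original design: the whole schema travels in every task request."""
    return json.dumps({"table": table, "schema": schemas[thrift_class]})


def task_request_with_class_name(table: str, thrift_class: str) -> str:
    """Changed design: only the class name travels in the request."""
    return json.dumps({"table": table, "thrift_class": thrift_class})


def resolve_schema(request: str, schemas: Dict[str, dict]) -> dict:
    """Worker side: look the schema up locally by class name."""
    return schemas[json.loads(request)["thrift_class"]]
```

The request built by `task_request_with_class_name` stays the same size no matter how large the schema grows, which is where the relief on coordinator memory and network comes from.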
Presto gains part of its efficiency and speed from the fact that it always has JVMs up and ready to start running tasks on workers. A single JVM on a Presto worker is shared by multiple tasks from multiple queries. This sharing often used to lead to a heavy query slowing down all other queries on a cluster. Enforcing memory constraints with resource groups, which limit the amount of memory a query can consume at a given time on a cluster, went a long way toward resolving these issues in a highly multi-tenant cluster. However, we still used to see clusters coming to a standstill. Queries would get stuck, worker parallelism would drop to zero and stay there for a long time, communication errors would start popping up and queries would start timing out.
Presto uses a multilevel feedback queue to ensure slow tasks aren’t slowing down all tasks on a worker. This can lead to a worker accumulating a lot of slow tasks over time, as quick tasks are prioritized and finish quickly. Slow IO tasks can also accumulate on a worker. As mentioned, all our data sits in AWS S3, and S3 can throttle requests if a prefix is being hit hard, which can further slow down tasks. If a worker is slow or stuck, the slowness gradually spreads through the Presto cluster: other workers waiting on pages from the slow worker slow down and in turn pass the slowness on to other workers.
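To make the accumulation effect concrete, here is a simplified, textbook-style multilevel feedback queue. It is a sketch for illustration and not Presto’s scheduler: the level thresholds, the strict priority between levels, and all names are assumptions.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Sequence


class MultilevelFeedbackQueue:
    """Tasks start at level 0 and are demoted as they consume more time.
    take() always serves the lowest-numbered non-empty level first."""

    def __init__(self, thresholds: Sequence[float] = (0.0, 1.0, 10.0)) -> None:
        # thresholds[i] is the accumulated seconds at which a task enters level i.
        self.thresholds = list(thresholds)
        self.levels: List[Deque[str]] = [deque() for _ in self.thresholds]
        self.used_seconds: Dict[str, float] = {}

    def _level_for(self, used: float) -> int:
        level = 0
        for index, threshold in enumerate(self.thresholds):
            if used >= threshold:
                level = index
        return level

    def add(self, task: str, used: float = 0.0) -> None:
        self.used_seconds[task] = used
        self.levels[self._level_for(used)].append(task)

    def take(self) -> Optional[str]:
        for queue in self.levels:
            if queue:
                return queue.popleft()
        return None

    def requeue(self, task: str, ran_for: float) -> None:
        """Put a task back after it ran for ran_for seconds without finishing."""
        self.add(task, self.used_seconds[task] + ran_for)
```

Under this scheme a task that has already run for several seconds keeps yielding to newly arrived quick tasks, so long-running tasks pile up in the lower levels of a busy worker.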
Solving this problem requires a good detection mechanism and a fair resolution mechanism. We settled on the following checks to detect workers that are getting slow.
- Check if a worker’s CPU utilization is lower than the cluster’s average CPU utilization by more than a threshold, and this difference is sustained over some time.
- Check if the number of queries failing with internal errors that indicate a failure while talking to a worker exceeds a threshold over some time.
- Check if a worker has had more open file descriptors than a threshold for more than some time.
Once a worker matches any of the above criteria, Presto Controller marks the worker for shutdown. A graceful shutdown is attempted first; however, if the worker fails to shut down gracefully after a few attempts, the controller forcibly terminates the EC2 instance for dedicated workers or shuts down the Kubernetes pod hosting the worker.
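The detection and escalation logic can be sketched as below. This is a rough illustration under stated assumptions, not Presto Controller’s implementation: it covers only the CPU and file-descriptor checks, assumes all workers are sampled at the same times, and every name, threshold, and callback is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class WorkerSample:
    cpu_util: float  # 0.0 to 1.0
    open_fds: int


def slow_workers(history: Dict[str, List[WorkerSample]], cpu_gap: float,
                 max_fds: int, sustained: int) -> List[str]:
    """Flag workers that breached a check in each of the last `sustained` samples.

    `history` maps worker id to samples, oldest first.
    """
    if sustained < 1:
        raise ValueError("sustained must be at least 1")
    recent = {w: s[-sustained:] for w, s in history.items()}
    if not recent or any(len(s) < sustained for s in recent.values()):
        return []  # not enough data to call anything sustained
    averages = [sum(s[i].cpu_util for s in recent.values()) / len(recent)
                for i in range(sustained)]
    flagged = []
    for worker, samples in recent.items():
        low_cpu = all(avg - s.cpu_util > cpu_gap
                      for avg, s in zip(averages, samples))
        many_fds = all(s.open_fds > max_fds for s in samples)
        if low_cpu or many_fds:
            flagged.append(worker)
    return sorted(flagged)


def retire_worker(try_graceful: Callable[[], bool],
                  force_terminate: Callable[[], None],
                  max_attempts: int = 3) -> str:
    """Attempt graceful shutdown a few times, then fall back to termination."""
    for _ in range(max_attempts):
        if try_graceful():
            return "graceful"
    force_terminate()
    return "forced"
```

Requiring every sample in the window to breach the threshold is what keeps a momentary CPU dip from retiring a healthy worker.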
As shown in Figure 1, we have multiple Presto clusters at Pinterest. To efficiently utilize the available resources across all Presto clusters, either a new query should be sent to an under-utilized cluster, or resources from an under-utilized cluster should be moved to the cluster where the query is going to run. The former would be easier; however, at Pinterest, different Presto clusters have different access patterns and characteristics. Some clusters are tuned for the very specific types of queries or use cases that run on them. For instance, running ad-hoc queries on the scheduled cluster, which is meant to run only scheduled queries, would interfere with the analysis of the scheduled cluster’s usage patterns and could also adversely affect the queries on the cluster. This interaction between queries is why we prefer moving resources from under-utilized clusters to over-utilized clusters.
Moving a dedicated EC2 instance from one cluster to another would have required us to terminate and re-provision the instance. This process can easily take around ten minutes or more. Ten minutes in the Presto world, where our P90 query latency is less than five minutes, is a long time. In contrast, the Kubernetes platform gives us the capability to add and remove workers from a Presto cluster very quickly. The best-case latency for bringing up a new worker on Kubernetes is less than a minute. However, when the Kubernetes cluster itself is out of resources and needs to scale up, it can take up to ten minutes. Another advantage of deploying on the Kubernetes platform is that our Presto deployment becomes agnostic of cloud vendor, instance type, OS, etc.
The Presto Controller service is responsible for adding and removing workers on Kubernetes. Today, we have a static count of workers on Kubernetes for each cluster. However, we plan to soon auto-scale clusters based on current demand as well as historical demand trends on these clusters.
Each night, we restart all Presto clusters to load updated configuration parameters, Thrift schemas, custom Hive Serializer/Deserializers (SerDes) and User Defined Functions (UDFs).
The ability to shut down a service without affecting any running tasks (usually referred to as graceful shutdown) is an important aspect of a service. In open source Presto, there is no way to initiate a graceful shutdown of a cluster. Presto operators at various organizations handle graceful shutdowns by controlling traffic to the clusters. We’re starting to do the same at Pinterest with Presto Gateway. However, some clients currently talk to a specific Presto cluster and are affected by ungraceful cluster shutdowns. Even with Presto Gateway, we’ll still have some clients that continue to talk to specific Presto clusters without going through the gateway, either for security reasons or because there is just one cluster serving a specific use case.
In Presto, one can perform a graceful shutdown of a worker. However, that alone is not sufficient to ensure a graceful shutdown of an entire cluster. We added a graceful shutdown capability to the Presto coordinator to make this possible. When a graceful cluster shutdown is initiated, a shutdown request is sent to the coordinator of the cluster. On receiving the request, the coordinator, similar to Presto workers, changes its state to SHUTTING_DOWN. In this state, the coordinator does not accept any new queries and waits for all existing queries to finish before shutting itself down. It responds to any new query with an error, informing the client that the cluster is shutting down and asking it to retry after some time, usually around the maximum allowed query runtime. This fail-fast behavior with an informative message is much better than the previous behavior, where clients saw abrupt failures that prompted them to simply retry their queries, only to see those failures again. In the future, we plan to add a capability to reload jars without restarting processes and to make some configuration parameters dynamic, to get away from the need to restart clusters frequently.
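The coordinator behavior described above amounts to a small state machine, sketched here in Python for illustration. It is not the actual coordinator change: only the SHUTTING_DOWN state name comes from the description above, while the `Coordinator` class, the `ACTIVE` and `STOPPED` states, the exception type, and the message wording are assumptions.

```python
from typing import Set


class ClusterShuttingDown(Exception):
    """Raised to fail fast on queries submitted during shutdown."""


class Coordinator:
    def __init__(self, max_query_runtime_s: int) -> None:
        self.state = "ACTIVE"
        self.max_query_runtime_s = max_query_runtime_s
        self.running: Set[str] = set()

    def submit(self, query_id: str) -> None:
        if self.state != "ACTIVE":
            raise ClusterShuttingDown(
                "Cluster is shutting down; retry in about "
                f"{self.max_query_runtime_s} seconds")
        self.running.add(query_id)

    def finish(self, query_id: str) -> None:
        self.running.discard(query_id)
        self._stop_if_drained()

    def request_graceful_shutdown(self) -> None:
        self.state = "SHUTTING_DOWN"
        self._stop_if_drained()

    def _stop_if_drained(self) -> None:
        # Stop only once shutdown was requested and no query is still running.
        if self.state == "SHUTTING_DOWN" and not self.running:
            self.state = "STOPPED"
```

Queries accepted before the shutdown request run to completion, while anything submitted afterwards gets an immediate, explanatory error instead of an abrupt failure.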
As shown in Figure 1, we have various clients connecting to Presto clusters. Some of these are services that allow users to run queries. For resourcing and accounting purposes, these services are required to impersonate each user on whose behalf they run a query. This is possible out of the box on Presto if Kerberos authentication is used. We use LDAP authentication, which has no way of letting connecting services impersonate users while restricting impersonation to allowed services only. We added impersonation support to the LDAP authenticator, which takes a configurable whitelist of services that can perform impersonation.
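The whitelist check at the heart of this can be sketched as follows. This is an illustrative sketch, not the authenticator’s code: the function name, the idea of a separately requested user, and the service names are assumptions, and the LDAP authentication step itself is assumed to have already succeeded.

```python
from typing import Optional, Set


def effective_user(authenticated_principal: str,
                   requested_user: Optional[str],
                   allowed_impersonators: Set[str]) -> str:
    """Return the user a query should run as.

    Only principals on the configured whitelist may run queries
    on behalf of a different user.
    """
    if requested_user is None or requested_user == authenticated_principal:
        return authenticated_principal
    if authenticated_principal in allowed_impersonators:
        return requested_user
    raise PermissionError(
        f"{authenticated_principal} is not allowed to impersonate {requested_user}")
```

A whitelisted service such as a query UI resolves to the end user for accounting, while an ordinary user asking to run as someone else is rejected.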
Presto is very widely used and has played a key role in enabling analysis at Pinterest. As one of the most popular interactive SQL query engines, Presto is evolving very fast. Recent versions of Presto have a lot of stability and scalability improvements. However, at Pinterest’s scale, we’ve had to resolve a few issues along the way to operate and use Presto successfully, such as graceful cluster shutdown, handling of large, deeply nested Thrift schemas, impersonation support in the LDAP authenticator, slow worker detection and auto-scaling of workers. Some of these can benefit the community too, and we plan on contributing them back.
In the future, we want to continue making reliability, scalability, usability and performance improvements, such as rolling restarts, reloading of jars without needing to restart clusters, and visibility into cluster resource utilization for users. We’re also very interested in on-demand checkpointing of tasks to enable seamless use of Amazon EC2 spot instances, and in enabling our users to get a query runtime estimate without waiting for query completion.
Acknowledgments: Huge thanks to Pucheng Yang, Lida Li and the entire Big Data Platform team, who helped improve and scale the Presto service at Pinterest.