Introducing Neuropod, Uber ATG’s Open Source Deep Learning Inference Engine


At Uber Advanced Technologies Group (ATG), we leverage deep learning to provide safe and reliable self-driving technology. Using deep learning, we can build and train models to handle tasks such as processing sensor input, identifying objects, and predicting where those objects might go.

As our self-driving software evolves, we are always looking for new ways to improve our models and, sometimes, that means experimenting with different deep learning (DL) frameworks. As new DL frameworks are released and existing frameworks like TensorFlow and PyTorch advance, we want to ensure that our engineers and scientists have flexibility to use the tools that are best suited to the problems they’re working on. 

Unfortunately, adding support for a new deep learning framework across an entire machine learning stack is resource and time-intensive. At ATG, we’ve spent a lot of time working to make that process easier. Today, we’re open sourcing these efforts by introducing Neuropod. Neuropod is an abstraction layer on top of existing deep learning frameworks that provides a uniform interface to run DL models. Neuropod makes it easy for researchers to build models in a framework of their choosing while also simplifying productionization of these models.

Using multiple deep learning frameworks

Deep learning (DL) is advancing very quickly and different DL frameworks are effective at different tasks. As a result, we’ve used several DL frameworks at Uber ATG over the last few years. In 2016, Caffe2 was our primary deep learning framework, and in early 2017 we put in a significant amount of work to integrate TensorFlow. This involved major integration hurdles with CUDA and cuDNN, conflicts between dependencies of Caffe2 and TensorFlow, library loading issues and more. In late 2017, we started developing more models in PyTorch. Productionizing those models took a lot of work as well. We saw memory corruption issues when running alongside TensorFlow and several other problems that were very difficult to debug. Then TorchScript was released and we went through a similar process again.

Figure 1. Over the years, Uber ATG has evolved its approach to machine learning, leveraging different popular deep learning frameworks.

 

Since 2018, several DL frameworks have been open-sourced including Ludwig (which was built by Uber!), JAX, Trax, Flax, Haiku, and RLax with many of these being released in the last 9 months.

Even if it’s easy for a researcher to experiment with new frameworks, adding production support for a new DL framework throughout all our systems and processes is no small task. Part of what makes this so difficult is that it requires integration and optimization work for each piece of our infrastructure and tooling.

Figure 2. It’s difficult to add support for a new framework because every piece of infrastructure that runs a model needs to support all frameworks. These infrastructure components could be metrics pipelines, model serving, or other production and testing environments.

 

In late 2018, Uber ATG started to build multiple models that solved the same problem in different ways. For instance, 3D object detection using LiDAR can be done in range view or bird’s eye view. Both of these approaches are valid and have different pros and cons. Models built by different teams were also sometimes implemented in different frameworks.

To make productionization easier, we wanted to be able to easily swap out models that solve the same problem, even if they were implemented in different frameworks.

We’d also see some cases where new research would come out with code in PyTorch and we wanted to quickly compare with existing models in TensorFlow. Since we had framework-specific, model-level metrics pipelines for each model, it was hard to do this.

To deploy a model in a new framework, we’d need to rebuild model-level metrics pipelines, integrate the framework across all of our systems and processes, and then make additional optimizations to ensure we were running models efficiently and within our latency budgets.

While these steps may seem simple, issues like the ones described above (memory corruption, dependency conflicts, etc.) caused us to spend significant effort working through integration issues instead of being able to focus on model development.

We needed a way to maximize flexibility during research without having to redo work during other parts of the process.

Introducing Neuropod

Our solution to this problem was Neuropod, an open source library that makes DL frameworks look the same when running a model. Now, adding support for a new framework across all of your tooling and infrastructure is as simple as adding it to Neuropod.

Figure 3. Neuropod is an abstraction layer that provides a uniform interface to run deep learning models from multiple frameworks.

 

Currently, Neuropod supports several frameworks including TensorFlow, PyTorch, Keras, and TorchScript, while making it easy to add new ones.

Neuropod has been instrumental in quickly deploying new models at Uber since its internal release in early 2019. Over the last year, we have deployed hundreds of Neuropod models across Uber ATG, Uber AI, and the core Uber business. These include models for demand forecasting, estimated time of arrival (ETA) prediction for rides, menu transcription for Uber Eats, and object detection models for self-driving vehicles. With Neuropod’s open source release, we hope others in the machine learning community will find it helpful as well! 

Overview

Neuropod starts with the concept of a problem definition a formal description of a “problem” for models to solve. In this context, a problem could be something like semantic segmentation of images or language translation of text. By formally defining a problem, we can treat it as an interface and abstract away the concrete implementations. Every Neuropod model implements a problem definition. As a result, any models that solve the same problem are interchangeable, even if they use different frameworks.

Neuropod works by wrapping existing models in a neuropod package (or “a neuropod” for short). This package contains the original model along with metadata, test data, and custom ops (if any).

With Neuropod, any model can be executed from any supported language. For example, if a user wants to run a PyTorch model from C++, Neuropod will spin up a Python interpreter under the hood and communicate with it to run the model. This is necessary because PyTorch models require a python interpreter to run. This capability lets us quickly test PyTorch models before putting in effort to convert them to TorchScript, which can be run natively from C++. 

Neuropod currently supports running models from Python and C++. However, it’s straightforward to write additional language bindings for the library. For example, Michelangelo, Uber’s ML platform, uses Neuropod as its core DL model format and implemented Go bindings to run their production models from Go.

Figure 4. Neuropod features a framework-specific packaging API and a framework agnostic inference API.

 

Inference overview

Before we dig into how Neuropod works, let’s look at how deep learning models are traditionally integrated into applications:

Figure 5. Normally, applications directly interact with a deep learning framework throughout the inference process.

 

In the figure above, the application directly interacts with the TensorFlow API at all parts of the inference process.

With Neuropod, the application only interacts with framework-agnostic APIs (everything in purple below) and Neuropod translates these framework-agnostic calls into calls to the underlying framework. We do this efficiently, using zero-copy operations whenever possible. See the “Optimize” section below for more details.

Figure 6. With Neuropod, applications interact with a framework-agnostic API and Neuropod interacts with the underlying framework.

 

Neuropod has a pluggable backend layer and every supported framework has its own implementation. This makes it straightforward to add new frameworks to Neuropod.

Deep learning with Neuropod

Let’s take a look at the overall deep learning process when using Neuropod to see how it helps make experimentation, deployment, and iteration easier.

Problem definition

To package a neuropod, we must first create a problem definition. As mentioned above, this is a canonical description of the inputs and outputs of the problem we’re trying to solve. This definition includes the names, data types, and shapes of all the input and output tensors. For example, the problem definition for 2D object detection may look something like this:

Note that the above definition uses “symbols” in the shape definitions (num_classes and num_detections). Every instance of a symbol must resolve to the same value at runtime. This provides a more robust way of constraining shapes than just setting shape elements to None. In the above example, num_detections must be the same across boxes and object_class_probability.

For the sake of simplicity, we’re going to use a simpler problem in this post: addition.

The code snippet above also defines test input and output data which we’ll discuss below.

Generate a placeholder model

Once we’ve defined a problem, we can use Neuropod to automatically generate a placeholder model that implements the problem specification. This allows us to start integration without having an actual model yet.

The generated model takes in the inputs described in the problem specification and returns random data that matches the output specification.

Build a model

After we’ve established a problem definition (and optionally generated a placeholder model), we can build our model. We go through all our normal steps of building and training a model, but now we add a Neuropod export step to the end of the process:

In the above code snippet, we export the model as a neuropod along with optional test data. If test data is provided, the library will run a self-test on the model immediately after export.

The options for create_tensorflow_neuropod  and all the other packagers are well documented.

Build a metrics pipeline

Now that we have our model, we can build a metrics pipeline for this problem in Python. The only difference from how we’d do this without Neuropod is that instead of using framework specific APIs, we now use the Neuropod Python library to run the model.

The Neuropod documentation contains more details about load_neuropod and infer.

Any metrics pipelines built with Neuropod only have to be built once per problem, and from there, become the source of truth when comparing models (even across frameworks).

Integrate

Now, we can integrate our model into a production C++ system. The example below shows very simplistic usage of the Neuropod C++ API, but the library also supports more sophisticated usage, allowing for efficient, zero-copy operations and wrapping existing memory. Please refer to the Neuropod documentation for more details.

Unlike the integration process without Neuropod, this step has to happen only once per problem, not once per framework. Users don’t need to figure out the intricacies of the TensorFlow or Torch C++ API, but can still provide a lot of flexibility when researchers decide what framework they want to use.

Also, because the core Neuropod library is in C++, we can write bindings for various other languages (including Go, Java, etc.).

Optimize

At Uber ATG, we have pretty strict latency requirements so there are zero-copy paths for many critical operations. We put a lot of work into profiling and optimization and now Neuropod can become a central place to implement inference optimizations that apply to all models.

As part of this work, every Neuropod commit is tested on the following platforms in our Continuous Integration (CI) pipelines:

    • Mac, Linux, Linux (GPU)
    • 5 versions of Python
    • 5 versions of each supported deep learning framework

Check out our documentation for an up to date list of supported platforms and frameworks.

Neuropod also includes a way to run models in worker processes using a high performance shared-memory bridge. This lets us isolate models from each other without introducing a significant latency penalty. We’ll go over this in more detail at the end of the post. 

Iterate

Once we’ve built and integrated the first version of our model, we can iterate and make improvements to our solution.

As part of this process, if we want to try a TorchScript model in place of the TensorFlow model we created above, it’s a drop-in replacement.

Without Neuropod, we would need to redo a lot of the previous steps. With Neuropod, any models that implement the same problem specification are interchangeable. We can reuse the metrics pipeline we created earlier along with all the integration work we did.

All systems and processes that run a model become framework agnostic, providing much more flexibility when building a model. Neuropod lets users focus on the problem they’re trying to solve rather than the technology they’re using to solve it.

Problem-agnostic tools

Although Neuropod starts by focusing on a “problem” (like 2D object detection given an image, sentiment analysis of text, etc), we can build tooling on top of Neuropod that is both framework-agnostic and problem-agnostic. This lets us build general infrastructure that can work with any model.

Canonical input-building pipelines

An interesting problem-agnostic use case for Neuropod is canonical input-building pipelines. At ATG, we have a defined format/specification for how we represent our input data in tensors. This covers sensor data including LiDAR, radar, and camera images along with other information like high-resolution maps. Defining this standard format makes it easy to manage extremely large datasets on the training side. This also lets us quickly build new models as many of our models use a subset of this input data (e.g. a model that operates on just camera and LiDAR).

By combining this general input format with Neuropod, we can build a single optimized input building pipeline that is used by all of our models regardless of what framework they’re implemented in.

The input builder is problem agnostic as long as every model uses a subset of the same set of features. Any optimizations in the input builder help improve all of our models.

Figure 7. Neuropod lets us build a single, optimized input-building pipeline that works with many models instead of building a separate one for every model or every framework.

 

Model serving

Another useful piece of problem-agnostic infrastructure is model serving. At Uber ATG, some of our models are fairly heavyweight in the sense that they take large inputs and the models themselves are large. For some of these models, it isn’t feasible to run them on CPU in tasks like offline simulation. Including GPUs in all our cluster machines doesn’t make sense from a resource efficiency perspective so instead, we have a service that lets users run models on remote GPUs. This is quite similar to gRPC-based model serving.

Without Neuropod, a model serving platform would need to be good at running Keras remotely, TensorFlow remotely, PyTorch remotely, TorchScript remotely, etc. It might need to implement serialization, deserialization, interactions with the C++ API, and optimizations for each framework. All applications would also need to be good at running models from all those frameworks locally. 

Because running locally and remotely have different implementations, the overall system needs to care about 2 * # of frameworks cases.

However, by using Neuropod, the model serving service can get really good at running neuropods remotely and Neuropod can get really good at running models from multiple frameworks.

By separating concerns, the number of cases the system has to care about scales additively (2 + # of frameworks) instead of multiplicatively. This difference becomes more stark as more frameworks are supported.

We can make this even more powerful, however. If we add model-serving as a Neuropod backend, any piece of infrastructure can easily run models remotely.

Figure 8. Adding model-serving as a Neuropod backend allows any application that uses Neuropod to run models remotely without significant changes.

 

Under the hood, it might look something like this:

Figure 9. Applications can use Neuropod to proxy model execution to a remote machine.

 

This solution is not problem specific or framework specific and provides even more flexibility to applications that use Neuropod.

Out-of-process execution

Figure 10. Neuropod includes support for running models in isolated worker processes using a low-latency shared-memory bridge.

 

Since Uber ATG’s inputs are generally fairly large and we’re latency sensitive, local gRPC isn’t an ideal way to isolate models. Instead, we can use an optimized shared memory based out-of-process execution (OPE) implementation that avoids copying data. This lets us isolate models from each other without introducing a significant latency penalty.

Such isolation is important because in the past we’ve seen conflicts between frameworks running in the same process including subtle CUDA bugs and memory corruptions. These issues are all really difficult to track down.

This isolation is also useful for performance when running Python models because it lets users avoid having a shared GIL across all models.

Running each model in a separate worker process also enables some additional features that are included in the “next steps” section below.

Next steps

Neuropod has allowed Uber to quickly build and deploy new deep learning models, but that’s just the start.

Some things we’re actively working on include:

      1. Version selection: This functionality gives users the ability to specify a required version range of a framework when exporting a model. For example, a model can require TensorFlow 1.13.1 and Neuropod will automatically run the model using OPE with the correct version of the framework. This enables users to use multiple frameworks and multiple versions of each framework in a single application.
      2. Seal operations: This feature enables applications to specify when they’re “done” using a tensor. Once a tensor is sealed, Neuropod can asynchronously transport the data to the correct destination (e.g. a GPU locally or to a worker process, etc.) before inference is run. This helps users parallelize data transfer and computation.
      3. Dockerized worker processes: Doing this allows us to provide even more isolation between models. For example, with this feature, even models that require different CUDA versions can run in the same application.

As we continue to expand upon Neuropod by enhancing current features and introducing new ones, we look forward to working with the open source community to improve the library.

Neuropod has been very useful across a variety of deep learning teams at Uber and we hope that it’ll be beneficial to others, too. Please give it a try and let us know what you think!

Special thanks to Yevgeni Litvin who was instrumental in making Neuropod a success.



Source link