In the machine learning community, Apache Spark is widely used for data processing due to its efficiency in SQL-style operations, while TensorFlow is one of the most popular frameworks for model training. Although there are some data formats supported by both tools, TFRecord—the data format native to TensorFlow—is not fully supported by Spark. While there have been prior attempts to bridge the gap between these two systems (Spark-Tensorflow-Connector, for example), existing implementations leave out some important features provided by Spark.
In this post, we introduce and open-source a new data source for Spark, Spark-TFRecord. Its goal is to provide full support for the native TensorFlow data format in Spark, making TFRecord a first-class citizen among Spark data sources, on par with other formats such as Avro, JSON, and Parquet. Spark-TFRecord provides not only basic functions, such as DataFrame read and write, but also advanced ones, such as PartitionBy. As a result, a smooth data processing and training pipeline built on TFRecord becomes possible.
Both TensorFlow and Spark are widely used at LinkedIn. Spark is used in many data processing and preparation pipelines. It is also the leading tool for data analytics. As more business units employ deep learning models, TensorFlow has become the mainstream modeling and serving tool. Open source TensorFlow models mainly use the TFRecord data format, while most of our internal datasets are in Avro format. In order to use open source models, we have to either change the model source code to take Avro files, or convert our datasets to TFRecord. This project facilitates the latter.
Existing projects and prior efforts
Prior to Spark-TFRecord, the most popular tool for reading and writing TFRecord in Spark was Spark-Tensorflow-Connector. It is part of the TensorFlow ecosystem and has been promoted by Databricks, the company founded by the original creators of Spark. Although it supports basic functions such as read and write, we noticed two disadvantages of its implementation for our use cases at LinkedIn. First, it is based on the RelationProvider interface, which is mainly meant for connecting Spark to a database (hence the name “connector”). In that setting, the disk read and write operations are provided by the database. However, the main use case of Spark-Tensorflow-Connector is disk I/O rather than connecting to a database. In the absence of a database, the I/O operations have to be supplied by the developers who implement the RelationProvider interface. This is why a considerable amount of code in Spark-Tensorflow-Connector is dedicated to various disk read and write scenarios.
Second, Spark-Tensorflow-Connector lacks important functions such as PartitionBy, which splits the dataset according to the values of a given column. We find this function useful at LinkedIn when we need to train a separate model for each entity, because it allows us to partition the training data by entity ID. Demand for this function runs high in the TensorFlow community as well.
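To make the per-entity split concrete, here is a minimal pure-Python sketch of what partitioning by a column produces. It is not Spark code: the function `partition_by` and the sample rows are hypothetical, and the sketch only imitates the `column=value` directory naming that Spark uses for partitioned output, where the partition value lives in the directory name rather than in the records.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows by the value of `column`, keyed by a `column=value` directory name."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition value is encoded in the directory name, so it is
        # dropped from the row itself.
        remaining = {k: v for k, v in row.items() if k != column}
        partitions[f"{column}={row[column]}"].append(remaining)
    return dict(partitions)


# Hypothetical training data: two entities, three examples in total.
training_rows = [
    {"entity_id": 7, "feature": 0.2, "label": 0},
    {"entity_id": 9, "feature": 0.8, "label": 1},
    {"entity_id": 7, "feature": 0.5, "label": 1},
]

for directory, rows in sorted(partition_by(training_rows, "entity_id").items()):
    print(directory, len(rows))
# prints:
# entity_id=7 2
# entity_id=9 1
```

With this layout, a training job for one entity only needs to read that entity's directory instead of filtering the whole dataset.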
Spark-TFRecord fills these gaps by implementing the more versatile FileFormat interface, which is also used by other native formats such as Avro and Parquet. With this interface, all the DataFrame and Dataset I/O APIs are automatically available to TFRecord, including the sought-after PartitionBy function. In addition, future Spark I/O enhancements will automatically become available through the interface.
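The design idea can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Spark's actual FileFormat trait or the Spark-TFRecord implementation: `ToyFileFormat`, `JsonLinesFormat`, `framework_write`, and `framework_read` are hypothetical names, and JSON lines stands in for TFRecord. The point it shows is that a format only has to know how to read and write a single file, while generic framework code supplies partitioning for every format that plugs in.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ToyFileFormat(ABC):
    """Hypothetical per-file interface: no knowledge of partitioning."""

    extension = ""

    @abstractmethod
    def write_file(self, path, rows): ...

    @abstractmethod
    def read_file(self, path): ...


class JsonLinesFormat(ToyFileFormat):
    """Stand-in format; a TFRecord format would plug in the same way."""

    extension = ".jsonl"

    def write_file(self, path, rows):
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def read_file(self, path):
        with open(path) as f:
            return [json.loads(line) for line in f]


def framework_write(fmt, rows, out_dir, partition_by=None):
    """Generic writer: partitioning is handled here, once, for all formats."""
    groups = {}
    for row in rows:
        if partition_by is None:
            subdir, data = "", row
        else:
            subdir = f"{partition_by}={row[partition_by]}"
            data = {k: v for k, v in row.items() if k != partition_by}
        groups.setdefault(subdir, []).append(data)
    for subdir, group in groups.items():
        target = Path(out_dir) / subdir
        target.mkdir(parents=True, exist_ok=True)
        fmt.write_file(target / f"part-00000{fmt.extension}", group)


def framework_read(fmt, in_dir):
    """Generic reader: restores partition columns from directory names."""
    rows = []
    for path in sorted(Path(in_dir).rglob(f"*{fmt.extension}")):
        extra = {}
        for part in path.relative_to(in_dir).parts[:-1]:
            col, _, value = part.partition("=")
            extra[col] = value  # values come back as strings in this toy version
        rows.extend({**row, **extra} for row in fmt.read_file(path))
    return rows


rows = [
    {"entity_id": "a", "score": 1},
    {"entity_id": "b", "score": 2},
    {"entity_id": "a", "score": 3},
]
out_dir = tempfile.mkdtemp()
framework_write(JsonLinesFormat(), rows, out_dir, partition_by="entity_id")
round_trip = framework_read(JsonLinesFormat(), out_dir)
```

In this toy version, `JsonLinesFormat` contains no partitioning logic at all, yet the written output is split into `entity_id=a` and `entity_id=b` directories and reads back with the partition column restored.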
We initially considered patching Spark-Tensorflow-Connector to obtain the PartitionBy function that we needed. But after examining its source code, we realized that RelationProvider, which Spark-Tensorflow-Connector is based on, is a Spark interface to SQL databases, which makes it unsuitable for our purpose. Unfortunately, there is no simple fix, since RelationProvider is not designed to provide disk I/O operations. Instead, we took a different route and implemented FileFormat, which is designed for file-based I/O operations. This fits our use cases at LinkedIn, where datasets are typically read directly from and written directly to disk, making FileFormat a more suitable interface for those tasks.
The following diagram shows the building blocks.