Medium’s DynamoDB Data Source for Apache Spark – Medium Engineering

Medium’s tech stack includes DynamoDB for our production data, and Apache Spark for backend data processing. We need a fast and reliable integration between these systems to support our data warehouse. Talking with data engineers at other companies, this is a common pairing and we all have somewhat similar internal tools making these systems play well together.

We’ve open-sourced our DynamoDB Data Source for Apache Spark with the goal of making DynamoDB & Spark easy to use together for everyone.

Key features:

Scanning DynamoDB tables in parallel is particularly beneficial because they are partitioned, requiring parallel scans to take advantage of all your provisioned read capacity. We have observed significant scan time improvements when switching to the parallel scanner.

75% improvement with parallel scanner

RDD Integration

Our primary use-case at Medium is using Spark to backup DynamoDB tables to S3. Separate Spark jobs restore the backups — JSON files on S3 which Spark has great support for — into a variety of systems, including Redshift and Databricks. We chose this approach (rather than Data Pipeline) so our ETL and analysis jobs have a common look & feel, keeping things simple for our users and developers.

We start our backup jobs with a command similar to:

--class com.github.traviscrawford.spark.dynamodb.DynamoBackupJob
--packages com.github.traviscrawford:spark-dynamodb:0.0.5 ''
-table users -output s3://myBucket/users
-totalSegments 4 -rateLimit 100

DynamoDB table scans work by defining a set of ScanSpec’s, or description of what data to scan, including which portion of the table to scan, and which projections and/or filters to apply.

This data source creates ScanSpec’s in the Spark driver based on the provided configuration, then distributes them to cluster workers where the table is scanned in parallel.

As a convenience, we provide a backup job that writes the data set to any supported filesystem. For more control, you can use the table scanner as a library.

DataFrame Integration

As a NoSQL database, DynamoDB does not strictly enforce schemas. However, if all records in a table share a schema it may be useful to load the table as a DataFrame.

Here’s an example running SQL on a DynamoDB table. Notice how no schema was provided — it’s inferred by sampling some records in the table. You can optionally specify the schema if needed (we generate ours from protobufs).

import com.github.traviscrawford.spark.dynamodb._

val users ="users").cache()
val data = sqlContext.sql(
"select username from users where username = 'tc'")

You can perform any DataFrame operation on the data that was scanned from the DynamoDB table, such as continued analysis, or saving to S3/Redshift.

Source link