Editor’s note: In this two-part blog series, we introduce LIquid, a new graph database built by LinkedIn to support human real-time querying of the economic graph. It is a complete implementation of the relational model that supports fast, constant-time traversal of graph edges with a relational graph data model that is simple and self-describing, yet still manages to support the definition and indexing of complex n-ary relationships and property graphs. LIquid’s interface is a declarative query language based on Datalog. LIquid’s query processing engine achieves high performance by using dynamic query planning on wait-free shared-memory index structures.
Part 1 below will describe how graph data relates to applications, specifically the in-memory graph of structures commonly used as “models” in the sense of “Model-View-Controller” as understood by graphical applications. Since the reason for having a database is ultimately applications, this seems like a good place to start.
In part 2, we will describe how graph data can be stored in a conventional relational database, and why conventional relational databases don’t work very well for this purpose. Rather than scrapping the relational model completely, we will go on to show how relational graph data can be processed efficiently.
Introducing LIquid, LinkedIn’s in-house graph database
Our team at LinkedIn has spent the last four years building a new graph database named LIquid. LIquid is a single, automated service that replaces the hand-coded queries that sometimes added several hundred milliseconds to page load times for members. It can be queried with a powerful general-purpose query language and returns the necessary results efficiently.
Why does LinkedIn need a graph database? We’ll get to a more formal answer later in this post, but here’s the intuition: the value of an economic graph for the average member lies mostly in their second degree network. These are the connections of your connections, such as the colleague of an old school friend, or the new boss of a co-worker from a prior job.
On LinkedIn, your first degree network is likely small, typically a few hundred current or former coworkers and other people you already know. Your third degree network is likely very large, but it is hard to act on, say by leveraging these relationships to get a job, because doing so requires making two sequential new connections. However, the set of second degree connections typically contains at least 50,000 entities (e.g., people, schools, employers). It is large enough to be diverse, but at the same time actionable: you are one new connection away from something that can happen. In fact, your next job is probably in your second degree network.
A first degree network of, for example, 250 connections would be easily handled by a simple table or key-value store. Second degree networks are a much more challenging software engineering problem because the set of second degree connections is too large to pre-materialize and store, and the write amplification involved in keeping a pre-materialized second degree connection set up to date—roughly 250 times the base write rate, or once for each first degree connection—makes this approach impractical. On the other hand, computing second degree connections on demand is no picnic either. The first degree is easy to materialize, but thereafter, the join to produce the second is daunting, particularly if you have to search a table of billions of edges stored in conventional sorted storage (typically some sort of terabyte-scale B-tree). For an average first degree, this join is extremely sparse: effectively 250 random look-ups.
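To make the shape of that on-demand join concrete, here is a minimal sketch, assuming a simple in-memory adjacency index (the member names and the `adjacency` structure are invented for illustration, not LIquid's actual storage):

```python
# Sketch: computing second-degree connections on demand.
# The adjacency index and member ids are illustrative.
adjacency = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

def second_degree(member):
    first = adjacency[member]
    # One random lookup per first-degree connection: for a typical
    # member with ~250 connections, this join is ~250 random reads.
    second = set()
    for connection in first:
        second |= adjacency[connection]
    # Exclude the member and their first degree from the result.
    return second - first - {member}

print(sorted(second_degree("alice")))  # -> ['dave', 'erin']
```

The sparseness is visible in the loop: each first-degree connection triggers an independent lookup, which is cheap in memory but ruinous against terabyte-scale sorted storage.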
For these reasons, LinkedIn has built a series of increasingly general-purpose graph serving systems to provide real-time access to the core of the economic graph. Initially, the graph only stored connections in a very simple format, a tuple of (source, dest, score). Subsequent versions were extended to include other types of edges such as employment and education. However, edges were limited to 2 integer endpoints and 64 bits of attribute data, (type, source, dest, attributes). LIquid is the most recent of these systems, and the first one which is general purpose: a database that implements the relational model, fully described by a schema and accessed by a declarative query language with functionality comparable to SQL. In LIquid, edges are triples of strings, (subject, predicate, object), and compounds can be built out of edges to represent n-ary relationships with arbitrary attributes.
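To illustrate the triple representation, here is a small sketch of how a compound can reify an n-ary relationship such as a job, with one edge per attribute hanging off a shared subject. The predicate names and identifiers are invented for illustration, not LIquid's actual schema:

```python
# Sketch: edges as (subject, predicate, object) string triples.
# An n-ary employment relationship is a compound: a node ("emp:1")
# that each attribute edge hangs off of. Names are illustrative.
edges = [
    ("emp:1", "employment/member",  "member:alice"),
    ("emp:1", "employment/company", "company:acme"),
    ("emp:1", "employment/title",   "Software Engineer"),
    ("emp:1", "employment/start",   "2019-04"),
]

def objects(subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in edges if s == subject and p == predicate]

print(objects("emp:1", "employment/company"))  # -> ['company:acme']
```

Because every attribute is itself an edge, new attributes can be added to a relationship without changing any fixed record layout, which is what the earlier (type, source, dest, attributes) systems could not do.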
From objects to graphs
Faced with the problem of working with second degree connections, most programmers would probably start out by putting everything in memory as objects and chasing pointers. For application programmers, this has been the predominant approach to modeling the world since Smalltalk popularized object-orientation and the model-view-controller approach to application development in the 1970s. While sequential access is always faster, random access in memory is fast enough that programmers are pretty blithe about adding an extra layer of indirection. The stunning complexity of modern UIs suggests that second degree networks, which might take thousands of random accesses to produce, should be tractable to work with in human real time. Browsers, word processors, and spreadsheets all traverse tens or hundreds of thousands of pointers rapidly enough for people to consider them instantaneous.
Since our goal is to build applications, let’s start with a conventional application model, a graph of objects. If we were to build a graph of authors and books, it might look something like this:
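A minimal sketch of such an object model, with illustrative class and field names:

```python
# Sketch: a conventional in-memory "model" as a graph of objects
# connected by pointers. Class and field names are illustrative.
class Author:
    def __init__(self, name):
        self.name = name
        self.books = []        # pointers to Book objects

class Book:
    def __init__(self, title, author):
        self.title = title
        self.author = author   # back-pointer to the Author
        author.books.append(self)

twain = Author("Mark Twain")
book = Book("Tom Sawyer", twain)

# Traversal is pointer-chasing: constant-time hops in either direction.
print(book.author.name)                 # -> Mark Twain
print([b.title for b in twain.books])   # -> ['Tom Sawyer']
```

Each edge of the graph is just a reference, so following it is a single memory access in either direction.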