LIquid: The soul of a new graph database, Part 2

This allows us to have distinct relative identities for a predicate: “the subject side,” sm, and “the object side,” om, without bothering the user to come up with unique identifiers for them.

LIquid’s nearest neighbors 

Let’s define what we’ve built. A “graph database” is an implementation of the relational model with the following properties:

  1. All relations are equal and first-class. While a useful form of index for predictable access patterns, employing tables as the physical data model results in inequities as soon as there is more than one table. Tuples expressed in a single physical table are vastly more performant than those expressed by joining two or more tables. Lastly, the lack of domain support in tabular relational systems makes the user’s intentions with respect to possible joins opaque. In a graph database, the existence of an edge is exactly expressive of user intent.

  2. Edge traversal is fast and constant-time. If everything is stored as edges, we’re going to be doing a lot of joining, specifically self-joins. Every query with small intermediate results will almost certainly devolve into random access.

  3. Query results are a subgraph. Application models are complex and applications need to be able to retrieve arbitrarily complex models in a single query with a single consistency guarantee.

  4. Schema-evolution is constant-time. If schema-evolution (really, the addition of new predicates) takes non-constant time, then the database implementation must be pretending to operate with edges while maintaining structs or tables under the covers.

There are many things in the world that call themselves graph databases, roughly divided into three families:

  1. RDF, SPARQL: triple stores

  2. Property graphs: Neo4J, Facebook’s Tao, various Gremlin implementations

  3. SQL and SQL extensions such as SQL Server’s graph support

Aside from #3, none of them have clear relationships to the relational model. None of them support n-ary compound edges. Query evaluation is either relational algebra-based, or imperative (traversal-based).

RDF is quite close to LIquid’s Edge in spirit, but has introduced a great deal of complexity. Triples are asymmetric. Actually, they are quads. There is a language tag that is difficult to square with both unicode string encoding, which is multilingual, and no type system (there are several) that can express cardinality constraints like: “one name in any language.” SPARQL seems to have inherited most of SQL’s ailments.

Property graphs introduce a second schema for properties and force the user to choose between representing data as edges or as properties of nodes or edges. A node with properties is basically an object and suffers from all of the drawbacks of OODBs. The obvious encoding of a property graph as a table of vertices (with properties) and a table of edges (with properties) will force two lookups for any query involving properties of vertices and edges.

While many graph database implementations are effectively “in RAM,” none explicitly optimize for random access with hash-based indexing, or exploit fast random access to set sizes in query evaluation. Nothing targets complex queries or supports the query composition necessary to do so easily.

We have defined what a graph database is and described the implementation of one that supports the Datalog query language. We have also suggested how to index graph data for fast, constant-time access and how to construct a dynamic query planner based on constraint satisfaction. The graph data model is simple and universal, and features a self-describing schema which allows us to specify relationship cardinality and value domains. And, this schema allows us to define n-ary compound edges, which we can use to implement property-graphs. So, “Why aren’t we done?”

Yeah actually, we’re pretty much done.

Did we mention that LIquid is fun to use?

Team LIquid

Thanks to team LIquid at LinkedIn for many years of hard work making this database a reality. In particular: Bogdan Arsintescu, Roman Averbukh, Andrew Carter, Juan Colmenares, Ionut Constandache, Peter Li, Scott Meyer, Andrew Rodriguez, Siddharth Shah, Yiming Yang, and Jiajun Yao.

Source link