The diagram below illustrates the architecture of the knowledge graph service as of today at Airbnb. It can be divided into three components: graph storage, graph query API, and storage mutator. In this section, we will go into the details of each of them.
Graph Storage
The first thing we built for the knowledge graph infrastructure was a graph storage module. We adopted an in-house relational data store as the underlying database, on top of which we implemented a node store and an edge store, so that one can directly perform CRUD (create, read, update, and delete) operations on nodes (entities) and edges (relationships) instead of dealing with rows in database tables. Each node or edge is assigned a globally unique identifier (GUID). We can fetch nodes and edges by GUID; in addition, we can fetch edges of a specific type that connect certain nodes.
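To make the idea of node and edge CRUD over relational tables concrete, here is a minimal sketch. It uses SQLite as a stand-in for the in-house relational store, and the class name `GraphStore`, the table layout, and the method names are all assumptions made for illustration, not the actual Airbnb implementation. For brevity, all nodes share one table here, whereas the real storage keeps each node type in its own table.

```python
import json
import sqlite3
import uuid


class GraphStore:
    """Hypothetical node/edge store that hides relational rows behind CRUD calls."""

    def __init__(self):
        # In-memory SQLite stands in for the in-house relational data store.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE nodes (guid TEXT PRIMARY KEY, type TEXT, props TEXT)"
        )
        self.db.execute(
            "CREATE TABLE edges (guid TEXT PRIMARY KEY, type TEXT, src TEXT, dst TEXT)"
        )

    def create_node(self, node_type, props):
        guid = uuid.uuid4().hex  # globally unique identifier for the node
        self.db.execute(
            "INSERT INTO nodes VALUES (?, ?, ?)", (guid, node_type, json.dumps(props))
        )
        return guid

    def get_node(self, guid):
        row = self.db.execute(
            "SELECT type, props FROM nodes WHERE guid = ?", (guid,)
        ).fetchone()
        if row is None:
            return None
        return {"guid": guid, "type": row[0], "props": json.loads(row[1])}

    def update_node(self, guid, props):
        self.db.execute(
            "UPDATE nodes SET props = ? WHERE guid = ?", (json.dumps(props), guid)
        )

    def delete_node(self, guid):
        self.db.execute("DELETE FROM nodes WHERE guid = ?", (guid,))

    def create_edge(self, edge_type, src, dst):
        guid = uuid.uuid4().hex
        self.db.execute(
            "INSERT INTO edges VALUES (?, ?, ?, ?)", (guid, edge_type, src, dst)
        )
        return guid

    def get_edges(self, edge_type, src):
        # Fetch edges of one type that start from a given node.
        rows = self.db.execute(
            "SELECT guid, dst FROM edges WHERE type = ? AND src = ?", (edge_type, src)
        ).fetchall()
        return [{"guid": g, "type": edge_type, "src": src, "dst": d} for g, d in rows]


store = GraphStore()
city = store.create_node("city", {"name": "Beijing"})
landmark = store.create_node("landmark", {"name": "Example Landmark"})
store.create_edge("landmark_in_city", landmark, city)
```

Callers work only with GUIDs and property dictionaries; the SQL stays inside the store, which is what makes the underlying database replaceable.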
To build ontology and relationships into the knowledge graph, the nodes in the graph storage are divided into different node types, and each node type is defined by its own schema. For example, the place node type is defined by a name and GPS coordinates, while the event node type is defined by a name, date, and venue. Different node types are stored in separate tables in the underlying database.
Similarly, edges can be of different edge types to reflect different types of relationships among entities (such as landmark-in-city and language-spoken-in-country). Analogous to domain and range in RDFS, each edge type has a configurable constraint on the type of node it starts from and the type of node it connects to. For example, a landmark-in-city edge must connect a landmark node to a city node.
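The schema and constraint checks described above can be sketched as follows. The field names, the `EdgeType` class, and the two validation helpers are hypothetical; only the example node types and the landmark-in-city constraint come from the text.

```python
from dataclasses import dataclass

# Required fields per node type (field names are illustrative assumptions).
NODE_SCHEMAS = {
    "place": {"name", "gps"},
    "event": {"name", "date", "venue"},
    "landmark": {"name"},
    "city": {"name"},
}


@dataclass(frozen=True)
class EdgeType:
    name: str
    domain_type: str  # node type the edge must start from
    range_type: str   # node type the edge must point to


EDGE_TYPES = {
    "landmark_in_city": EdgeType("landmark_in_city", "landmark", "city"),
}


def validate_node(node_type, props):
    """Raise ValueError if a node is missing fields required by its schema."""
    missing = NODE_SCHEMAS[node_type] - set(props)
    if missing:
        raise ValueError(f"{node_type} node is missing fields: {sorted(missing)}")


def validate_edge(edge_type, src_type, dst_type):
    """Raise ValueError if the endpoints violate the edge type's constraint."""
    spec = EDGE_TYPES[edge_type]
    if (src_type, dst_type) != (spec.domain_type, spec.range_type):
        raise ValueError(
            f"{edge_type} must connect {spec.domain_type} -> {spec.range_type}, "
            f"got {src_type} -> {dst_type}"
        )


validate_node("event", {"name": "Concert", "date": "2019-01-01", "venue": "Hall"})
validate_edge("landmark_in_city", "landmark", "city")
```

Rejecting a malformed edge at write time keeps every stored landmark-in-city edge safe to traverse without re-checking endpoint types at query time.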
Moreover, the graph storage is designed to store edges from different data sources, so that multiple teams (as data owners) can contribute data to the knowledge graph. Thus, each edge also records its source and a confidence score. To make it unlikely that one data owner’s operation affects data from other teams, we store the edges from each data source in a separate table in the underlying database. The storage can also hold an additional payload for an edge; for example, a home-near-landmark edge can carry the distance between the Home listing and the landmark.
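A rough sketch of this per-source layout is shown below. The `Edge` fields mirror the description above, while the class `SourcePartitionedEdgeStore`, the source names `team_a` and `team_b`, and the `distance_m` payload key are assumptions for illustration. Each source gets its own in-memory list in place of a separate database table.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Edge:
    edge_type: str
    src: str
    dst: str
    source: str        # data source (owning team) that produced the edge
    confidence: float  # confidence score assigned by the source
    payload: dict = field(default_factory=dict)  # optional extra data


class SourcePartitionedEdgeStore:
    """Keeps one 'table' (a list here) per data source."""

    def __init__(self):
        self._tables = defaultdict(list)

    def add(self, edge):
        self._tables[edge.source].append(edge)

    def delete_source(self, source):
        # Dropping one source's table leaves every other source untouched.
        self._tables.pop(source, None)

    def find(self, edge_type, src, sources=None, min_confidence=0.0):
        names = list(self._tables) if sources is None else sources
        return [
            edge
            for name in names
            for edge in self._tables.get(name, [])
            if edge.edge_type == edge_type
            and edge.src == src
            and edge.confidence >= min_confidence
        ]


edges = SourcePartitionedEdgeStore()
edges.add(Edge("home_near_landmark", "listing:1", "landmark:9", "team_a", 0.9,
               {"distance_m": 350}))
edges.add(Edge("home_near_landmark", "listing:1", "landmark:9", "team_b", 0.4))
trusted = edges.find("home_near_landmark", "listing:1", sources=["team_a"])
```

Partitioning by source means a bulk delete or reload by one team is a table-level operation that cannot accidentally touch another team's edges.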
At Airbnb, the data in the knowledge graph is not always consumed through online queries, so we also dump a daily snapshot of the nodes and edges into a data warehouse for offline use. Applications such as our auto-complete service depend on the knowledge graph’s data dump for their product needs. In addition, we apply machine learning techniques to the data dump for purposes such as graph embedding and knowledge inference.
Lastly, we’d like to reflect on our choice of the underlying database for graph storage. Why did we adopt a relational database instead of a graph database? The short answer is operational overhead. At the time, we didn’t have a production-ready graph database at Airbnb, and using the existing relational database had the following advantages:
- Our in-house relational database proved reliable as it had been widely used. It also came with a lot of useful features, such as an easy-to-use client, schema migration tools, monitoring and alerting as well as daily data export.
- Using a graph database meant we would have to set it up within Airbnb’s foundation, debug any reliability or performance issues, and develop the additional features that we would need. It would have slowed down our progress and distracted us from the knowledge graph itself.
So far, the graph storage has performed satisfactorily on the relational database. We also carefully encapsulated and consolidated the logic that deals with the database and hid it from the rest of the knowledge graph codebase. By doing that, we have the flexibility to replace the underlying database if it becomes necessary in the future.
Graph Query API
As we started using the knowledge graph in production, we noticed that most of the product use cases needed to traverse a subgraph and retrieve nodes and edges from that traversal. For example, on Airbnb’s product detail page (PDP, or listing page), the knowledge graph is queried to display points of interest near the Home listing, along with photos for each of the restaurants, museums, or landmarks mentioned. In graph-theory terms, this query needs to traverse (1) all place nodes that are connected to a specific Home listing node, and (2) the photo nodes connected to the place nodes fetched in the previous step.
To support these product needs, we implemented a graph query endpoint in addition to the CRUD endpoints for nodes and edges in the knowledge graph API module. With a graph query, one can traverse the graph by specifying a path, which is a sequence of edge types and data filters, starting from certain nodes, and receive the traversed subgraph in a structured format. The graph query API has a recursive interface, so one can traverse the knowledge graph in multiple steps.
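As a rough illustration of a path-based, recursive traversal, the sketch below follows a sequence of steps from a set of start nodes and returns the traversed subgraph as nested dictionaries. The `Step` class, the `traverse` function, the edge type names `near_place` and `has_photo`, and the sample data are all invented for this example and do not reflect the actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    edge_type: str
    node_filter: Optional[Callable[[dict], bool]] = None  # optional data filter


def traverse(nodes, edges, start_ids, path):
    """Recursively follow `path` from `start_ids`; return a nested subgraph."""
    result = []
    for node_id in start_ids:
        entry = {"id": node_id, "props": nodes[node_id], "children": []}
        if path:
            step, rest = path[0], path[1:]
            next_ids = [
                dst
                for (src, etype, dst) in edges
                if src == node_id
                and etype == step.edge_type
                and (step.node_filter is None or step.node_filter(nodes[dst]))
            ]
            # Each remaining step is handled by a recursive call.
            entry["children"] = traverse(nodes, edges, next_ids, rest)
        result.append(entry)
    return result


# Made-up sample graph: one listing, two nearby places, one photo.
nodes = {
    "listing:1": {"type": "listing"},
    "place:1": {"type": "place", "name": "Museum"},
    "place:2": {"type": "place", "name": "Restaurant"},
    "photo:1": {"type": "photo"},
}
edges = [
    ("listing:1", "near_place", "place:1"),
    ("listing:1", "near_place", "place:2"),
    ("place:1", "has_photo", "photo:1"),
]

# Step 1: places near the listing; step 2: photos of those places.
subgraph = traverse(nodes, edges, ["listing:1"], [Step("near_place"), Step("has_photo")])
```

The nested result keeps each photo attached to the place it belongs to, so a page can render it without further lookups.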
To give you a taste, let’s look at an example. Suppose one wants to find all place nodes that are connected to the city node “Beijing” by edges of type “contains_location” and that (1) have more than 5,000 listings nearby and (2) belong to the “scenic” category. This query can be written as follows.
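A hypothetical rendering of this query as a declarative structure, with a tiny evaluator, might look like the sketch below. The query format, the property names `listing_count` and `category`, the `run_query` helper, and the sample data are assumptions made for illustration; they are not the actual query syntax of the knowledge graph API.

```python
import operator

# Supported filter operators in this sketch.
OPS = {"gt": operator.gt, "eq": operator.eq}

QUERY = {
    "start": {"type": "city", "name": "Beijing"},
    "path": [
        {
            "edge_type": "contains_location",
            "filters": [
                {"field": "listing_count", "op": "gt", "value": 5000},
                {"field": "category", "op": "eq", "value": "scenic"},
            ],
        }
    ],
}


def run_query(nodes, edges, query):
    """Return the ids of nodes reached after following every step of the path."""
    start = query["start"]
    frontier = [
        node_id
        for node_id, props in nodes.items()
        if all(props.get(key) == value for key, value in start.items())
    ]
    for step in query["path"]:
        next_frontier = []
        for src, etype, dst in edges:
            if src not in frontier or etype != step["edge_type"]:
                continue
            props = nodes[dst]
            if all(
                f["field"] in props and OPS[f["op"]](props[f["field"]], f["value"])
                for f in step["filters"]
            ):
                next_frontier.append(dst)
        frontier = next_frontier
    return frontier


# Made-up sample data.
nodes = {
    "city:beijing": {"type": "city", "name": "Beijing"},
    "place:a": {"type": "place", "listing_count": 8000, "category": "scenic"},
    "place:b": {"type": "place", "listing_count": 3000, "category": "scenic"},
    "place:c": {"type": "place", "listing_count": 9000, "category": "nightlife"},
}
edges = [
    ("city:beijing", "contains_location", "place:a"),
    ("city:beijing", "contains_location", "place:b"),
    ("city:beijing", "contains_location", "place:c"),
]

matches = run_query(nodes, edges, QUERY)
```

Only `place:a` satisfies both filters in the sample data, so it is the single node returned.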
As mentioned above, the knowledge graph is designed to store data from multiple data sources. Through our knowledge graph API, data from all sources is available to query, and a graph query can specify which data sources to query. Meanwhile, we are also working on a data reconciliation layer, which aims to aggregate data from different sources, reconcile conflicts, and provide a consistent view of the data when users don’t know which data sources to trust.
By now, the knowledge graph can fully support use cases such as fetching all landmarks close to a Home at Airbnb, since such a request can be converted to a graph query. However, some use cases cannot be directly expressed as a graph query, for example, fetching the most popular landmark around a Home. We are actively working on such fuzzy queries by incorporating the landmark’s metadata and the user’s personalization signals via ML.
Storage Mutator
For many of our product use cases, we need to continuously import data into the graph storage and propagate these mutations downstream. In some cases it is suboptimal to write data synchronously through the knowledge graph API, for the following reasons:
- It is an operational burden to synchronously call the knowledge graph API from every pipeline that writes data to the knowledge graph, since the pipelines are implemented in different tech stacks (e.g., Airflow, IDL services) and each pipeline needs to handle issues such as rate limiting and retrying on exceptions.
- Writing data through the API can interfere with other crucial online uses of the knowledge graph (e.g., search and the PDP), especially when write traffic spikes or when the write path on the graph storage is faulty.
On top of the graph storage, we built a storage mutator to resolve this issue. In addition to calling the API, a data pipeline can send a mutation request to the knowledge graph by emitting a message to a specific topic on our Kafka message bus; the mutation consumer subscribes to this topic and writes the data into the knowledge graph upon receiving the messages. This pattern simplifies writing data into the knowledge graph from various pipelines and is now our primary way to import data. We are also planning to use it for functionality such as storage rollback and third-party data ingestion.
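The consumer side of this pattern can be sketched as follows. To keep the example self-contained, a standard-library `queue.Queue` stands in for the Kafka topic and a plain dictionary stands in for the graph storage; the message format (`op`, `guid`, `data`) and the function names are assumptions, not the actual mutation schema.

```python
import json
import queue


def apply_mutation(graph, message):
    """Apply one JSON-encoded mutation request to the graph."""
    mutation = json.loads(message)
    op, guid = mutation["op"], mutation["guid"]
    if op == "upsert":
        graph[guid] = mutation["data"]
    elif op == "delete":
        graph.pop(guid, None)
    else:
        raise ValueError(f"unknown mutation op: {op}")


def consume(topic, graph):
    """Drain all pending messages from the topic; return how many were applied."""
    applied = 0
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            return applied
        apply_mutation(graph, message)
        applied += 1


# A pipeline emits mutation requests instead of calling the API synchronously.
topic = queue.Queue()  # stand-in for the Kafka topic
topic.put(json.dumps({"op": "upsert", "guid": "place:1", "data": {"name": "Museum"}}))
topic.put(json.dumps({"op": "upsert", "guid": "place:2", "data": {"name": "Park"}}))
topic.put(json.dumps({"op": "delete", "guid": "place:2"}))

graph = {}  # stand-in for the graph storage
count = consume(topic, graph)
```

Because producers only enqueue messages, concerns such as rate limiting and retries live in one consumer rather than in every pipeline, and write spikes are absorbed by the message bus instead of hitting the online API.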
In the storage mutator, we also built a mutation publisher to propagate data mutations to the Kafka message bus. Downstream pipelines can consume these messages for their product use cases. An example is the search index pipeline, in which the knowledge graph populates categorization data into the search index via this pattern. We will dive into this use case in the next section.