Large-scale deep learning models are changing the way we think about images of homes on our platform.
Authors: Shijing Yao, Qiang Zhu
Airbnb is a marketplace featuring millions of homes. Travelers around the world search on the platform and discover the best homes for their trips. Aside from location and price, listing photos are one of the most critical factors for decision-making during a guest’s search journey. However, until very recently, we knew very little about these important photos. When a guest interacted with the listing photos of a home, we had no scalable way to help them find the most informative images, to ensure the information conveyed in the photos was accurate, or to advise hosts on how to improve the appeal of their images.
Thanks to recent advances in computer vision and deep learning, we are able to leverage technology to solve these problems at scale. We started with a project that aimed to categorize our listing photos into different room types. For one thing, categorization makes possible a simple home tour where photos with the same room type are grouped together. For another, categorization makes it much easier to validate the number of certain rooms and check whether the basic room information is correct. Going forward, we believe there are lots of exciting opportunities to further enhance our knowledge of image content on Airbnb. We will show some examples at the end of this post.
The ability to correctly classify the room type for a given listing photo is incredibly useful for optimizing the user experience. On the guest side, it facilitates re-ranking and re-layout of photos based on distinct room types so that the ones people are most interested in will be surfaced first. On the host side, it helps us automatically review listings to ensure they abide by our marketplace’s high standards. Accurate photo categorization is the backbone for these core functions.
The first batch of room types we sought to classify included Bedrooms, Bathrooms, Living Rooms, Kitchens, Swimming Pools and Views. We expect to add other room types based on the needs from product teams.
The room-type classification problem largely resembles the ImageNet classification problem, except that our model outputs are customized room types. This makes off-the-shelf state-of-the-art deep neural network (DNN) models such as VGG, ResNet and Inception not directly applicable in our case. There are a number of great posts online that explain how to cope with this issue. Basically, we should 1) modify the last (top) few layers of the DNN so that the output dimension matches ours and 2) re-train the DNN to a certain degree until it achieves satisfactory performance. After a few experiments with these models, we chose ResNet50 as our powerhouse due to its good balance between model performance and computation time. To make it compatible with our use case, we added two extra fully connected layers and a Softmax activation at the end. We also experimented with a few training options, which will be discussed in the next section.
Re-train a Modified ResNet50
Re-training ResNet50 falls into three scenarios:
- Keep the base ResNet50 model fixed and only re-train the two added layers using minimal data. This is often called fine-tuning.
- Do the same fine-tuning as in the first scenario, but with much more data.
- Re-train the whole modified ResNet50 from scratch.
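The modified network and the difference between the scenarios above can be sketched roughly as follows. This is a minimal illustration using the Keras API, not the code used in the project: the function name, the hidden layer width, the input size and the `train_base` switch are all assumptions. `weights=None` keeps the sketch free of downloads; passing `weights="imagenet"` would load pre-trained weights instead.

```python
import tensorflow as tf


def build_room_classifier(num_classes=6, hidden_units=256,
                          weights=None, train_base=False):
    """Sketch of a ResNet50 with two extra dense layers and a softmax.

    hidden_units and the 224x224 input size are illustrative assumptions.
    train_base=False freezes the base network (first two scenarios);
    train_base=True re-trains every layer (third scenario).
    """
    base = tf.keras.applications.ResNet50(
        weights=weights,          # None = random init, "imagenet" = pre-trained
        include_top=False,        # drop the original 1000-class ImageNet head
        pooling="avg",            # global average pooling -> 2048-d vector
        input_shape=(224, 224, 3),
    )
    base.trainable = train_base

    # Two added fully connected layers; the second ends in a softmax.
    x = tf.keras.layers.Dense(hidden_units, activation="relu")(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs=base.input, outputs=outputs)
```

With `train_base=False` only the kernels and biases of the two new layers are updated during training, which is why that option is fast; with `train_base=True` all of the roughly 25 million parameters are trained.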
Most of the online tutorials use the first approach because it’s fast and usually leads to decent results. We tried the first approach and indeed got some reasonable initial results. However, in order to launch a high-quality image product, we needed to improve the model performance dramatically, ideally achieving 95%+ precision and 80%+ recall.
To achieve high precision and high recall simultaneously, we realized that using massive data to re-train the DNN was inevitable. However, there were two major challenges: 1) Even though we had lots of listing photos uploaded by hosts, we didn’t have accurate room-type labels associated with them, if we had any at all. 2) Re-training a DNN like ResNet50 was highly non-trivial: there were more than 25 million parameters to train, and this required substantial GPU support. These two challenges will be addressed in the next two sections.
Supervision With Image Captions
Many companies leverage third-party vendors to obtain high-quality labels for image data. This is obviously not the most economical solution for us when millions of photos need to be labeled. To balance cost and performance, we approached this labeling problem in a hybrid way. On one side, we asked vendors to label a relatively small number of photos, usually in the thousands or tens of thousands. This chunk of labeled data would be used as a golden set for us to evaluate models. We used random sampling to get this golden set and ensured the data was unbiased. On the other side, we leveraged image captions created by hosts as a proxy for room-type information and extracted labels from them. This idea was huge for us because it made the expensive labeling task essentially free. We only needed a judicious way to ensure that the room-type labels extracted from image captions were accurate and reliable.
A tempting method to extract a room-type label from an image caption is as follows: if a certain room-type keyword is found in the caption of an image, the image is labeled as that type. However, the real world is more complicated than that. If you examined the results of this rule, you’d be very disappointed. We found numerous cases where the image caption was far off the actual content of that image. Below are a few bad examples.
To filter out bad examples like these, we added extra rules when extracting room-type labels from image captions. After several rounds of filtering and checking, the label quality was greatly improved. Below is an example of how we filtered Kitchen data to obtain relatively “clean” Kitchen images.
Due to these extra filters, we lost quite a lot of image data. This was okay for us because even with such aggressive filtering, we still ended up with a few million photos, a few hundred thousand in each room type. Moreover, the label quality of these photos was now much better. Here we assumed the data distribution didn’t shift with the filtering, which would be validated once we tested the model on an unbiased golden dataset.
Having said that, we might have been able to use some NLP techniques to dynamically cluster image captions instead of using rule-based heuristics. However, we decided to stay with heuristics for now and pushed the NLP work to the future.
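To make the difference between the naive keyword rule and a filtered rule concrete, here is a small sketch. The keyword lists, the relational phrases and the word limit are hypothetical; the actual filters are not spelled out in this post, so treat this as one plausible shape for such heuristics rather than the rules that were used.

```python
import re

# Hypothetical keyword lists; the real ones are not given in the post.
ROOM_KEYWORDS = {
    "bedroom": ("bedroom",),
    "bathroom": ("bathroom",),
    "living_room": ("living room",),
    "kitchen": ("kitchen",),
    "pool": ("swimming pool", "pool"),
}

# Hypothetical phrases hinting that a room is mentioned but not pictured.
RELATIONAL_PHRASES = ("view from", "next to", "from the", "leads to", "door to")


def matched_room_types(caption):
    """Return the set of room types whose keywords appear in the caption."""
    text = caption.lower()
    return {
        room
        for room, keywords in ROOM_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)
    }


def naive_label(caption):
    """Tempting rule: any keyword hit labels the photo."""
    matches = matched_room_types(caption)
    return sorted(matches)[0] if matches else None


def filtered_label(caption, max_words=8):
    """Stricter rule: keep a label only when the caption is unambiguous."""
    text = caption.lower()
    matches = matched_room_types(caption)
    if len(matches) != 1:                       # zero or several room types
        return None
    if len(text.split()) > max_words:           # long captions tend to wander
        return None
    if any(phrase in text for phrase in RELATIONAL_PHRASES):
        return None                             # room is only a reference point
    return next(iter(matches))
```

For instance, `naive_label("View of the garden from the kitchen")` returns `"kitchen"`, while `filtered_label` rejects the same caption and still accepts `"Fully equipped kitchen"`. Rules like these trade data volume for label quality.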
Model Building, Evaluating, and Production
Re-training a DNN like ResNet50 using a few million images requires a lot of computational resources. In our implementation, we used an AWS p2.8xlarge instance with 8 Nvidia K80 GPUs, and sent a batch of 128 images to the 8 GPUs per training step. We did parallel training with TensorFlow as the backend. We compiled the model after parallelizing it because otherwise the training wouldn’t work. To further speed up training, we initialized model weights with pre-trained ImageNet weights loaded from keras.applications.resnet50.ResNet50. The best model was obtained after 3 epochs of training, which lasted about 6 hours. Afterward, the model started to overfit and the performance on the validation set stopped improving.
One important note is that in production we built multiple binary-class models for different room types instead of a single multi-class model covering all room types. This was not ideal, but since our model serving was mostly offline, the extra delay due to multiple model calls affected us minimally. We will transition to a multi-class model in production soon.
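As an illustration of the "parallelize first, then compile" order, the sketch below uses `tf.distribute.MirroredStrategy`, a data-parallel utility in current TensorFlow. This is a stand-in and not necessarily the utility used in the project; the helper name, the `build_fn` argument, the optimizer and the loss are assumptions.

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 128  # images per training step across all GPUs


def build_parallel_model(build_fn):
    """Create and compile a Keras model under a data-parallel strategy.

    build_fn is a hypothetical zero-argument function that returns an
    uncompiled tf.keras.Model, for example a modified ResNet50.
    """
    # MirroredStrategy replicates the model on every visible GPU and
    # uses a single replica when only one device is available.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = build_fn()
        # Compile only after the model exists under the strategy.
        model.compile(
            optimizer="adam",                  # assumed optimizer
            loss="binary_crossentropy",        # one binary model per room type
            metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
        )
    # Each replica processes an equal share of the global batch.
    per_replica_batch = GLOBAL_BATCH_SIZE // strategy.num_replicas_in_sync
    return model, per_replica_batch
```

On a machine with 8 GPUs, a global batch of 128 images works out to 16 images per GPU per step.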
We evaluated our models based on precision and recall. We also monitored metrics like F1 score and accuracy. Their definitions are reiterated below. In a nutshell, precision describes how confident we are about the accuracy of our positive predictions, and recall describes what fraction of all actual positives our positive predictions cover. Precision and recall usually go against each other. In our context, we set a high bar (95%) for precision because when we claim a photo is a certain room type, we should really have high confidence in that claim.
A confusion matrix is the key to calculating these metrics. Our model’s raw output is a probability score from 0 to 1 for each image. To compute a confusion matrix for a set of predictions, one has to first set a particular threshold to translate the predicted scores into 0s and 1s. A precision-recall (P-R) curve is then generated by sweeping the threshold from 0 to 1. In principle, the closer to 1 the AUC (Area Under Curve) of a P-R curve is, the more accurate the model is.
In evaluating the models, we used the aforementioned golden set, where the ground truth labels were provided by humans. Interestingly, we found that accuracy differed from room type to room type. The Bedroom and Bathroom models were the most accurate, while the other models were less accurate. For brevity, we only show the P-R curves of the Bedroom and Living Room models here. The cross point of the dotted lines represents the final performance given a particular threshold. We append a summary of the metrics on the chart.
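For reference, the standard definitions of these four metrics in terms of confusion-matrix counts (true positives TP, false positives FP, true negatives TN, false negatives FN) are:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{align*}
```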
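The thresholding and sweeping steps described above can be sketched in plain Python. The function names and the convention of reporting precision as 1.0 when nothing is predicted positive are choices made for this illustration.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, FN, TN after thresholding scores into 0/1 predictions."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(scores, labels, threshold):
    """Precision and recall at a single threshold."""
    tp, fp, fn, _ = confusion_counts(scores, labels, threshold)
    # Convention: with no positive predictions, precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pr_curve(scores, labels, num_thresholds=101):
    """Sweep the threshold from 0 to 1; return (threshold, precision, recall)."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return [(t, *precision_recall(scores, labels, t)) for t in thresholds]
```

Plotting recall against precision for the returned points gives the P-R curve, and each point on it corresponds to one choice of threshold.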
There are two important observations:
- The overall performance of the Bedroom model is much better than that of the Living Room model. There could be two explanations: 1) A Bedroom is easier to classify than a Living Room because the Bedroom setting is relatively standard, while a Living Room can have a lot more variety. 2) The labels extracted for Bedroom photos have higher quality than those extracted for Living Room photos, since Living Room photos occasionally also include Dining Rooms or even Kitchens.
- Within each room type, the fully re-trained model (red curve) performs better than the partially re-trained model (blue curve), and the gap is larger between the Living Room models than between the Bedroom models. This suggests that re-training the full ResNet50 model has a larger impact on more difficult room types.
For the 6 models we shipped, precision is generally above 95% and recall is generally above 50%. By setting different threshold values, people can trade precision against recall. The model is set to power a number of different products across multiple product teams inside Airbnb.
The users compared our results to well-known third-party image recognition APIs. It was reported that the in-house model outperformed the third-party generic models overall. This implies that by taking advantage of your own data, you have a chance to outperform even the industry state-of-the-art model for the particular task you are interested in.
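One simple way to make that trade-off explicit is to search for the threshold that gives the highest recall while still meeting a precision bar. The sketch below is an assumption about how such a choice could be automated, not a description of how the shipped thresholds were picked; the function name and return format are made up for the example.

```python
def pick_threshold(scores, labels, min_precision=0.95):
    """Return (threshold, precision, recall) with the highest recall
    among thresholds whose precision meets the bar, or None if none does."""
    best = None
    # Only the observed scores need to be tried as candidate thresholds.
    for threshold in sorted(set(scores)):
        predicted = [score >= threshold for score in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
        # tp + fp >= 1 here because the threshold is one of the scores.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best
```

Lowering `min_precision` generally lets the search pick a lower threshold and therefore a higher recall, which is the trade-off described above.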
At the end of this section, we’d like to showcase a few concrete examples that exemplify the power of this model.
When doing this project, we also tried a few interesting ideas beyond room type classification. We want to show two examples here and give people an idea how exciting these problems are.
Unsupervised Scene Classification
When we first tried out room-type classification using a pre-trained ResNet50 model, we generated image embeddings (2048×1 vectors) for listing cover-page photos. To interpret what these embeddings meant, we projected these long vectors onto a 2D plane using PCA. Much to our surprise, the projected data naturally clustered into two groups. Looking into these two clusters, we found that the left group consisted almost exclusively of indoor scenes and the right group almost exclusively of outdoor scenes. This meant that without any re-training, and simply by setting a cut line on the first principal component of the image embedding, we were able to separate indoor and outdoor scenes. This finding opened the door to a really interesting domain where transfer learning (embedding) meets unsupervised learning.
Another area that we tried pursuing was object detection. A Faster R-CNN model pre-trained on the Open Images Dataset already provided stunning results. As you can see in the example below, the model is able to detect Window, Door and Dining Table along with their locations. Using the TensorFlow Object Detection API, we did some quick evaluations on our listing photos. A lot of the home amenities could be detected using the off-the-shelf model. In the future, we plan to re-train the Faster R-CNN model using Airbnb’s customized amenity labels. Since some of these labels are missing in the open source data, we will likely create labels on our own. With these algorithm-detected amenities, we will be able to verify the quality of the listings from hosts and make it much easier for guests to find homes with specific amenity needs. This will push the frontier of photo intelligence at Airbnb to the next level.
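The "cut line on the first principal component" idea can be sketched with NumPy alone. The function names and the default cut at 0 are assumptions for illustration; in practice the cut position would be read off the projected data.

```python
import numpy as np


def first_component_scores(embeddings):
    """Project embeddings (n_samples x n_features) onto the first
    principal component and return one score per sample."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions of the centered data,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def split_by_cut(embeddings, cut=0.0):
    """Assign each sample to one side of a cut on the first component."""
    return first_component_scores(embeddings) > cut
```

Note that the sign of a principal component is arbitrary, so which side corresponds to indoor and which to outdoor has to be determined by inspecting a few photos from each group.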
Here are a few key take-aways that might be helpful for other deep learning practitioners:
First, deep learning is nothing but one particular kind of supervised learning. One cannot overestimate the importance of high-quality labels to the data. Since deep learning usually requires a significant amount of training data to achieve state-of-the-art performance, finding an efficient way to do labeling is crucial. Fortunately we found a hybrid approach which is economical, scalable and reliable.
Second, training a DNN like ResNet50 from scratch can be quite involved. Try to start in a simple and fast way: train only the top layers using a small dataset. If you do have a large trainable dataset, re-training a DNN from scratch might give you state-of-the-art performance.
Third, parallelize the training if you can. In our case we gained about a 6x (quasi-linear) speed-up by using 8 GPUs. This makes building a complex DNN model computationally viable and makes it much easier to iterate over hyperparameters and model structures.