Amenity Detection and Beyond — New Frontiers of Computer Vision at Airbnb


Building highly customized AI technologies into home-sharing products to help our guests belong anywhere.


What amenities are there in this image? Can an algorithm detect them all? How can we train the algorithm effectively? If you are interested in the answers, please read on!

In 2018, we published a blog post introducing an image classification model that categorized listing photos into different room types and helped organize hundreds of millions of listing photos on the Airbnb platform. Since then, the technology has been powering a wide range of internal content moderation tools, as well as some consumer-facing features on the Airbnb website. We hope such image classification technology makes our business more efficient and our products more pleasant to use.

Image Classification is a sub-field of a broader technology called Computer Vision, which deals with how computer algorithms can be made to gain understanding of digital images or videos. Another related sub-field is Object Detection, which deals with detecting instances of semantic objects of a certain class in digital images or videos.

Airbnb has millions of listings worldwide. To make sure our listings uphold high standards of quality, we need to determine whether the amenities advertised online match the actual ones. At our scale, using only human efforts to do so is obviously neither economical nor sustainable. Object Detection technologies, however, can lend us a helping hand, as amenities can be automatically detected in listing photos. Furthermore, the technology opens a new door to a home sharing platform where listing photos are searchable by amenities, which helps our guests navigate through listings much more easily.

From Generic to Customized Solutions

Object Detection technologies evolve rapidly. Just a few years ago, the idea of building an object detection model to detect amenities in a digital picture might have sounded prohibitively difficult and intimidating. Nowadays, a great number of decent solutions have already emerged, some of which require minimal effort. For example, many third-party vendors provide generic object detection APIs which are usually quite cost-effective and easy to integrate into products.

We tested a few different API services with our listing photos. Unfortunately, the results suggested that the APIs fell noticeably short of our business requirements. The image below shows the results for a sample picture from our test data.

A sample amenity detection result of a third-party API service, from an industry-leading vendor

Even though the API service is able to detect certain amenities, the predicted labels are too vague. In a home-sharing business like Airbnb, knowing that some generically labeled object exists in a picture does not tell us much beyond the room type. Likewise, knowing there is a table in the picture doesn't help us either: we don't know what kind of table it is, or what it could be used for. Our actual goal is to understand whether the detected amenities provide convenience for guests. Can guests cook in this home? Do they have the specific cookware they want? Is there a table of a decent size to host enough people on a family trip? A more desirable amenity detection result would look like the one below.

A sample result from Airbnb’s amenity detection model, with more specific labels

As one can see, the predicted labels are much more specific. We can use these results to verify the accuracy of listing descriptions and serve Homes searches to guests with specific amenity requests.

In addition to the third-party APIs, open source projects like the Tensorflow Detection Model Zoo and the Detectron Model Zoo offer collections of free pre-trained object detection models, built on different public image datasets and different model architectures. We tested various pre-trained models from the Model Zoos. Likewise, the results did not meet our requirements either: precision was significantly lower, and some predicted labels were just far off.

These gaps convinced us that we needed a customized model. To build one, we needed to first determine what a customized set of amenities should be, build an image dataset based on that set of amenity labels, and have the images annotated with those labels. By training against images with these annotations, we hoped the model would learn to recognize these amenities and locate each detected instance. This was quite a long journey, and in the following sections we share how we walked through the whole process.

Defining the Taxonomy

A taxonomy is a scheme of amenity labels. Defining a taxonomy that encompasses the amenities of interest is a rather open-ended question. Ideally, the taxonomy should come from a specific business need. In our case, however, the taxonomy was unclear or varied across business units, so we bore the responsibility of coming up with a minimum viable list first. This was quite a stretch for us as data scientists, given the limits of our scope, but we believe it is a common problem in many organizations. Our strategy was to start with something lightweight, and then to iterate fast.

Lacking prior experience, we decided to start from something people had worked on before, hoping to find some hints. We found that Open Images Dataset V4 offered a vast amount of image data. It included about 9M images annotated with image-level labels, object bounding boxes (BBs), and visual relationships. In particular, the BB annotations spanned a rich set of 600 object classes. These classes formed a hierarchical structure and covered a wide spectrum of objects, from broad top-level categories down to a collection of specific household items. Our goal was to find the object classes that were relevant to amenities and to filter out the rest.

We manually reviewed the 600 classes and selected around 40 that were relevant to our use case, generally important amenities found across the major room types. Open Images Dataset V4 saved us a lot of time: had we started from scratch, building a reasonable taxonomy alone would have taken a long time.
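As a sketch, the filtering step boils down to intersecting the dataset's class descriptions with an amenity shortlist. Open Images ships class descriptions as (label ID, display name) pairs; the class names and MIDs below are illustrative, not our actual 40-class selection.

```python
# Hypothetical shortlist of amenity-relevant display names; the real
# selection was a manual review of all 600 boxable classes.
AMENITY_CLASSES = {"Oven", "Refrigerator", "Bathtub", "Towel", "Pillow"}

def select_amenity_classes(rows):
    """Map Open Images label IDs to display names for the classes we keep.

    `rows` are (label_id, display_name) pairs, as found in the dataset's
    class-descriptions file. Everything outside the shortlist is dropped.
    """
    return {label_id: name for label_id, name in rows if name in AMENITY_CLASSES}

# Illustrative rows (the MIDs here are made up for the example)
rows = [("/m/aaaa1", "Drawer"), ("/m/bbbb2", "Oven"), ("/m/cccc3", "Refrigerator")]
taxonomy = select_amenity_classes(rows)
# keeps only the Oven and Refrigerator entries
```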

Building an Image Dataset

After the taxonomy was determined, the next step was to collect image data based on it. Open Images Dataset V4 had 14.6M BBs annotated across 1.7M images. Ideally, we could draw a large number of image samples from it, since our taxonomy was basically a subset of the complete 600 classes. However, as we dived deeper into the data, we found that the 600 object classes were highly imbalanced: some classes had millions of instances while others had only a few.

[Source] Class label distribution. To avoid over-plotting, the horizontal axis shows only every eighth label name. Note that the vertical axis of the histogram is in log scale: the counts of class instances on the right side are orders of magnitude smaller than those on the left.

The 40 classes of interest mostly fell on the minority (right) side of the class label distribution shown above. As a result, we ended up with only 100k object instances, annotated in about 50k images, just about 3% of the whole dataset. We had significantly overestimated the amount of data available!
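Measuring this imbalance is a simple per-class tally over the box annotations. The toy rows below are illustrative, not real Open Images data:

```python
from collections import Counter

def instances_per_class(box_annotations):
    """Count bounding-box instances per class.

    box_annotations: iterable of (image_id, class_label) rows, one row per
    box, mimicking the shape of the Open Images box-annotation CSV.
    """
    return Counter(label for _, label in box_annotations)

# Toy annotations showing the pattern we observed: a majority class with
# many boxes, a minority (amenity) class with very few.
boxes = [("img1", "Person")] * 6 + [("img2", "Oven"), ("img3", "Oven")]
counts = instances_per_class(boxes)
# counts["Person"] == 6, counts["Oven"] == 2
```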

Modern object detection models are almost exclusively based on deep learning, which means they need a lot of training data to perform well. A general rule of thumb is that a few thousand image samples per class can lead to decent model performance. 50k images annotated with 40 object classes implies about 1.2k images per class on average, which was adequate but not great. Therefore, we decided to add some in-house data and fuse it with the public data. To make sure the internal dataset included rich, diverse, and evenly distributed amenity classes, we sampled 10k images for each of the major indoor room types, and an additional 1k images for each outdoor scene category.

Creating Annotations

Many vendors provide annotation services for object detection tasks. The basic workflow is that customers provide a labeling instruction and the raw data; the vendor annotates the data based on the labeling instruction and returns the annotations. A good labeling instruction makes the process move smoothly and yields high-quality annotations, so it is extremely important. Writing a thorough one in a single pass is usually impossible, especially if you are doing this for the first time, so be prepared to iterate.

In this project we chose Google's data labeling service, which had three things we really liked: 1) support for up to 100 object classes per labeling job, 2) a clean UI where we could monitor the progress of the labeling job, and 3) a steady stream of feedback and questions sent to us as the labeling work moved forward. As a result, we were able to clarify vague instructions and address edge cases throughout the process.

We started with small annotation batches to validate our instructions and taxonomy. In these small-batch results, we found some amenities were ubiquitous and therefore less useful, so we took them out and refined our taxonomy from 40 classes down to 30. Afterward, we had our data completely annotated in about two weeks.

Flowchart of data preparation. Constantly iterate from taxonomy to data annotation.
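One way to operationalize the "ubiquitous and less useful" check from the small batches is a frequency cutoff over per-image labels. The 0.8 cutoff and the class names below are hypothetical; our actual decision was made qualitatively.

```python
from collections import Counter

def ubiquitous_classes(image_labels, cutoff=0.8):
    """Return classes present in more than `cutoff` of annotated images.

    image_labels: dict of image_id -> set of class labels seen in the image.
    A class appearing in nearly every photo carries little signal about what
    makes a listing distinctive, so it is a candidate for removal.
    """
    n_images = len(image_labels)
    freq = Counter(label for labels in image_labels.values() for label in labels)
    return {label for label, count in freq.items() if count / n_images > cutoff}

# A toy small batch: "Ceiling" appears in every image and gets flagged.
batch = {
    "img1": {"Ceiling", "Oven"},
    "img2": {"Ceiling"},
    "img3": {"Ceiling", "Refrigerator"},
    "img4": {"Ceiling"},
}
flagged = ubiquitous_classes(batch)
```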

Model Training

Combining the 43k labeled internal images with the 32k public images, we ended up with 75k images annotated with 30 customized amenity classes. Now it was time to actually build the model!

We tried two paths to building the model. One was to leverage the Tensorflow Object Detection API: creating TFRecord files from the annotated image data, using the provided training scripts to kick off the training, and running Tensorboard to monitor training progress. There are many online tutorials on how to do that, so we will skip most details here and only cite our favorite one.
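As a sketch of the data conversion step, the feature keys and the normalized [0, 1] box coordinates below follow the TF Object Detection API's TFRecord convention; the actual `tf.train.Example` serialization and file writing are omitted to keep the example dependency-free.

```python
def build_example_features(image_bytes, width, height, boxes):
    """Assemble the per-image feature dict that would be serialized into a
    tf.train.Example for the TF Object Detection API.

    boxes: list of (label, xmin, ymin, xmax, ymax) in pixel coordinates;
    the API expects box coordinates normalized by image width/height.
    """
    return {
        "image/encoded": image_bytes,
        "image/width": width,
        "image/height": height,
        "image/object/class/text": [label for label, *_ in boxes],
        "image/object/bbox/xmin": [x0 / width for _, x0, _, _, _ in boxes],
        "image/object/bbox/ymin": [y0 / height for _, _, y0, _, _ in boxes],
        "image/object/bbox/xmax": [x1 / width for _, _, _, x1, _ in boxes],
        "image/object/bbox/ymax": [y1 / height for _, _, _, _, y1 in boxes],
    }

# One 200x100 image with a single annotated box
features = build_example_features(b"<jpeg bytes>", 200, 100,
                                  [("Oven", 20, 10, 100, 50)])
```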

In particular, we chose two pre-trained models for fine-tuning: one fast but less accurate, the other slower but more accurate. To set up a benchmark, we tested the accuracy of both pre-trained models on 10% held-out data (7.5k images with 30 object classes) before fine-tuning. We used mean Average Precision (mAP) as the metric, which is standard for evaluating object detection models. It measures the average precision (the area under a precision-recall curve) of a model across all object classes, and ranges between 0 and 1. More details are explained here.
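To make the metric concrete, here is a minimal, interpolation-free AP computation; real evaluation protocols (e.g. Pascal VOC or COCO style) use interpolated precision, so their numbers can differ slightly.

```python
def average_precision(scored_detections, num_ground_truth):
    """Approximate AP for one class as the area under its PR curve.

    scored_detections: list of (confidence, is_true_positive) pairs.
    Detections are swept from highest to lowest confidence, accumulating
    precision * recall-increment rectangles under the curve.
    """
    tp = fp = 0
    ap = prev_recall = 0.0
    for _, is_tp in sorted(scored_detections, reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += precision * (recall - prev_recall)  # rectangle under PR curve
        prev_recall = recall
    return ap

def mean_average_precision(per_class_aps):
    """mAP is simply the unweighted mean of per-class APs."""
    return sum(per_class_aps) / len(per_class_aps)
```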

The faster model achieved an mAP of 14%, and the more accurate one an mAP of 27%. A careful reader may notice that our benchmark results for these two pre-trained models were much lower than the numbers reported on the Model Zoo website (36% and 54%, respectively). This was not a mistake: our test set had only 30 classes, all of which were minority classes in the dataset the pre-trained models were trained on. The degradation in accuracy was due to a shift in class distribution between the training and test sets.

To start training on our dataset, we froze the parameters in the feature extraction layers and made only the fully connected layers trainable. Since we were doing transfer learning, we lowered the initial learning rates to 10% of their default values. The rationale was that we did not want a gradient update so large that it would "destroy" what had already been learned in the pre-trained model weights. We also decreased the number of training steps from 10M to 1M and scaled the corresponding decay parameters in the learning rate schedule. In terms of computing resources, an AWS instance with a single Tesla K80 GPU was used for the training job.
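In the TF Object Detection API, these knobs live in the pipeline config. A hedged sketch follows: the field names match the API's pipeline.proto conventions, but the checkpoint path, the variable-name pattern, and all numeric values here are assumed for illustration (defaults vary per model architecture).

```proto
train_config {
  fine_tune_checkpoint: "path/to/pretrained/model.ckpt"
  freeze_variables: "FeatureExtractor"     # keep feature-extraction layers fixed
  num_steps: 1000000                       # reduced from the 10M default
  optimizer {
    momentum_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0004    # 10% of an assumed 0.004 default
          decay_steps: 80000               # decay schedule scaled to match
          decay_factor: 0.95
        }
      }
    }
  }
}
```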

When training the faster model, the loss function decreased very quickly in the beginning. After 100k steps (5 days), however, the improvement became marginal and the loss began to oscillate. Unsure whether continuing would still help, and with progress so slow anyway, we stopped the training there. The model accuracy (mAP) had increased from 14% to 20%.

When training the more accurate model, the loss function started very small but immediately began to behave wildly. We were not able to improve the mAP of this model at all.

We estimated that an mAP of at least 50% was needed for a minimum viable product, and there was obviously still a big gap. By this point we had spent a lot of time on model training. We hypothesized that the loss function was probably stuck at a local minimum, and that we would need some numerical tricks to jump out of it. The diagnosis would be quite involved, and switching to another model architecture that was easier to retrain was definitely an option too. We decided to leave off there and planned to revisit the problem in the future.

Another path to building the model was through an automated self-service tool: we tried Google AutoML Vision. Surprisingly, the results were very impressive. Just by uploading the 75k annotated images and clicking a few buttons, we were able to train an object detection model in 3 days. (We opted for higher accuracy in the self-service menu, so the training took longer than usual.)

Model Evaluation

We chose the model trained by AutoML. It achieved an mAP of about 68% in our offline evaluation on 10% held-out data (7.5k images), significantly higher than all the metrics we had seen so far. Certain classes performed particularly well, achieving 90%+ average precision, while others performed much worse. We found that the average precision of each object class was strongly correlated with its prevalence in the training data, so increasing the number of training samples for the minority classes would likely improve their performance a lot.

Breakdown of average precision for different object classes.
Precision-recall curve of predictions for “Bed.”

In our offline evaluation, we also found that mAP was quite sensitive to the training-test split: a different split, due to simple statistical randomness, could lead to a 2–3% drift. The instability of mAP came mainly from minority classes, where the sample size was very small.

Model Deployment and Online Serving

Model deployment on AutoML was also extremely easy, requiring only one click. After deployment, the model became an online service that could be used through a REST API or a few lines of Python code. Queries-per-second (QPS) capacity varies with the number of node hours deployed. A big downside of AutoML, though, was that we could not download the trained model itself. This was a potential problem, and we decided to revisit it in the future.
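For illustration, a few lines of Python can flatten a detection response into usable tuples. The payload shape below (a `payload` list carrying a `displayName`, a detection `score`, and a normalized bounding box) mirrors what AutoML Vision's object detection API returned at the time, but treat it as an assumption rather than a current API contract.

```python
def parse_detections(response, min_score=0.5):
    """Extract (label, score, bounding_box) tuples above a score threshold."""
    results = []
    for item in response.get("payload", []):
        detection = item["imageObjectDetection"]
        if detection["score"] >= min_score:
            results.append((item["displayName"], detection["score"],
                            detection["boundingBox"]))
    return results

# A hand-written sample response in the assumed shape
sample_response = {
    "payload": [
        {"displayName": "Oven",
         "imageObjectDetection": {
             "score": 0.92,
             "boundingBox": {"normalizedVertices": [{"x": 0.1, "y": 0.2},
                                                    {"x": 0.5, "y": 0.9}]}}},
        {"displayName": "Towel",
         "imageObjectDetection": {
             "score": 0.31,
             "boundingBox": {"normalizedVertices": [{"x": 0.6, "y": 0.1},
                                                    {"x": 0.7, "y": 0.3}]}}},
    ]
}
detections = parse_detections(sample_response)
# only the 0.92-confidence Oven survives the 0.5 threshold
```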

Finally, we want to demonstrate the performance of our model with a few more concrete examples, comparing our customized model against an industry-leading third-party API service. Please note that our taxonomy includes only 30 amenity classes, while the API service includes hundreds.

Our model is able to detect pillows, which are important for listing review in our Plus business, though it mistakenly categorizes a vase as a towel. The 3rd-party model predicts vague concepts such as furniture.
Our model is able to detect Sink, Towel, Toilet, and Shower Area; the 3rd-party model misses most of them. Bathroom cabinet is not included in our taxonomy, as we did not think it was useful enough at this point.
Our model reliably predicts amenities defined in our taxonomy. Note that the towel in the mirror is not detected, and this is on purpose: we did not want to double-count mirrored amenities in our labeling process, and the model successfully learned our rule!
Our model gives a very comprehensive list of amenities in this studio, while the 3rd-party model again predicts vague concepts. Chair and table are not in our taxonomy because we did not think they provide important information.
Our model is able to detect many amenities in such a small area of the kitchen. Note that the 3rd-party model made a false detection on the utensil caddy.
Our model detects the billiard table, mirror, TV, and couch very confidently, while the 3rd-party model misses many of the key amenities.
Treehouses are an iconic Home type on the Airbnb platform. Even with minimal data, we are able to detect them in certain cases. Note that the confidence score is pretty low, meaning there is still room for improvement in this category. The 3rd-party model predicts only House, which is not very informative for our use case.
Our model is able to detect a swimming pool with extremely high confidence. The 3rd-party model again detects only House, which is not very helpful for us.

As one can see, our model is considerably more specific and provides much wider coverage for these 30 object classes. A more quantitative comparison between our model and the third-party model requires some careful thought. First, our taxonomy differs from the third party's; when calculating mAP, we should include only the intersection of the two class sets. Second, the third-party API shows only partial results: any prediction with a confidence score below 0.5 is filtered out and not observable by us. This truncates the right side of their precision-recall curve, where recall is high (and the threshold is low), and thus lowers their mAP. To make a fair comparison, we truncated our results too, removing detections with scores below 0.5. After this treatment, the "truncated" mAP was 46% for our model and 18% for theirs. It is really encouraging to see our model significantly outperform a model from an industry-leading vendor. The comparison also demonstrates how important domain-specific data is in the world of computer vision.

A fair comparison between the Airbnb customized model and the 3rd-party generic model, using "truncated" mAP. Note that many categories in the 3rd-party model simply have zero AP, because those classes are underrepresented in their training data.
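The two fairness adjustments above, intersecting the taxonomies and truncating at the 0.5 confidence score, amount to a pre-filter applied to both models' detections before computing mAP. The labels and scores below are illustrative:

```python
def comparable_detections(detections, shared_classes, score_threshold=0.5):
    """Keep only detections of classes both models support, with confidence
    at or above the third-party API's visibility threshold, so mAP covers
    the same label space and the same observable part of the PR curve.
    """
    return [d for d in detections
            if d["label"] in shared_classes and d["score"] >= score_threshold]

ours = [{"label": "Oven", "score": 0.9},
        {"label": "Oven", "score": 0.4},          # below the API's cutoff
        {"label": "Swimming pool", "score": 0.8}]  # not in the shared set
shared = {"Oven"}                                  # class-set intersection
fair = comparable_detections(ours, shared)
# only the 0.9-confidence Oven remains
```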

Broad-scope Object Detection

In addition to Amenity Detection, Object Detection with a broader scope is another important area that we are investing in. For example, Airbnb leverages many third-party online platforms to advertise listings. To serve legitimate ads, we need to make sure the displayed listing photos do not pose excessive privacy or safety risks to our community. Using Broad-scope Object Detection, we are able to perform the necessary content moderation to prevent things like weapons or large human faces from being exposed without protection. We are currently using Google's Vision service to power this. We are also building a configurable detection system called Telescope, which can take action on images containing additional risky objects when necessary.

Image Quality Control

Another growing need in our business is to use AI to assist our process of quality control for listing images. As an early adopter, we leveraged a new technology that Google is working on, and designed a better set of catalog selection criteria for both re-marketing and prospecting campaigns.

The technology supports two models: One predicts an aesthetic score, and the other predicts a technical score. The aesthetic score evaluates the aesthetic appeal of the content, color, picturing angle, sociality, etc., while the technical score evaluates noise, blurriness, exposure, color correctness, etc. In practice, they complement each other very well. For example, the image below was once in our advertisement catalog. Now that we understand that it has a high technical score but a poor aesthetic score, we can comfortably replace it with a more attractive yet still informative image.

A sample listing photo with low aesthetic score for display ads.
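Since the two scores complement each other, a catalog filter can require both to pass. The thresholds below are hypothetical, for illustration only:

```python
def keep_for_catalog(technical_score, aesthetic_score,
                     technical_min=0.6, aesthetic_min=0.5):
    """A photo qualifies for the ads catalog only if it is both technically
    sound (sharp, well exposed) and aesthetically appealing; either failing
    score disqualifies it."""
    return technical_score >= technical_min and aesthetic_score >= aesthetic_min

# A sharp but unattractive photo is now excluded from the catalog.
```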

In addition, the image quality assessment scores are also tested in our listing recommendation model. Offline results show that they are in the top-3 most important features of the model. We are actively tracking their impact in online experiments and plan to extend the applications if results are positive.

To conclude this blog post, we’d like to share a few key lessons learned in our journey of applying Computer Vision technologies to Airbnb’s business.

  1. In the era of deep learning, data becomes much more important than the model. To solve a problem as a data scientist, you will probably spend 90% of your time collecting and parsing big chunks of data.
  2. Be creative when gathering data, and don’t reinvent the wheel. Leverage public data from the open source community when possible, and integrate it with your private data if necessary.
  3. Until a breakthrough happens in unsupervised learning algorithms, getting high-quality labels for your data is almost always the most critical step for a supervised model. Having your data labeled is often the most time-consuming process as well, because there may be lots of coordination effort between organizations. Plan early and choose your annotation vendor wisely.
  4. Using a good machine learning tool can significantly speed up your model training and deployment, which makes it faster to deliver your model as a service.
  5. Be open minded. Don’t be afraid to start with a simple solution, even if it’s just a generic third-party API. It may not solve your business problem immediately, but will likely lead to a successful solution sometime later.

Computer Vision applications such as Amenity Detection, Broad-scope Object Detection, and Image Quality Control help Airbnb become a smarter and safer home-sharing platform for our hosts and guests. We hope that these technologies will make our business more efficient and unlock its full potential in the future.
