Creating the “Superclass”: Improving object detection at Wayfair via product class clustering


Executive Summary

Object detection is an important component of the computer vision workflow at Wayfair. The goal of object detection is to localize the objects in an image with bounding boxes and assign each object a class label (Figure 1), which powers many downstream applications. In this post, we share a novel method we use at Wayfair to improve the performance of our object detection model by making better use of our training data. The underlying idea can be applied to other object detection tasks as well.

 

Our Challenge: Insufficient Training Data

Training an object detection model requires manually annotated training data, including bounding box coordinates and class information. The classes in the training data can be divided into majority classes and minority classes. In our case, a minority class is defined as one with fewer than X (on the order of 1000) training examples.

Obtaining more training data for the minority classes is sometimes impractical and/or expensive. As we developed the baseline model for object detection, we simply left out the minority classes and trained with only the majority classes. The result was a model with relatively low class-coverage. The challenge naturally became: can we improve the class-coverage by making better use of the existing training data?

 

Our Solution: The Concept of Superclass

To answer the question above, we came up with a method centered around the concept of a “superclass.” A superclass is defined as a combination of several classes that are visually similar. For example, “sofa” and “loveseat” are two different classes that resemble one another. We found that there was a sufficient amount of training data for “sofa,” but not for “loveseat.” Hence, we grouped “sofa” and “loveseat” into a new class (superclass) and created a new class label. In practice, we had X (on the order of 1000) classes to merge into Y (< X) superclasses, which necessitated a scalable machine learning approach.

 

How to Create a Superclass

Here is a high-level overview of how we created the superclasses: First, we created feature vectors to represent all classes and ensured that the feature vectors of similar-looking classes stay close to each other in the feature space. We then clustered these vectors into N groups, with each group representing a superclass. We also used a quantitative method to determine the best N among various clustering results. The following sections describe these steps in detail.

 

Step 1: Create embedding vectors

As mentioned above, the first step was to create feature vectors to represent all classes. To that end, we extracted the embedding vectors from a product-class classifier. The original purpose of the classifier was to detect the image class (for example, whether the image contains a sofa or lamp). Our hypothesis was that the classifier had already learned to extract proper visual features, and we could repurpose it as a feature extractor. Similar looking images would have similar feature vectors created by this feature extractor.

 

Figure 1. Left: the product classifier provides probabilities for classes based on extracted visual features; Right: feature extraction: the feature vector is taken from the last layer before the fully connected layer that produces those probabilities.

It is worth mentioning that this classifier was trained on 224 classes, far fewer than the total number of classes (more than 1000) in our problem. The question became: would the feature extraction capability extend to all classes? We assumed it would, on the grounds that all classes belong to the same domain (furniture and home appliances). In other words, we assumed that the classifier had learned enough from these 224 classes to generalize to other classes in the same domain.
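As a concrete illustration of this step, the sketch below uses a pretrained ResNet-50 from torchvision as a stand-in for our product classifier (the actual model, its weights, and its 224-class head are internal) and takes the output of the layer just before the fully connected classification head as the embedding.

```python
# Minimal sketch of Step 1, assuming a torchvision ResNet-50 as a stand-in for the
# internal product classifier; only the "classifier as feature extractor" idea is real.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# Load a pretrained classifier and drop its final fully connected layer,
# keeping everything up to the global-average-pooled feature vector.
classifier = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(classifier.children())[:-1])
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(image_path: str) -> torch.Tensor:
    """Return the penultimate-layer embedding (2048-d for ResNet-50) for one image."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features = feature_extractor(image)   # shape: (1, 2048, 1, 1)
    return features.flatten(1).squeeze(0)     # shape: (2048,)
```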

 

Step 2: Create class-level vector representation

Following the creation of feature vectors for all images, we needed to condense the data further by creating class-level representations, meaning there would be exactly one feature vector per class. The idea was to average all vectors that belonged to a certain class and use the result as the representative feature vector of that class. In practice, there were many similar images for certain classes and there was no need to use all of them. We therefore applied a thresholded sampling mechanism: for each class, we kept randomly sampling images until hitting the threshold M (1000 in our case). As a result, for classes with more than M images, only M images were sampled from the population; for classes with fewer than M images, we used the entire population (see the sketch below).
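A minimal sketch of this sampling-and-averaging step follows, assuming embed() is the feature extractor from the Step 1 sketch and images_by_class is a hypothetical mapping from class name to a list of image paths.

```python
# Minimal sketch of Step 2: sample at most M images per class and average their
# embeddings. images_by_class is a hypothetical {class_name: [image_path, ...]} dict;
# embed() comes from the Step 1 sketch above.
import random
import numpy as np

M = 1000  # per-class sampling threshold

def class_level_vector(image_paths, m=M, seed=0):
    """Average the embeddings of at most m randomly sampled images."""
    rng = random.Random(seed)
    sampled = image_paths if len(image_paths) <= m else rng.sample(image_paths, m)
    embeddings = np.stack([embed(path).numpy() for path in sampled])
    return embeddings.mean(axis=0)

# One representative vector per class.
class_vectors = {cls: class_level_vector(paths) for cls, paths in images_by_class.items()}
```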

 

Figure 2. Image sampling with a threshold of 1000

Once we had the sampled images, we passed them through the feature extractor to obtain the feature vectors. As mentioned above, each class had multiple feature vectors. Before averaging them to obtain the class-level feature vector, we needed to ensure that for most classes, the majority of their feature vectors were close to the average vector without getting too close to the average vectors of other classes.

 

Figure 3. Illustration of inter-class distance vs. intra-class distance in feature space.

We used cosine distance as the distance metric and compared intra-class and inter-class distances: the intra-class distances were distributed around 0.15, which was, on average, smaller than the inter-class distances. This supported the validity of the per-class vector averaging method.
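The check can be sketched as below, where per_image_embeddings is a hypothetical mapping from class name to that class's sampled image embeddings and class_vectors comes from the Step 2 sketch.

```python
# Minimal sketch of the distance sanity check: compare intra-class cosine distances
# (each image embedding vs. its class average) with pairwise inter-class distances
# (between class-average vectors). per_image_embeddings is a hypothetical
# {class_name: np.ndarray of shape (n_images, dim)} dict.
import numpy as np
from scipy.spatial.distance import cosine, pdist

intra = [
    cosine(vec, class_vectors[cls])
    for cls, embeddings in per_image_embeddings.items()
    for vec in embeddings
]
inter = pdist(np.stack(list(class_vectors.values())), metric="cosine")

print(f"mean intra-class distance: {np.mean(intra):.3f}")
print(f"mean inter-class distance: {np.mean(inter):.3f}")
```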

 

Figure 4. Intra-class cosine distance distribution (left) and pairwise inter-class cosine distance distribution (right)

 

Step 3: Perform the clustering algorithm

Referring back to the general steps above, once we had the class-level feature vectors, we applied clustering to generate the superclasses. We chose Hierarchical Agglomerative Clustering (HAC) as implemented in sklearn, in part because HAC is deterministic (except in the case of ties) and thus ensures reproducibility.

Since the clustering was unsupervised in nature, there was a need to determine the best number of clusters N. We used the concept of “silhouette” to help make the decision. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.

We ran the clustering algorithm multiple times, each time with a different number of clusters, and calculated the corresponding silhouette value. We chose the number of clusters that yielded the highest silhouette value (see the sketch below).
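A minimal sketch of this sweep follows. It assumes class_vectors from the Step 2 sketch; the cosine distance with average linkage and the candidate range of cluster counts are assumptions for illustration only.

```python
# Minimal sketch of Step 3: sweep the number of clusters for Hierarchical
# Agglomerative Clustering and keep the count with the highest silhouette score.
# The cosine/average-linkage choice and the candidate range are assumptions.
# Requires scikit-learn >= 1.2 (older versions use affinity= instead of metric=).
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.stack(list(class_vectors.values()))  # one row per class

best_n, best_score, best_labels = None, -1.0, None
for n_clusters in range(50, 800, 10):
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average"
    ).fit_predict(X)
    score = silhouette_score(X, labels, metric="cosine")
    if score > best_score:
        best_n, best_score, best_labels = n_clusters, score, labels

print(f"best number of superclasses: {best_n} (silhouette = {best_score:.3f})")
```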

 

Figure 5. Silhouette score as a function of the number of clusters. In this run, we started with 1153 points (classes) and obtained the best number of clusters (highest silhouette score) of 470.

 

Step 4: Iterations on clustering

After running the clustering algorithm above once, we had an initial set of superclasses. However, we had yet to achieve our ultimate goal: to ensure a minimum number of training examples for each superclass. In fact, some superclasses still had fewer training examples than required after the initial clustering.

Hence the iterations: we removed such superclasses (along with all their member classes) and re-ran the clustering. We then checked the number of training examples for each new superclass. If there were still superclasses without enough training examples, we repeated the process until every superclass had enough. In our case, we only needed to iterate twice before all superclasses had enough training examples (see the sketch below).
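The loop can be sketched as follows, where cluster_classes() stands in for the silhouette-based sweep above and train_example_counts is a hypothetical mapping from each original class to its number of annotated examples; the minimum of 1000 mirrors the threshold discussed earlier but is illustrative.

```python
# Minimal sketch of Step 4: drop superclasses that are still too small and re-cluster.
# cluster_classes() stands in for the Step 3 sweep (one label per class) and
# train_example_counts is a hypothetical {class_name: n_training_examples} dict.
MIN_EXAMPLES = 1000

def superclass_sizes(labels, class_names, train_example_counts):
    """Total number of training examples per superclass."""
    sizes = {}
    for cls, label in zip(class_names, labels):
        sizes[label] = sizes.get(label, 0) + train_example_counts[cls]
    return sizes

class_names = list(class_vectors.keys())
while True:
    labels = cluster_classes(class_names)  # one superclass label per remaining class
    sizes = superclass_sizes(labels, class_names, train_example_counts)
    small = {label for label, size in sizes.items() if size < MIN_EXAMPLES}
    if not small:
        break  # every superclass has enough training examples
    # Remove all member classes of under-populated superclasses, then re-cluster.
    class_names = [cls for cls, label in zip(class_names, labels) if label not in small]
```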

 

Figure 6. Left: 1st clustering; Right: 2nd clustering after removing superclasses (from 1st clustering result) that still did not have enough examples.

At this point, we had created the superclasses that could be used to train a new object detection model. In the section below, we show example superclasses and comment on some interesting results.

 

Example superclasses

 

Superclass #86: {Door Levers, Door Knobs, Cabinet and Drawer Knobs}

This superclass works out of the box: all its member classes share similar looks.

 

Superclass #4: {Classroom Chairs, Stacking Chairs, Dining Chairs, Folding Chairs, Kids Chairs, Slipcovers*}

This superclass groups different types of chairs together. It also includes slipcovers, which, at first glance, do not belong to the group. However, if we compare an image of a slipcover with that of a chair, it is not hard to see why they were included: slipcovers are usually photographed fitted over a chair. So from a purely visual similarity perspective, slipcovers are similar to chairs.

Figure 7. Left: Slipcover and Chair; Right: Patio Chaise Lounge and Lawn Mower. Notice the visual similarity between the pair in both scenarios.

 

Superclass #31: {Patio Chaise Lounges, Lawn Mowers**, Outdoor Sun Lounges, Lawn and Beach Chairs, Patio Lounge Chairs}

Similarly, lawn mowers are grouped together with the lounge chairs because their images share similar visual features: both have a seat and are typically photographed against a green background.

In summary, most superclasses make sense. Some superclasses contain seemingly very different classes (the outliers) because the method was based purely on visual similarity, without taking into account other factors such as functionality. For example, lawn mowers should not be grouped with lounge chairs, as they do not serve similar functions. Since functional information usually is not fully represented by the image data, we enlisted domain experts to modify the superclasses (by removing outliers, for example) to ensure consistency between visual similarity and functional similarity. This joint effort produced a final set of superclasses that is reasonable to both humans (from a functional perspective) and machines (from a visual perspective).

Recall that the original goal of creating the superclasses was to improve object detection performance. The section below details our quantitative evaluation, comparing the new model trained with the superclass methodology against the previous model trained without it.

 

Model evaluation and impact

After incorporating the concept of superclass into model training, we were able to train a new object detection model. We evaluated the new model based on two metrics: class-coverage and recall rate.

Class-coverage is defined as the percentage of the revenue (or volume) of the covered classes over the revenue (or volume) of all classes. The higher the class-coverage, the more types of objects the model is able to detect in an image. The new model covered 419 classes compared to the previous 80. As a result, coverage increased significantly for both revenue and volume: revenue coverage increased by 24 percentage points, from 68% to 92%, while volume coverage increased by 17 percentage points, from 75% to 92%.
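As a small illustration, class-coverage can be computed as the covered share of total revenue (or volume); revenue_by_class and covered_classes below are hypothetical placeholders for internal catalog data.

```python
# Minimal sketch of the class-coverage metric. revenue_by_class is a hypothetical
# {class_name: revenue} dict; the same function works for volume counts.
def class_coverage(revenue_by_class, covered_classes):
    covered = sum(rev for cls, rev in revenue_by_class.items() if cls in covered_classes)
    return covered / sum(revenue_by_class.values())
```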

We also measured how many ground truth bounding boxes (as tagged by human annotators) the model is able to detect. This is the idea behind the recall rate: for each ground truth bounding box, check whether there is a predicted bounding box of the correct label with sufficient intersection over union (IoU, 0.5 in our case). A higher recall rate indicates a better capability of the model to provide useful information that can be leveraged downstream. We ran the models on around 20,000 test images. The recall rate increased from 60.5% to 65.2%.
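The matching rule can be sketched as below for a single image's boxes, with each box represented as a hypothetical (x1, y1, x2, y2, label) tuple; this is an illustration of the recall definition above, not our evaluation code.

```python
# Minimal sketch of the recall computation for one image: a ground-truth box counts
# as recalled if some predicted box with the same label overlaps it with IoU >= 0.5.
# Boxes are hypothetical (x1, y1, x2, y2, label) tuples.
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recall(ground_truth, predictions, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by a same-label prediction."""
    matched = sum(
        any(p[4] == gt[4] and iou(p[:4], gt[:4]) >= iou_threshold for p in predictions)
        for gt in ground_truth
    )
    return matched / len(ground_truth)
```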

 

Table 1. Comparison between the prior model and the new model

Metric                      Prior model    New model
Classes covered             80             419
Revenue coverage            68%            92%
Volume coverage             75%            92%
Recall rate (IoU ≥ 0.5)     60.5%          65.2%

 

The business impact of the new model is significant: the 24-percentage-point increase in revenue coverage translates to more products (on the order of billions of dollars in revenue) covered by our new model, which substantially expands downstream business opportunities.

Additionally, the 17-percentage-point increase in volume coverage means that more images (on the order of millions) can now be processed by our model instead of by human annotators. This greatly increases the speed of inference and reduces the cost associated with manual tagging.

 

Future work

Looking forward, our quest to improve model performance is never-ending, and we have two new goals. First, we plan to further expand the training data guided by per-superclass performance: the model performs better on some superclasses than others, and we will create more training data for the superclasses that are underperforming. Second, the superclasses were built from the U.S. data set and hence work best with North American products. With Wayfair's rapid expansion in Europe, there are product classes unique to the E.U. catalog. We plan to apply the same methodology to create superclasses for the E.U. that include such classes.


