Last year, Wayfair launched Visual Search, a new way to find products on our website. Users can now upload photos of furniture they like and find visually similar matches in an instant. If you need a refresher on this technology, see our original blog post.
We’ve learned a lot over the past year and have made significant enhancements, including re-architecting the backend for scale, removing friction for users on the frontend, and improving the accuracy of our models.
Object Detection gives us the ability to locate and classify objects of interest within an image, and it is now integrated into our Visual Search feature to streamline the user experience. In previous iterations of Visual Search, a user needed to draw an accurate crop-area around an object to get the best results. Our new model automatically detects objects and preselects a crop-area for search as soon as an image is uploaded. The user can then switch between any of the detected objects or create a new crop-area.
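As a rough illustration of the preselection step, the sketch below filters detections by confidence and picks the highest-scoring one as the default crop-area. The `Detection` type, the `min_score` threshold, and the highest-score rule are all assumptions made for this example, not a description of our production logic.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure for this sketch)."""
    label: str
    score: float                              # model confidence in [0, 1]
    box: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)


def candidate_crops(detections: List[Detection],
                    min_score: float = 0.5) -> List[Detection]:
    """Keep confident detections, most confident first."""
    kept = [d for d in detections if d.score >= min_score]
    return sorted(kept, key=lambda d: d.score, reverse=True)


def preselect(detections: List[Detection],
              min_score: float = 0.5) -> Optional[Detection]:
    """Pick the default crop-area, or None so the user draws one manually."""
    crops = candidate_crops(detections, min_score)
    return crops[0] if crops else None


detections = [
    Detection("lamp", 0.62, (300, 20, 360, 140)),
    Detection("sofa", 0.97, (10, 120, 420, 330)),
    Detection("rug", 0.31, (0, 300, 480, 400)),
]
default_crop = preselect(detections)
```

Here the remaining entries of `candidate_crops(detections)` would be the other objects the user can switch to, and a `None` result falls back to a manually drawn crop-area.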
Today’s state-of-the-art solutions involve convolutional neural networks and come in a variety of architectures. After experimenting with Single Shot Multibox Detector (SSD) and Faster R-CNN, we decided to go with the latter for its higher accuracy.
Tensorflow’s Object Detection API allowed us to train a model with ease: it offers a diverse model zoo and considerable flexibility in configuring a network’s structure. We trained on a set of 130,000 images containing 557,000 tagged objects across 37 unique classes, which was available to us thanks to an internal effort to tag the products in all of our imagery. For transfer learning, we started from Faster R-CNN Inception Resnet V2 weights pretrained on the COCO dataset. After 25 days of training, our model achieved a mean average precision of 54% and a recall of 91% at an intersection over union (IoU) threshold of 0.5. Qualitative evaluation showed that the model detected a majority of the objects and produced few false positives.
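For readers unfamiliar with the metric, intersection over union measures how well a predicted box overlaps a ground-truth box: the area of their overlap divided by the area of their union. The sketch below is a minimal, self-contained illustration; the `Box` type and the `is_match` helper are names invented for this example, and a full evaluation would also require the predicted class to match the ground-truth class.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical type for this sketch)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in [0, 1]."""
    overlap = Box(max(a.x_min, b.x_min), max(a.y_min, b.y_min),
                  min(a.x_max, b.x_max), min(a.y_max, b.y_max))
    overlap_area = area(overlap)
    union_area = area(a) + area(b) - overlap_area
    return overlap_area / union_area if union_area > 0 else 0.0


def is_match(predicted: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as a hit when its IoU reaches the threshold."""
    return iou(predicted, truth) >= threshold


truth = Box(0, 0, 10, 10)
shifted = Box(5, 0, 15, 10)    # overlaps half of the ground-truth box
tight = Box(0, 0, 10, 8)       # slightly too short
```

With these boxes, `shifted` has an IoU of 50 / 150 and misses the 0.5 threshold, while `tight` has an IoU of 80 / 100 and counts as a hit.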
Tensorflow, GPUs, and Search
Visual Search requires our models to be accessible through a web service to process query images. Initially, the visual similarity model ran within a Flask microservice using Theano on CPUs. This architecture was simple, but we wanted to improve the performance and modularity of the platform. To accomplish this, we separated the computer vision models and the search logic into independently scalable applications.
Switching to Tensorflow opened the door for us to use Tensorflow Serving as the model application. This gRPC service loads the visual similarity and Object Detection models and makes them available to gRPC clients. We built Tensorflow Serving with GPU support and deployed it to a GPU cluster, which reduced the visual similarity model runtime from 500ms to 100ms per image. The Object Detection model currently runs at 600ms per image.
The Flask microservice handles image uploads, connects to Tensorflow Serving, and contains the search logic. For Object Detection, the service runs an uploaded image through the Object Detection model via Tensorflow Serving and returns all predicted bounding boxes (i.e., potential crop-areas). For Visual Search, the service runs an uploaded cropped image through the visual similarity model, retrieves that image’s embedding and class probabilities, does a k-nearest-neighbors (kNN) search on a subset of the indexed images, and returns the closest matches (the original post covers this in more detail). This Flask application does not need to run on GPUs and can scale independently of the Tensorflow Serving GPU nodes.
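To make the search step concrete, here is a minimal brute-force sketch of a kNN search restricted to a subset of the index. It assumes the subset is chosen by keeping only images whose class is among the query’s most probable classes, and that similarity is Euclidean distance between embeddings; both are assumptions for illustration, as are the index layout and the names `top_classes` and `search`.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_classes(class_probs: Dict[str, float], n: int = 2) -> Set[str]:
    """The n classes the model considers most likely for the query image."""
    return set(heapq.nlargest(n, class_probs, key=class_probs.get))


def search(query_embedding: Sequence[float],
           class_probs: Dict[str, float],
           index: List[dict],
           k: int = 3,
           n_classes: int = 2) -> List[dict]:
    """Exact kNN over the indexed images that share a likely class."""
    allowed = top_classes(class_probs, n_classes)
    subset = [item for item in index if item["class"] in allowed]
    return heapq.nsmallest(
        k, subset,
        key=lambda item: euclidean(query_embedding, item["embedding"]))


# Toy index with 2-dimensional embeddings (real embeddings are much larger).
index = [
    {"sku": "sofa-1", "class": "sofa", "embedding": [0.9, 0.1]},
    {"sku": "sofa-2", "class": "sofa", "embedding": [0.5, 0.5]},
    {"sku": "lamp-1", "class": "lamp", "embedding": [0.88, 0.12]},
    {"sku": "chair-1", "class": "chair", "embedding": [0.7, 0.3]},
]
matches = search([0.85, 0.15],
                 {"sofa": 0.7, "chair": 0.2, "lamp": 0.1},
                 index, k=2)
```

In this toy data, `lamp-1` is the closest embedding overall but is excluded because "lamp" is not among the two most probable classes, which shows how class probabilities can narrow the search before any distances are compared.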
Besides the architectural changes, we reduced Visual Search response times further by changing the kNN search library. Initially, images were indexed in ball trees, but their performance degraded noticeably as the image count increased. After exploring several options, NMSLIB was the clear winner, dropping worst-case kNN search time from 500ms to 25ms. Instead of using a binary tree structure, NMSLIB builds a small-world graph in which nodes are images and edges connect similar images. Although NMSLIB’s kNN search results are only approximate, they were close enough to ground truth that there was no noticeable loss of quality.
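The idea behind graph-based search can be sketched in a few lines: starting from an entry node, repeatedly expand the closest unexplored node and keep a short list of the best candidates seen so far. The code below is a simplified illustration of best-first search over a neighbor graph, not NMSLIB’s actual implementation; the function names, the brute-force graph construction, and the `m`, `ef`, and `entry` parameters are assumptions for this example.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_graph(points: List[Sequence[float]], m: int = 2) -> Dict[int, Set[int]]:
    """Insert points one by one, linking each to its m nearest earlier points.

    Neighbors are found by brute force here to keep the sketch short.
    """
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(points))}
    for i in range(1, len(points)):
        nearest = heapq.nsmallest(
            m, range(i), key=lambda j: dist(points[i], points[j]))
        for j in nearest:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def graph_knn(points, graph, query, k=2, entry=0, ef=4) -> List[int]:
    """Approximate kNN: best-first search that tracks the ef best nodes seen."""
    visited = {entry}
    d0 = dist(query, points[entry])
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]        # max-heap (negated distances), size <= ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # nothing left can improve the result list
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(query, points[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst
    ranked = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ranked[:k]]


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
graph = build_graph(points, m=2)
neighbors = graph_knn(points, graph, query=(4.6, 0.0), k=2)
```

The search only computes distances to nodes reachable along edges from the entry point, which is why it can be much faster than scanning every image, and also why its results are approximate: a larger `ef` explores more of the graph at the cost of more distance computations.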
What’s Coming Next?
Many mobile devices are now capable of running some models without relying on external servers. With iOS’s CoreML and Android’s Tensorflow Lite APIs, we plan to implement real-time Object Detection computed on user devices in order to eliminate network latency and reduce the strain on our servers. We are currently training SSD models that will run efficiently on mobile CPUs.
Recommendations based on visual similarity are intuitive for users and avoid the cold start problem seen in collaborative filtering, since they need only a product’s image rather than a history of user interactions. We are currently generating image-based recommendations on product pages to surface similar-looking products, while we explore additional ways to use our model to power other features. In the future, we hope to let users discover similar product alternatives that vary by brand, price point, and other attributes. More updates to come!