When looking for a new piece of furniture to add to a room, customers generally consider two major points: the visual appearance of the new item, and its harmony with the existing furniture inside the room. Together, these two considerations define a customer’s style preferences. Wayfair’s vast product catalog offers our customers plenty of options so that they can find the perfect item, but sifting through millions of products individually would be a very time-consuming process. So, to help our customers find what they are looking for without sacrificing their valuable time, we built an end-to-end algorithm that can recommend a new piece of furniture given a room image. The algorithm considers both product-level similarity and room-level style information. As such, we named this model Harmonia, after the Greek goddess of harmony.
Wayfair’s Data Science team tackles a wide range of projects, including the development of powerful computer vision algorithms that help customers find the products they are looking for. One of the most powerful of these is a product recommendation algorithm we refer to as Visual Search (VS). VS defines a metric between two product images based on their visual similarity. We use it to understand the visual properties of a given product and to recommend the most visually similar items from Wayfair’s catalog. As such, it is very useful for helping a customer find an object that she likes subject to specific constraints such as price, availability, shipping date, and dimensions. However, VS does not consider any style or context information (e.g., the harmony of the recommended product with the products that she already owns), so it is less useful for suggesting new ideas to a customer based on her style preferences.
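As a rough illustration of similarity search under constraints, the sketch below ranks catalog items by cosine similarity to a query embedding after applying a price filter. It is a minimal sketch, not the VS implementation: the choice of cosine similarity, the function names, and the `sku`, `price`, and `embedding` fields are all assumptions made for this example, and the three-dimensional embeddings are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, catalog, max_price=None, top_k=3):
    """Return the SKUs of the top_k most similar items within the price limit.

    `catalog` is a list of dicts with hypothetical keys
    'sku', 'price', and 'embedding'.
    """
    candidates = [
        item for item in catalog
        if max_price is None or item["price"] <= max_price
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query, item["embedding"]),
        reverse=True,
    )
    return [item["sku"] for item in ranked[:top_k]]

# Toy catalog with made-up embeddings and prices.
catalog = [
    {"sku": "sofa-a", "price": 900.0, "embedding": [1.0, 0.0, 0.0]},
    {"sku": "sofa-b", "price": 450.0, "embedding": [0.9, 0.1, 0.0]},
    {"sku": "sofa-c", "price": 300.0, "embedding": [0.0, 1.0, 0.0]},
]

query = [1.0, 0.0, 0.0]
result = most_similar(query, catalog, max_price=500.0, top_k=2)
```

With the price limit in place, the exact visual match (`sofa-a`) is filtered out and the next most similar item ranks first, which is the kind of constrained lookup described above.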
Another algorithm that we developed at Wayfair is the Room Style Estimator (RoSE). The goal of this algorithm is to understand the stylistic preferences of our customers and to use this knowledge to communicate with them through room images. RoSE is a deep neural network model trained to classify scenes into different styles by considering the whole room image rather than focusing on individual items. RoSE enabled us to design a visually oriented onboarding quiz that shows a customer some room images and asks her to choose the ones that she likes. Then, using the algorithm, we suggest stylistically similar room images to inspire her with new room ideas and help her explore the associated products she likes. Unlike VS, RoSE does not use any product-level information; it only analyzes the context of the room from the whole image.
Both Visual Search and RoSE aim to help guide a customer to find the products she is looking for, proposing solutions from different perspectives (VS from the level of the product, and RoSE from the more expansive level of style). And both of these algorithms have achieved promising results for different types of use cases: the RoSE model has inspired customers to get more room ideas based on the room images they liked, and VS has provided a great platform for customers who are confident about what they want to buy and want to find exact matches or visually similar items to a query image.
However, customers who are at the exploration stage need to see a diverse set of products that match their existing room design and their style preferences. Based on these observations, we see great potential for improving our product recommendations by developing a better understanding of our customers’ style spectrums and of the contextual information in room images that RoSE captures (the harmony of products inside a room). Therefore, in Harmonia, we use the room images that our customers like or capture to analyze their style spectrums, and then define a mapping from RoSE embeddings to VS embeddings to provide product recommendations that consider both room style and product similarity.
To enable stylistic product recommendations as discussed above, we introduced an end-to-end deep neural network that takes a room image as input and outputs two vectors: (i) a probability distribution over room styles and (ii) a point in the visual search space used to query the recommended products. Due to the training regime used in the VS project, different object categories cluster in different regions of the VS space. Since Harmonia also maps to the VS space, we decided to focus on a single product category for each network and create different networks for different product recommenders (e.g., sofa recommender, area rug recommender, etc.). Figure 1 illustrates the sofa recommender. The network consists of three main parts:
- Base Network: Takes the room image as input and creates a high-dimensional embedding with both stylistic and product-level features.
- Room Style Classifier: Takes the room embedding as input and outputs a probability distribution over a predefined set of classes.
- Product Similarity Regressor: Takes the room embedding as input and outputs a point in the VS space (a 512-dimensional unit sphere).
In our final iteration, we used ResNet50 as the base network, a 3-layer fully connected neural network (FCNN) as the room style classifier, and a 2-layer FCNN as the product similarity regressor. Initial weights for ResNet50 were transferred from ImageNet training, and we initialized the other two networks with random weights.
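To make the two-output design concrete, here is a minimal sketch of the two heads operating on a room embedding. It does not reproduce the trained model: the base network is replaced by a random stand-in vector, and the layer sizes, ReLU activations, and the use of L2 normalization to land on the unit sphere are assumptions chosen for illustration (the real VS space is 512-dimensional, while the toy sizes here are much smaller).

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def random_layer(rng, n_in, n_out):
    """Randomly initialized (weights, bias) pair for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def style_head(embedding, layers):
    """Room style classifier: hidden layers with ReLU, softmax output."""
    h = embedding
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    return softmax(dense(h, *layers[-1]))

def product_head(embedding, layers):
    """Product similarity regressor: output projected onto the unit sphere."""
    h = embedding
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    return l2_normalize(dense(h, *layers[-1]))

rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_STYLES, VS_DIM = 16, 32, 5, 8  # toy sizes (assumed)

# Three dense layers for the style head, two for the product head.
style_layers = [
    random_layer(rng, EMBED_DIM, HIDDEN_DIM),
    random_layer(rng, HIDDEN_DIM, HIDDEN_DIM),
    random_layer(rng, HIDDEN_DIM, NUM_STYLES),
]
product_layers = [
    random_layer(rng, EMBED_DIM, HIDDEN_DIM),
    random_layer(rng, HIDDEN_DIM, VS_DIM),
]

# Stand-in for the base network's output on one room image.
room_embedding = [rng.uniform(0.0, 1.0) for _ in range(EMBED_DIM)]

style_probs = style_head(room_embedding, style_layers)
vs_point = product_head(room_embedding, product_layers)
```

Both heads read the same embedding, so anything the shared base network learns for style classification is also available to the product regressor, and vice versa.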
One of the biggest challenges during this project was the data collection process. Ideally, we would use room-to-style and room-to-product pairs for training data. However, the amount of data with both labels was very limited, so we decided to also leverage data that carries only one of the two labels. Overall, we created three different sets of training data: (i) the Complete Set, which includes both labels, (ii) the Style Set, which only includes room-to-style labels, and (iii) the Product Set, which only includes room-to-product labels. We used all three datasets to train the base network, but the room style network is trained with only the Complete Set and the Style Set, whereas the product similarity network is trained with only the Complete Set and the Product Set.
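One common way to train on partially labeled data like this is to mask each loss term by label availability, so that a sample only contributes to the heads it has labels for while the shared base network still receives a training signal from every sample. The sketch below shows that idea under stated assumptions: the cross-entropy and squared-distance losses, their equal weighting, and all function names are illustrative choices, not details given in this post.

```python
import math

def style_loss(style_probs, style_label):
    """Cross-entropy for one sample (assumed loss for the style head)."""
    return -math.log(style_probs[style_label])

def product_loss(vs_pred, vs_target):
    """Squared Euclidean distance in VS space (assumed loss for the product head)."""
    return sum((p - t) ** 2 for p, t in zip(vs_pred, vs_target))

def sample_loss(style_probs, vs_pred, style_label=None, vs_target=None):
    """Sum only the loss terms whose labels are present.

    Complete Set samples pass both labels, Style Set samples pass only
    style_label, and Product Set samples pass only vs_target.
    """
    loss = 0.0
    if style_label is not None:
        loss += style_loss(style_probs, style_label)
    if vs_target is not None:
        loss += product_loss(vs_pred, vs_target)
    return loss

# Toy predictions for a single room image.
probs = [0.5, 0.5]
pred = [1.0, 0.0]

complete = sample_loss(probs, pred, style_label=0, vs_target=[0.0, 1.0])
style_only = sample_loss(probs, pred, style_label=0)
product_only = sample_loss(probs, pred, vs_target=[0.0, 1.0])
```

Because a missing label contributes nothing to the loss, a Style Set sample never updates the product head and a Product Set sample never updates the style head, which matches the training split described above.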
For quantifying the model’s performance, we collected a triplet-based dataset for sofa recommendations. Each example consisted of a room image and two sofa recommendations. We asked experts the following question: “Which of the recommended two sofas would you prefer to use in this room?” and marked their choice as the positive example. See Figure 2 for examples from the triplet results. Since we also needed a baseline to compare against, we only used rendered room images that included sofas from Wayfair’s catalog. This let us build a VS baseline whose query point is the VS embedding of the sofa in the image.
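A triplet evaluation of this kind can be scored as sketched below. The post does not spell out how a model's preference between the two sofas is determined, so this sketch assumes the model "chooses" the candidate whose VS embedding lies closer to its query point. The function names and the two-dimensional toy triplets are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def model_choice(query_point, candidate_a, candidate_b):
    """Return 0 if candidate A is at least as close to the query point, else 1."""
    dist_a = squared_distance(query_point, candidate_a)
    dist_b = squared_distance(query_point, candidate_b)
    return 0 if dist_a <= dist_b else 1

def triplet_accuracy(triplets):
    """Fraction of triplets where the model agrees with the expert.

    Each triplet is (query_point, candidate_a, candidate_b, expert_choice),
    where expert_choice is 0 for candidate A or 1 for candidate B.
    """
    correct = sum(
        1 for query, cand_a, cand_b, expert in triplets
        if model_choice(query, cand_a, cand_b) == expert
    )
    return correct / len(triplets)

# Toy triplets with made-up embeddings and expert choices.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], 0),  # model picks A, expert picked A
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9], 1),  # model picks B, expert picked B
    ([1.0, 0.0], [0.8, 0.2], [0.2, 0.8], 1),  # model picks A, expert picked B
    ([0.5, 0.5], [0.5, 0.4], [1.0, 0.0], 0),  # model picks A, expert picked A
]

accuracy = triplet_accuracy(toy_triplets)
```

Under this setup, the only difference between Harmonia and the VS baseline is where the query point comes from: the network's output for the room image versus the embedding of the sofa already in the image.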
Based on 1800 triplets, the VS baseline achieved 62.4% accuracy in choosing the positive example. By comparison, Harmonia achieved 63.5% accuracy. This is 13.5 percentage points above random guessing (50%) and 1.1 percentage points above the baseline algorithm. Even though the overall performances of the Harmonia and VS algorithms are very close, their sample-level results differ significantly. Together, the triplets correctly predicted by either the VS or the Harmonia algorithm cover 93.2% of all triplets. This result demonstrates that the two algorithms succeed on different sets of examples, and that if we can create a well-defined distinction between the use cases, we have the potential to create a very successful system. For example, VS could be used to solve an “exact match” problem where the customer wants to find the exact same product that she saw somewhere else, whereas Harmonia could be used to recommend new products that might fit her specific style. Similarly, Harmonia could be used with the “search with room” tool, which provides a customer with complementary product recommendations after she submits a picture of her room. Figure 3 shows results for some sample room images. Here, Harmonia tends to recommend a variety of options based on the room’s style and color palette, whereas VS focuses on finding an exact product match.
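The reported percentages also determine how the two algorithms' correct predictions overlap. By inclusion-exclusion, the share of triplets that both got right is 62.4 + 63.5 - 93.2 = 32.7%, so roughly 29.7% were solved only by VS, 30.8% only by Harmonia, and 6.8% by neither. The small helper below reproduces that arithmetic; the function name and dictionary keys are illustrative.

```python
def overlap_breakdown(acc_a, acc_b, union):
    """Split triplets into both / only-A / only-B / neither, in percent.

    acc_a and acc_b are the two accuracies, and union is the share of
    triplets that at least one of the two algorithms predicted correctly.
    """
    return {
        "both": round(acc_a + acc_b - union, 1),   # inclusion-exclusion
        "only_a": round(union - acc_b, 1),
        "only_b": round(union - acc_a, 1),
        "neither": round(100.0 - union, 1),
    }

# A = VS baseline, B = Harmonia, using the percentages reported above.
breakdown = overlap_breakdown(acc_a=62.4, acc_b=63.5, union=93.2)
```

Each algorithm is uniquely correct on about 30% of the triplets, which is what makes routing between them by use case attractive.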
As we noted in the previous sections, the main advantage of Harmonia over VS is that Harmonia can suggest a variety of products by taking into account both room style and product similarity. It can therefore help a customer who is uncertain of what she wants to explore new ideas and see a variety of choices. For a customer who wants to find an exact match, the VS model is more useful. However, the VS algorithm requires an additional object detection step to extract a comparison product before performing the recommendation; this additional step might introduce detection errors and thus reduce the overall performance of the product recommender. Harmonia does not require this step and therefore avoids this potential pitfall.
In this project, we used a very limited amount of fully labeled data, which made the learning process more difficult for the deep network. Looking forward, we may use Wayfair’s large 3D model library to curate a synthetic (but very realistic) dataset and use that in training. The Harmonia project is our initial step towards building a mapping between the room style and product similarity spaces: it introduces a mapping from room space to product space. In theory, one can also find a mapping in the other direction, from product space to room space. This avenue could lead to very interesting applications, such as decorating a full room starting with only a single product.