Defining the problem is critical to the success of any project. Pre-analysis is often conducted to understand the current business problems and challenges, along with what we want to achieve and how to align this with business priorities.
A label is the thing we are predicting in a machine learning scenario (for example, a relevant piece of content in a feed recommendation system). Label definitions are key to building the training, validation, and testing data sets. Depending on the implications and business priorities, the label definition could be different. As an example, when developing a churn prediction model, the labels can be defined as complete churn (renewal rate = 0) vs. not complete churn (renewal rate > 0). Alternatively, the labels could be defined as partial churn (renewal rate < 1) vs. no churn (renewal rate >= 1). The first definition fits better when we focus on keeping customers, while the second fits better when we focus on growth.
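To make the two churn definitions concrete, here is a minimal sketch. The function names and the sample renewal rates are made up for illustration; only the thresholds come from the definitions above.

```python
def complete_churn_label(renewal_rate: float) -> int:
    """1 when the customer churned completely (renewal rate = 0), else 0."""
    return 1 if renewal_rate == 0 else 0


def partial_churn_label(renewal_rate: float) -> int:
    """1 when the customer renewed less than before (renewal rate < 1), else 0."""
    return 1 if renewal_rate < 1 else 0


# Illustrative renewal rates, not real data.
renewal_rates = [0.0, 0.4, 1.0, 1.3]

complete = [complete_churn_label(r) for r in renewal_rates]
partial = [partial_churn_label(r) for r in renewal_rates]

print(complete)  # [1, 0, 0, 0]
print(partial)   # [1, 1, 0, 0]
```

The customer with a renewal rate of 0.4 is the interesting case: a retention-focused label treats them as kept, while a growth-focused label treats them as churned.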
Features are the inputs to a machine learning system; for a feed recommendation system, the feature is the content. We are often faced with too many features from multiple data sources. Thus, we must first collect the features that are not only meaningful in solving our problem, but also in line with the labels we defined. Then, we integrate these features with the label, taking care to align dynamic features (those whose values change over time) with the label correctly. Finally, we clean and transform the data so that its patterns are easier to see.
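One way to think about aligning dynamic features with the label is a point-in-time join: for each labeled row, use only the most recent feature value observed on or before the label date. The sketch below assumes that interpretation. The data layout, field names, and values are hypothetical.

```python
from datetime import date

# Hypothetical feature snapshots: (customer_id, snapshot_date, job_views)
snapshots = [
    ("a", date(2024, 1, 1), 3),
    ("a", date(2024, 2, 1), 7),
    ("a", date(2024, 3, 1), 9),
    ("b", date(2024, 2, 15), 2),
]

# Hypothetical labels: (customer_id, label_date, label)
labels = [
    ("a", date(2024, 2, 10), 1),
    ("b", date(2024, 2, 1), 0),
]


def latest_snapshot_before(customer_id, cutoff):
    """Return the newest feature value seen on or before the cutoff, or None."""
    candidates = [
        (snap_date, value)
        for cid, snap_date, value in snapshots
        if cid == customer_id and snap_date <= cutoff
    ]
    # Tuples compare by date first, so max() picks the latest snapshot.
    return max(candidates)[1] if candidates else None


training_rows = [
    (cid, latest_snapshot_before(cid, label_date), y)
    for cid, label_date, y in labels
]

print(training_rows)  # [('a', 7, 1), ('b', None, 0)]
```

Customer "a" gets the February value rather than the later March one, and customer "b" gets no value at all, because the only snapshot postdates the label. Using a later snapshot would leak future information into training.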
We start by partitioning our data into training, validation, and testing sets. Then, we train our model on the training set. While doing that, we should choose our solver by considering the type of problem, the system requirements, and the balance to strike between performance and interpretability. To find the best-performing parameters for a solver, we can also run a hyperparameter search. Then, we use different evaluation techniques to choose the best model on the validation set. While choosing the best model, we should also consider business metrics. Finally, we report our model’s results on the test set.
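The flow above (partition, train, search hyperparameters on the validation set, report once on the test set) can be sketched with the standard library alone. Everything here is an assumption made for illustration: the synthetic data, the 60/20/20 split, and the tiny k-nearest-neighbours classifier standing in for whatever solver a real project would use.

```python
import random


def split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test partitions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return int(2 * sum(y for _, y in neighbours) > k)


def accuracy(train, rows, k):
    """Share of rows whose label the k-NN rule predicts correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Synthetic data (illustrative only): one feature x and a noisy binary label.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, int(x + rng.gauss(0, 0.15) >= 0.5)))

train, val, test = split(data)

# Hyperparameter search: choose k using the validation set only.
candidate_ks = [1, 5, 15]
best_k = max(candidate_ks, key=lambda k: accuracy(train, val, k))

# The test set is used once, to report the chosen model's result.
test_accuracy = accuracy(train, test, best_k)
print(best_k, round(test_accuracy, 3))
```

Keeping the test set out of the search is the point of the three-way split: the validation score is already optimistic for the chosen `k`, so only the test score is a fair estimate.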
Once the modeling process is over, we deploy and run the model in production. This enables us to schedule and run the scoring pipeline regularly.
Once we deploy the model, we regularly monitor feature and model performance to see how the model is performing and whether it is using the right data. If we decide to refresh our model, we retrain it and then conduct A/B testing to compare the new model with the old one. Depending on the A/B test results, we decide which model to use in production.
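As one possible way to read the A/B test results, the sketch below runs a two-proportion z-test on a conversion-style metric. The choice of test, the metric, and the counts are all assumptions for illustration. The text does not say which statistical comparison is used.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: old model (A) vs. retrained model (B).
z, p_value = two_proportion_z(success_a=200, n_a=2000, success_b=260, n_b=2000)
print(round(z, 2), round(p_value, 4))
```

A positive `z` with a small `p_value` would favor the retrained model on this metric. In practice the decision would also weigh the business metrics mentioned earlier.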
Even if we go through the six steps of the machine learning process, there is a chance that our model may not deliver the desired performance. This happens because there are many common pitfalls and challenges that may pop up during the process. During our tutorial, we talked about the two most common challenges: model interpretation and data quality.
Model interpretation is one of the challenges that we face in our day-to-day work. When we present our modeling results to our business partners, they care not only about the results, but also about the “why?” We could use feature importance (taken from the machine learning model that generated the results) to explain the key drivers, but this method has drawbacks, such as difficulty in interpreting the ranking of correlated variables, or bias toward variables with more categories. For example, let’s say we are using logistic regression to build a model that predicts who should receive an email about the career subscription. Suppose both the job search feature and the job view feature are important in deciding who should receive the email. If these two features are also correlated, then how do we decide which one is more important?
Instead, we use group-wise feature interpretation. In this method, we cluster features into buckets with semantic meaning, and then build models based on only the subset of the features within each bucket.
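A minimal sketch of the group-wise idea follows, under several assumptions: the feature names, the two buckets, the synthetic standardized data, the small gradient-descent logistic regression, and the use of held-out accuracy to compare buckets are all illustrative choices and do not describe the production method.

```python
import math
import random


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Plain batch gradient descent for logistic regression; returns (w, b)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1 / (1 + math.exp(-score))
            for j, xj in enumerate(xi):
                grad_w[j] += (p - yi) * xj
            grad_b += p - yi
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def accuracy(model, X, y):
    """Share of rows where the sign of the linear score matches the label."""
    w, b = model
    preds = [int(sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0) for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Synthetic, standardized data: job_view is strongly correlated with job_search.
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    job_search = rng.uniform(-1, 1)
    rows.append({
        "job_search": job_search,
        "job_view": job_search + rng.gauss(0, 0.1),
        "profile_edits": rng.uniform(-1, 1),
        "connections": rng.uniform(-1, 1),
    })
    labels.append(int(job_search > 0))

# Hypothetical buckets with semantic meaning.
buckets = {
    "job_activity": ["job_search", "job_view"],
    "profile": ["profile_edits", "connections"],
}

bucket_scores = {}
for name, features in buckets.items():
    X = [[row[f] for f in features] for row in rows]
    model = fit_logistic(X[:150], labels[:150])  # train on the first 150 rows
    bucket_scores[name] = accuracy(model, X[150:], labels[150:])  # hold out 50

print(bucket_scores)
```

Because the two correlated job features sit in the same bucket, the question shifts from "job search or job view?" to "how much does job activity as a whole explain?", which is easier to answer and to communicate.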