How our algorithms identify at-risk students, powering automated and human interventions
Since the beginning of 2016, Coursera has been rapidly expanding into the online degree space to give working professionals the opportunity to earn credentials from the world’s top universities. These programs allow a population of busy and geographically dispersed students to work toward valuable degrees with much greater flexibility at a more affordable price point.
One of the challenges of executing on this ambitious mission is the need to maintain a high level of support for all students in the program. This means helping university administrators support a much larger cohort of students than that found in the traditional on-campus setting, without the usual benefit of being able to interact with these students face-to-face every day. To succeed, administrators need to know which students are most in need of their support and how to provide that support efficiently.
To solve this challenge, Coursera is leveraging one of the core advantages of online learning over the traditional classroom — rich behavioral and learning data. We track each active student along many axes of their in- and out-of-course activity, and use machine learning techniques to understand the relationship between these features and a student’s level of risk. This allows university administrators to quickly and accurately diagnose which students are most in need of support, unlocking a level of efficiency not possible in other contexts.
Dropout prediction is not new in online learning, but the degree context requires a different flavor of prediction to succeed. We follow these steps both to predict dropout risk and to go beyond that to help students complete their degrees:
- Feature engineering: We track relevant features about each degree enrollment. These features can be enrollment-level (activity patterns, assignment completions, grades, etc.), user-level (time in the program, previous performance), or course-level (historical difficulty).
- Model training: Using completed terms of degree programs, we train a model that predicts the likelihood of each course enrollment being completed on time as a function of the features calculated in the previous step. We leverage a varying-coefficient model, in which the weight of each feature can change over the course of the term, to create predictions that are as accurate as possible at any point in the term. For example, average historical course difficulty is an important feature at the beginning of the term, but becomes less important as students move through the course and their activity provides a relatively stronger signal of their likelihood to complete.
- Output predictions: For each currently active enrollment, we use the most recently trained model to predict the likelihood of the course being completed. Each day, the features and predictions are updated based on the actions taken by the student. These predictions can then be leveraged by humans — and by automated products — to understand which students will benefit most from an intervention, and which type of intervention they need.
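To make the varying-coefficient idea in the steps above concrete, here is a minimal sketch in Python. It is not our production model: the feature names, the linear form of each coefficient, and every number are hypothetical, chosen only to show how a feature's weight can depend on how far the term has progressed.

```python
import math

# Hypothetical coefficients: each feature's weight is base + slope * t,
# where t in [0, 1] is the fraction of the term that has elapsed.
COEFFS = {
    "course_difficulty": (-2.0, 1.5),  # strong negative weight early, fading later
    "recent_activity": (0.5, 2.5),     # weak signal early, strong signal later
}
INTERCEPT = 0.5  # hypothetical baseline log-odds of completion


def coefficient(feature, t):
    """Weight of `feature` when a fraction `t` of the term has elapsed."""
    base, slope = COEFFS[feature]
    return base + slope * t


def completion_probability(features, t):
    """Logistic prediction whose feature weights vary with term progress."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    score = INTERCEPT + sum(
        coefficient(name, t) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


student = {"course_difficulty": 1.0, "recent_activity": 1.0}
early = completion_probability(student, t=0.0)  # difficulty dominates
late = completion_probability(student, t=1.0)   # activity dominates
```

For the same active student in a hard course, the early prediction is pulled down by course difficulty, while the late prediction is driven mostly by activity. In a trained model, the base and slope values (or a more flexible function of `t`) would be estimated from completed terms rather than set by hand.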
Putting these predictions to effective use requires more than identifying at-risk students. We also need to infer the reasons behind a student’s struggles so that we can choose the intervention most likely to get them back on track. Knowing that a student is at risk does not, by itself, tell an administrator how to help: Is the problem that the student has been absent for the last week and might have a life circumstance interfering with their education? Or is it that the student has been struggling with programming assignments and might need more attention from a TA? These two situations are addressed in very different ways, in this case even by different people at our partner institutions, and it’s important that our at-risk predictions capture this nuance.
Our solution is to leverage models that are easily interpretable to the human eye. In addition to predictions of the likelihood of completion, we also output the relative importance of each feature (or set of features) in driving the prediction. This, in turn, informs a different suite of interventions depending on which factors are most important to a student’s at-risk status.
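One simple way to get this kind of output from a linear or logistic model is to compare each feature's value with a typical baseline and multiply by the model coefficient. The sketch below illustrates that idea only. The function name, feature names, coefficients, and baseline values are all hypothetical, and the actual importance measure may differ.

```python
def feature_contributions(coefs, features, baseline):
    """Rank features by how far they move the score away from a baseline student.

    Each contribution is coefficient * (student value - baseline value),
    measured in log-odds for a logistic model. Largest effects come first.
    """
    contributions = {
        name: coefs[name] * (features[name] - baseline[name]) for name in coefs
    }
    return sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical coefficients, student features, and cohort-average baseline.
coefs = {"days_since_last_login": -0.3, "assignment_score": 2.0, "forum_posts": 0.05}
student = {"days_since_last_login": 10, "assignment_score": 0.8, "forum_posts": 2}
average = {"days_since_last_login": 2, "assignment_score": 0.75, "forum_posts": 3}

ranked = feature_contributions(coefs, student, average)
top_factor, top_effect = ranked[0]
```

In this made-up example the top factor is `days_since_last_login` with a negative effect, which points toward an absence-related outreach rather than extra tutoring.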
For students who are only marginally at risk, we trigger an automated email that recommends steps they can take to get back on track based on their course progress. For students who need more help, we share the model outputs and at-risk features with student success teams at our university partners. This lets those teams focus on the students who need them most and tailor each intervention to a data-driven summary of why that particular student needs help.
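The routing described above can be sketched as a small decision function. The thresholds (0.8 and 0.5), the function name, and the returned labels are assumptions made for illustration and are not the values used in production.

```python
def choose_intervention(completion_probability, top_factor):
    """Map a predicted completion probability to a hypothetical intervention tier."""
    if not 0.0 <= completion_probability <= 1.0:
        raise ValueError("completion_probability must be in [0, 1]")
    if completion_probability >= 0.8:  # assumed cutoff: student is on track
        return {"action": "none", "reason": None}
    if completion_probability >= 0.5:  # assumed cutoff: marginally at risk
        return {"action": "automated_email", "reason": top_factor}
    # Higher risk: hand off to a human, along with the main driver of the risk.
    return {"action": "refer_to_success_team", "reason": top_factor}


on_track = choose_intervention(0.92, "recent_activity")
marginal = choose_intervention(0.65, "assignment_score")
high_risk = choose_intervention(0.20, "days_since_last_login")
```

Passing the top contributing factor along with the action is what lets an email or a success-team member respond to the specific reason a student is struggling.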
Our at-risk model for degree students is still in its early stages. As we scale our programs and gain more training data and experience, the model will continue to improve in both accuracy and interpretability. In addition, we are continuing to learn from the student support experts at our partner institutions and, in turn, to expand the suite of automated model-powered interventions: an expert-powered feedback loop!
Continued automation of student support will be paramount in enabling us to deliver on our promise to increase access to transformational learning experiences at low cost and at scale. We are excited to have both rich data and some of the best experts in the field accelerating these advancements for Coursera learners.