Data Driven Content Categorization – Coursera Engineering – Medium


From hand-coded to an algorithmic approach

Courses on Coursera cover topics ranging from photography to probabilistic graphical models to constitutional struggles in the Muslim world. This diversity makes them hard to categorize. A couple of years ago we overhauled our course categories and implemented a new categorization system we call domains and subdomains. This post covers how we defined and implemented that new system.

The Previous Course Categories

Coursera’s original categorization scheme dated back to our founding in 2012, and was heavily influenced by the content available at the time. For example, we had five categories of computer science subfields, but only one category for all of the humanities. The categories were also manually and arbitrarily defined, resulting in redundancies (e.g., “Food and Nutrition” being nearly a subset of “Health and Society”) and vagueness (e.g., “Information, Tech & Design”).

Critically, the original categorization scheme was not meeting our need of effectively matching learner to content. For example, the “Medicine” category attracted two distinct groups of learners — because it contained two distinct groups of courses. The first were courses that appealed to healthcare practitioners (e.g., on clinical kidney transplantation or biocontainment for infectious diseases). The second were courses on public health issues that appealed to non-practitioners.

As our catalog expands to thousands of courses, we need a principled organization technique. We want categories that help the learner find the best content for them. This translates to the following criteria:

  1. Simple (as few categories as possible)
  2. Minimally redundant (as mutually exclusive as possible)

t-SNE to the rescue

Rather than re-coding by hand, or replicating traditional university departments, we took a data-driven approach.

We sought to group our courses so that someone interested in one course in that group, say, playing the guitar, might also be interested in other courses in that group, say, songwriting or jazz improvisation. The algorithm known as t-distributed stochastic neighbor embedding (t-SNE) satisfies this requirement.

t-SNE identifies an arrangement of courses such that courses sharing common learners are close and courses that do not share common learners are far apart. For example, Complex Analysis and Galois Theory are close together since many learners take both, while Taking Care of Horses and General Relativity are farther apart as the two courses do not share many learners.

We utilized the t-SNE algorithm on courses in 2015 to produce the scatterplot output as shown below. Each dot represents a single course. We then group these courses into categories by clustering (represented by the coloring).

Figure 1: t-SNE visualization of courses colored by cluster, circa 2015.

The general structure of our content

Figure 2. General subject area of courses.

Looking at Figures 1 and 2, the first thing we see is that courses are organized in a globally consistent way: humanities, social sciences, and business courses are in the top right half of the plot, while natural sciences, engineering, and computational sciences courses fall in the bottom left half.

Digging in more granularly reveals additional nuance:

  • Courses on business and finance are clustered together on the right
  • Courses about the natural sciences (physics, chemistry, and biology) are on the left
  • Courses on the computational sciences (math, cs, and statistics) are at the bottom
  • Courses on the social sciences and humanities are at the top
Figure 3. General division of science and humanities courses.
Figure 4. Substructure of courses within each half of the plot.

Dissecting each of these large regions further, we see that even within each grouping, courses are arranged logically. For example, courses in natural history span a continuum roughly from the biological sciences to the physical sciences. Similarly, courses in humanities and social science range roughly from music to the visual arts, humanities, and social sciences, and then practical business.

Even at the level of individual courses, t-SNE captures the interdisciplinary nature of courses. Courses in business law, for example, fall on the boundary between business and law, and courses on quantitative methods for the social sciences fall between math and the social sciences.

Figure 5. Interdisciplinary courses sit roughly between the right clusters.

Returning to the set of courses previously categorized as “medicine,” we now have three sub-categories. First is a relatively disjointed cluster of courses targeted to the medical professional (e.g., “Ebola: Essential Knowledge for the Healthcare Professional”). Second is a cluster of courses on healthcare policy (e.g., “Systems Thinking in Public Health”), and lastly, we have a cluster on basic biology (e.g., “Introduction to Genetics and Evolution”).

The result is noteworthy because there is no strong reason why t-SNE should arrange our courses by subject matter. We weren’t feeding in course descriptions or transcripts, just the enrollment behavior of learners. We attribute the clusterability of courses to the fact that learners are much more likely to be interested in multiple courses in a particular subject area rather than to be influenced in their course decision by non-subject-matter factors such as the style of instruction or the institution offering the course.

That said, this assumption does not hold across the full catalog. For example, among non-English content, the language of instruction is more a driver of enrollment than the subject area. Correspondingly, a French or Russian course is more likely to be grouped with other courses taught in the respective language than it is to be grouped with other courses on the same subject.

After some clean up, we landed at 36 course clusters, which rolled up into nine larger clusters. On the Coursera platform today, we term the original clusters “subdomains” and the larger clusters “domains.” We launched the new system of domains and subdomains in summer 2015. In the past three years, it has become an integral part of Coursera’s content discovery experience, allowing us to scale content categorization and drive personalization for each and every learner on our platform.



Source link