Though diverse teams are far from a reality in many fields, it is now widely accepted that building one will make your organization more successful. Many insightful articles and research studies have driven this point home, focusing on diversity axes like gender, ethnicity, culture, age, orientation, and physical ability. Those dimensions are all important. If you participate in the hiring process for your team and haven’t taken the time to read about why diversity is critical, you absolutely must. Like, *now*. This post can wait.

**Amass More (Methodological) Tools**

Data Science is no exception to the rule that diversity breeds success. In fact, I’d argue that diversity of viewpoints has even more leverage in a data science team than in most other teams. Why? Data Science is all about problem-solving: distilling an unstructured business need, translating it to a tractable mathematical framing, converging on the methodological details, and implementing the resulting model or algorithm. The problem-solving process has been summarized by Helge Seetzen of TandemLaunch as “mapping a fairly small set of ready-made solution archetypes onto the problem and going with the best fit.” It stands to reason that the larger the set of archetypes at your disposal, the better the solution you will find. By adding new team members with different points of view, you are adding unique solution archetypes.

A great way to diversify your set of solution archetypes is to hire scientists with **many different academic backgrounds**. For many of us, our grad school years are formative. It’s when we learn how to think like a scientist — when we hone the tools that will be brought to bear on real-world problems. For other data scientists, grad school is replaced by years of work experience in a domain, or by extensive self-study. These experiences train us to **specialize** — that is, to emerge with one or two really dependable, finely-tuned tools.

As a data science team builder, your first goal should be to amass the biggest collection of high-quality tools that you can. From there, the obvious approach seems promising: given a new data science problem, choose the most appropriate tool from that large, diverse collection.

In other words, build your team like a Swiss army knife.

**Data Science as Optimization**

But the notion of discretely mapping a problem to a tool is actually too rigid in practice — there is considerable variation within any discipline, and methodologies are often combined for greater effect. Extending the idea, one can think of problem-solving as a general optimization exercise, with different academic orientations representing axes of the parameter space over which you are optimizing (see Figure 1 for a cartoon illustration).

In this example, the best solution can be achieved with the right combination of ideas from statistics and operations research (OR). Or can it? The maximum in Figure 1 corresponds to the best solution we can achieve after projecting down to a 2-D subspace, namely the (statistics, OR) plane. With a third dimension, say machine learning, we may do even better!
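To make the projection argument concrete, here is a toy sketch. The objective function, the grid, and the names `solution_quality`, `best_2d`, and `best_3d` are all made up for illustration; the only point is that a search restricted to the (statistics, OR) plane can never beat a search that also has the machine learning axis available.

```python
import itertools

def solution_quality(stats, opres, ml=0.0):
    """Toy objective, invented for illustration: higher is better.

    The peak sits at stats=0.6, opres=0.3, ml=0.5, so the best solution
    needs a contribution from all three disciplines.
    """
    return -((stats - 0.6) ** 2) - ((opres - 0.3) ** 2) - ((ml - 0.5) ** 2)

# Coarse grid of "how much of each discipline" to use: 0.0, 0.1, ..., 1.0
grid = [i / 10 for i in range(11)]

# Best solution restricted to the (statistics, OR) plane, with ml fixed at 0.
best_2d = max(solution_quality(s, o) for s, o in itertools.product(grid, grid))

# Best solution once machine learning is available as a third axis.
best_3d = max(
    solution_quality(s, o, m) for s, o, m in itertools.product(grid, repeat=3)
)

print(f"best in (statistics, OR) plane: {best_2d:.3f}")
print(f"best with ML as a third axis:   {best_3d:.3f}")
```

Because the 2-D plane is just one slice of the 3-D grid, `best_3d` is always at least as large as `best_2d`, whatever objective you plug in.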

We’ll revisit this analogy later when discussing the tradeoffs between generalization and specialization.

**Diversity All the Way Down**

The more diverse your problem space, the more academically diverse your team should be. For example, Lyft has data science problems in areas as far-flung as Line passenger matching, fraud detection, user engagement, location estimation, and fare prediction. It’s tempting to bucket these problems into broad categories and staff accordingly: hire statisticians for “inference problems,” operations researchers for “optimization problems,” computer scientists for “machine learning problems,” etc. This approach fails for two reasons:

1. *A problem cannot fit squarely in one bucket*

Lyft Line matching would seem to map to a well-studied optimization problem: construct a graph of passengers, draw edges based on some constraints, and solve for a matching that globally maximizes a pre-specified objective function over some batch of ride requests. Unfortunately, both the constraints and objective function depend critically on variables which are either not yet observed or are unobservable:

- what is the fastest route for a given match? (optimization)
- will the passenger cancel given a particular detour length? (prediction)
- what is the local state of demand in the network? (inference)

And it doesn’t stop there. To find the optimal route we must predict travel times and distances accurately; to build such prediction systems we must make inferences from historical data to construct detailed models of road network flow. In this sense, data science problems have a fractal nature: problems beget subproblems, which beget sub-subproblems, and so on. At each level, the paradigm we see most commonly at Lyft is illustrated in Figure 2. Models and inference feed into prediction systems, which feed into optimization modules, which make the decisions which generate more data, which feeds back into models.
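As a rough sketch of how prediction feeds optimization in the matching example above, the snippet below scores each candidate pair with predicted quantities and then solves a tiny matching by brute force. Everything here is hypothetical: the detour table, the `cancel_probability` stand-in, the `MAX_DETOUR_MIN` constraint, and the objective are invented for illustration and are not a description of how Lyft Line works. A real system would also swap the brute-force search for a proper matching algorithm.

```python
# Hypothetical output of a travel-time prediction system:
# predicted detour (in minutes) if the two passengers share a ride.
predicted_detour_min = {
    ("a", "b"): 2.0, ("a", "c"): 9.0, ("a", "d"): 6.0,
    ("b", "c"): 4.0, ("b", "d"): 12.0, ("c", "d"): 3.0,
}

MAX_DETOUR_MIN = 10.0  # assumed constraint: no edge beyond this detour

def cancel_probability(detour_min):
    """Placeholder for a learned cancellation model (an assumption)."""
    return min(1.0, 0.05 * detour_min)

def pair_value(p, q):
    """Edge weight for matching p with q, or None if the edge is infeasible."""
    detour = predicted_detour_min.get(tuple(sorted((p, q))))
    if detour is None or detour > MAX_DETOUR_MIN:
        return None
    # Toy objective: detour saved, discounted by the chance of a cancel.
    return (1.0 - cancel_probability(detour)) * (MAX_DETOUR_MIN - detour)

def best_matching(passengers, value_fn):
    """Brute-force maximum-value matching; only sensible for tiny batches."""
    if len(passengers) < 2:
        return 0.0, []
    first, rest = passengers[0], passengers[1:]
    # Option 1: leave `first` unmatched.
    best_value, best_pairs = best_matching(rest, value_fn)
    # Option 2: match `first` with each feasible partner.
    for partner in rest:
        value = value_fn(first, partner)
        if value is None:
            continue
        remaining = [p for p in rest if p != partner]
        sub_value, sub_pairs = best_matching(remaining, value_fn)
        if value + sub_value > best_value:
            best_value = value + sub_value
            best_pairs = [(first, partner)] + sub_pairs
    return best_value, best_pairs

total_value, pairs = best_matching(["a", "b", "c", "d"], pair_value)
print(pairs, round(total_value, 2))
```

Note that the optimizer itself is the easy part here. The quality of the result depends entirely on `predicted_detour_min` and `cancel_probability`, which is the fractal structure in miniature.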

2. *Data science problems are not cookie-cutter*

Real data is messy — rarely will one find an open-and-shut case that yields to a ready-made textbook solution. In a diverse team, it will be equally rare to see two data scientists approach a problem from exactly the same angle. Take dynamic pricing: the problem of using price as a lever to balance our ridesharing marketplace at a micro scale and ensure reliable service for all users. An economist looks at this problem and sees supply and demand curves, and elasticity estimation. The electrical engineer sees a classical control system. The statistician sees a forecasting problem, the mathematician a spatio-temporal stochastic process, etc. They are all right, and each of these viewpoints can add value to the solution.

**Generalists and the Curse of Dimensionality**

Given the importance of diversity of academic orientation, it may be tempting to hire only generalists: jack-of-all-trades types who have exposure to a broad range of data science concepts but limited depth in any one area. Generalists are versatile and can move from project to project with ease. In the early stages of a startup, scientists in this mold can be extremely valuable, especially if they have the scrappiness and engineering skills to go along with their methodological breadth.

Suppose that a major problem area in your company spans two academic disciplines, statistics and OR. It’s tempting to think that hiring a generalist with some expertise in both fields yields the same scientific power as hiring two specialists, at half the cost! Or, at worst, that the single generalist will be half as effective as the pair of specialists. This intuition is incorrect for reasons rooted in the so-called Curse of Dimensionality: the notion that, as the number of dimensions increases, the amount of space covered by a product of 1-dimensional sets becomes, relatively speaking, smaller and smaller.

In general, suppose a problem’s optimal solution lives in *n* linear academic dimensions. Assume for simplicity that a scientist’s expertise occupies a contiguous interval of some length (possibly 0) in each domain. Her **total expertise** can then be defined as the sum of the lengths of these intervals. Let’s further assume that all scientists have the same total expertise *E*. Along each dimension of the solution space, a generalist’s expertise therefore only occupies a relatively small range: assume for simplicity that it’s *E*/*n*. By contrast, each specialist covers *E* units of expertise in a single dimension.

If we assume a uniform prior on solutions to our problem, then the probability of finding an optimal solution is proportional to the volume, in *n*-D space, of the product of the 1-D expertise sets. The total volume that the generalist covers is (*E*/*n*)^*n*, which is vanishingly small relative to *E^n* for large *n*. By contrast, *n* specialists together account for *E^n* units of *n*-D expertise. This is **n^n times the knowledge for n times the price: a bargain**!
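The arithmetic is easy to check numerically. The sketch below uses hypothetical function names and sets *E* = 1 for convenience; it simply evaluates the two volume formulas from the text and prints their ratio for a few values of *n*.

```python
def generalist_volume(total_expertise, n_dims):
    """Volume covered by one generalist: E/n along each of n axes."""
    return (total_expertise / n_dims) ** n_dims

def specialist_team_volume(total_expertise, n_dims):
    """Volume covered by n specialists: E along each of n axes."""
    return total_expertise ** n_dims

E = 1.0  # total expertise per scientist (arbitrary units, an assumption)

for n in (1, 2, 3, 5, 10):
    ratio = specialist_team_volume(E, n) / generalist_volume(E, n)
    print(f"n={n:2d}  specialists/generalist volume ratio = {ratio:,.0f}")
```

The ratio grows as *n^n*, while the cost of the specialist team only grows as *n*.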

Hold on a second. Can’t we just add *n* – 1 more generalists, each with expertise *E* divided evenly among the axes, to achieve the same *n*-D volume, i.e. (*nE*/*n*)^*n* = *E^n*? This would imply that a team of *n* generalists is just as effective as a team of *n* specialists. Unfortunately, it is likely that **any two generalists will have highly overlapping expertise** along a particular dimension. This is because in any subject area, it is necessary to learn the basics before going deep. As a result, the total expertise of *n* generalists along one axis may be substantially smaller than *E*. These phenomena are illustrated for *n* = 2 in Figure 3.
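Here is a small sketch of the overlap effect for *n* = 2, under the same simplifying model as above: each scientist's expertise is an interval per axis, and a team's coverage is the product of the per-axis union lengths. The helper names and the specific intervals are assumptions chosen for illustration. In particular, the "overlapping" generalists both start from the basics at 0 on every axis.

```python
def union_length(intervals):
    """Total length covered by a list of (start, end) intervals."""
    covered, current_end = 0.0, float("-inf")
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully inside what is already covered
        covered += end - max(start, current_end)
        current_end = end
    return covered

def team_volume(team):
    """Product over axes of the team's combined coverage on that axis.

    `team` is a list of scientists; each scientist is a list with one
    (start, end) expertise interval per axis.
    """
    volume = 1.0
    for axis in range(len(team[0])):
        volume *= union_length([scientist[axis] for scientist in team])
    return volume

E = 1.0  # total expertise per scientist (an assumption)

# Two specialists: each covers E on one axis and nothing on the other.
specialists = [[(0.0, E), (0.0, 0.0)], [(0.0, 0.0), (0.0, E)]]

# Two generalists who both learned the same basics: E/2 from 0 on each axis.
overlapping = [[(0.0, E / 2), (0.0, E / 2)], [(0.0, E / 2), (0.0, E / 2)]]

# Idealized generalists whose expertise happens not to overlap at all.
complementary = [[(0.0, E / 2), (0.0, E / 2)], [(E / 2, E), (E / 2, E)]]

for name, team in [("specialists", specialists),
                   ("overlapping generalists", overlapping),
                   ("complementary generalists", complementary)]:
    print(f"{name:26s} volume = {team_volume(team):.2f}")
```

Only the idealized, perfectly complementary pair of generalists matches the specialists. Once both generalists cover the same basics, the second hire adds nothing along either axis in this toy model.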