How to train and score data in Scala using our Isolation Forest library
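As a point of reference, here is a minimal sketch of training and scoring with the library, modeled on the usage shown in the project's README. It assumes Spark and the isolation-forest artifact are on the classpath. The toy data, the column names f1 and f2, the object name TrainAndScore, and all parameter values are illustrative assumptions rather than recommended settings.

```scala
import com.linkedin.relevance.isolationforest.IsolationForest
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.randn

object TrainAndScore {

  // Hypothetical toy data: 2,000 rows with two Gaussian features,
  // assembled into the single vector column the estimator reads.
  def toyData(spark: SparkSession): DataFrame = {
    val raw = spark.range(2000).select(randn(1L).as("f1"), randn(2L).as("f2"))
    new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(raw)
  }

  // Fit an Isolation Forest on `data`, then score the same rows.
  def trainAndScore(data: DataFrame): DataFrame = {
    val contamination = 0.1 // assumed fraction of outliers, for illustration only
    val isolationForest = new IsolationForest()
      .setNumEstimators(100)
      .setBootstrap(false)
      .setMaxSamples(256)
      .setMaxFeatures(1.0)
      .setFeaturesCol("features")
      .setPredictionCol("predictedLabel") // name of the appended label column
      .setScoreCol("outlierScore")        // name of the appended score column
      .setContamination(contamination)
      .setContaminationError(0.01 * contamination)
      .setRandomSeed(1)

    val isolationForestModel = isolationForest.fit(data)
    isolationForestModel.transform(data)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("isolation-forest-sketch")
      .getOrCreate()

    val data = toyData(spark)
    val dataWithScores = trainAndScore(data)
    dataWithScores.printSchema()

    spark.stop()
  }
}
```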
The output DataFrame, dataWithScores, contains every column of the input DataFrame, data, plus two appended result columns whose names are set via model parameters; in this example, they are named predictedLabel and outlierScore.
Why use Isolation Forests?
We chose the Isolation Forest algorithm for multiple reasons. Isolation Forests are a top-performing unsupervised outlier detection algorithm. They are also scalable, as their computational and memory requirements are low compared to common alternatives. They make fewer assumptions than other candidate algorithms: they are non-parametric and do not require a distance metric. Finally, Isolation Forests are actively used in academia and industry, which allows us to leverage best practices and new developments shared by others in the field.
For some types of abuse, such as spam, it is possible to have a scalable review process where humans label training examples as spam or not spam. There are other types of abuse, such as scraping, where this kind of scalable human labeling is much more difficult, or impossible. Often, the labels you are able to obtain for training and evaluation are fuzzy; the precision may be less than ideal and there may be poor recall for some types of abusive behavior. Unsupervised techniques such as Isolation Forests are designed for problems with few or no labels, so they help to circumvent these label-based challenges.
The lack of good labels for training is further complicated by the fact that the problem is adversarial. Bad actors are often quick to adapt and evolve in sophisticated ways. Even if we are able to obtain some labels for abusive behavior identified today, the labels may not be representative of what abusive activity looks like tomorrow.
Isolation Forests can be used to overcome this challenge. As long as new abusive behavior is located in a different region of the feature space than normal, organic user behavior, we can often detect it using outlier detection, even if we did not have labeled examples when the model was trained.
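To make this intuition concrete, here is a minimal, self-contained sketch of the isolation idea in plain Scala. It is not the library's implementation: the names IsolationSketch, buildTree, pathLength, and score are hypothetical, the data is synthetic, and the scoring follows the standard formulation from the original Isolation Forest paper. The model is fit only on "organic" points clustered near the origin and then scores a point from a region it never saw during training.

```scala
import scala.util.Random

object IsolationSketch {
  type Point = Array[Double]

  sealed trait Node
  final case class Leaf(size: Int) extends Node
  final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

  // Average path length of an unsuccessful binary-search-tree lookup over n points;
  // used to normalize path lengths and to credit leaves that still hold several points.
  def c(n: Int): Double =
    if (n <= 1) 0.0
    else if (n == 2) 1.0
    else 2.0 * (math.log(n - 1.0) + 0.5772156649) - 2.0 * (n - 1.0) / n

  // Recursively partition the sample with random axis-aligned splits.
  def buildTree(points: Vector[Point], depth: Int, maxDepth: Int, rng: Random): Node =
    if (depth >= maxDepth || points.size <= 1) Leaf(points.size)
    else {
      val feature = rng.nextInt(points.head.length)
      val values = points.map(_(feature))
      val (lo, hi) = (values.min, values.max)
      if (lo == hi) Leaf(points.size) // simplification: stop if the chosen feature is constant
      else {
        val threshold = lo + rng.nextDouble() * (hi - lo)
        val (left, right) = points.partition(_(feature) < threshold)
        Split(feature, threshold,
          buildTree(left, depth + 1, maxDepth, rng),
          buildTree(right, depth + 1, maxDepth, rng))
      }
    }

  // Number of splits needed to isolate p, plus the leaf-size adjustment.
  @annotation.tailrec
  def pathLength(p: Point, node: Node, depth: Int): Double = node match {
    case Leaf(size)        => depth + c(size)
    case Split(f, t, l, r) => pathLength(p, if (p(f) < t) l else r, depth + 1)
  }

  // Build numTrees trees, each on a random subsample of the data.
  def fit(data: Vector[Point], numTrees: Int, sampleSize: Int, rng: Random): Vector[Node] = {
    val maxDepth = math.ceil(math.log(sampleSize) / math.log(2)).toInt // roughly log2(sampleSize)
    Vector.fill(numTrees)(buildTree(rng.shuffle(data).take(sampleSize), 0, maxDepth, rng))
  }

  // Outlier score in (0, 1]; shorter average paths give higher scores.
  def score(p: Point, trees: Seq[Node], sampleSize: Int): Double = {
    val meanPath = trees.map(t => pathLength(p, t, 0)).sum / trees.size
    math.pow(2.0, -meanPath / c(sampleSize))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Synthetic "organic" behavior: a Gaussian cluster around the origin.
    val organic = Vector.fill(1000)(Array(rng.nextGaussian(), rng.nextGaussian()))
    val trees = fit(organic, numTrees = 100, sampleSize = 256, rng)

    val typical = score(Array(0.0, 0.0), trees, 256) // inside the organic region
    val unseen = score(Array(8.0, 8.0), trees, 256)  // a region absent from training
    println(f"typical=$typical%.3f unseen=$unseen%.3f")
  }
}
```

The point far from the organic cluster is separated from the rest after only a few random splits, so it receives a noticeably higher score than the point at the origin, even though no example like it appeared in the training data.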
Abusive behavior is a very small fraction of all member activity on LinkedIn. This is a natural use case for outlier detection, where outliers are expected to be far less numerous than inliers.
The Anti-Abuse Team at LinkedIn detects and prevents a wide variety of abuse across a diverse set of product surfaces. Our models often use features calculated using events from tracking infrastructure owned by other teams. This is a heterogeneous, dynamic environment that requires the use of a generalizable modeling strategy that supports easy retraining as the infrastructure changes underneath our models. Isolation Forests are easy to retrain if feature distributions shift, which helps to satisfy this requirement.
Potential uses for Isolation Forests
There are many uses for unsupervised outlier detection in the abuse detection domain and other related areas at large internet companies, including:
- Automation detection: Identify abusive accounts that are using automation to scrape data, send spam, or generate fake engagement
- Advanced persistent threats: Identify and prioritize sophisticated fake accounts for review by human experts
- Insider threats/intrusion detection: Detect compromised employee machines via anomalous network traffic
- ML health assurance: Automatically detect anomalous feature values and shifts in feature distributions
- Account takeover: Increase recall for account takeover detection
- Alerting on time-series data: Find anomalies in multi-dimensional time-series data
- Payment fraud: Flag suspicious payments to prevent fraud
- Data center monitoring: Automatically detect anomalies in data center infrastructure
The Isolation Forest library is now open source
As a result of the successful use of the library across multiple abuse verticals at LinkedIn, we have released it as open source software. You can find our Isolation Forest library on GitHub.
Special thanks to Jenelle Bray, Shreyas Nangalia, Romer Rosales, and Ram Swaminathan for their support of this project. Thank you to Frank Astier and Ahmed Metwally for providing advice and performing code reviews. Finally, thank you to Will Chan, Jason Chang, Milinda Lakkam, and Xin Wang for providing useful feedback as the first users of this library.