And, of course, it’s straightforward to create new transformers, too.
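To make the idea concrete, here is a minimal, self-contained sketch of what a "prepared" (already-trained) transformer amounts to conceptually: a typed function node that can be wired into a DAG. The interface and names below are ours for illustration only, not Dagli's actual API:

```java
import java.util.Locale;
import java.util.function.Function;

// Conceptual analogue only (not Dagli's real classes): a prepared transformer
// is essentially a typed function node that maps its input(s) to a result.
interface PreparedTransformer<A, R> extends Function<A, R> { }

public class LowercaseExample {
  // A hypothetical single-input transformer that lowercases a String.
  static final PreparedTransformer<String, String> LOWERCASED =
      s -> s.toLowerCase(Locale.ROOT);

  public static void main(String[] args) {
    System.out.println(LOWERCASED.apply("Dagli")); // prints "dagli"
  }
}
```

In Dagli itself, transformers additionally carry typed input references to their parent nodes in the DAG, which is what allows the whole pipeline to be defined, trained, and serialized as a single object.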
Many common models are included: K-means Clustering, Gradient Boosted Decision Trees (XGBoost), Logistic Regression (liblinear), Isotonic Regression, FastText (an enhanced Java port), and Neural Networks.
Neural networks may be assembled seamlessly as part of the encompassing DAG definition using Dagli’s layer-oriented API, with the architecture specified as a directed acyclic graph of layer nodes. Many types of layers are provided, with more planned in the future; an even wider range of model architectures is supported by using CustomNeuralNetwork to wrap an arbitrary DeepLearning4J model. And, of course, if you have an existing, already-trained TensorFlow or PyTorch model, you can use their respective Java bindings to implement a new transformer that wraps them, too (although unfortunately, defining and training new models from Java is not yet well-supported by either framework).
Dagli provides “meta transformers” for model selection (choosing the best of a set of candidate models), cross-training (used to avoid overfitting when one model’s prediction is an input to another), and other, more specialized uses (like training independent model variants on arbitrary “groups” of examples, as might be done for per-cohort residual modeling).
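The cross-training idea deserves a brief illustration: if model B consumes model A's prediction as a feature, A's prediction for each training example should come from a copy of A trained on folds that *exclude* that example; otherwise B learns from predictions A effectively memorized. The following self-contained sketch (a toy "model" that just predicts the mean label, with names of our own invention rather than Dagli's) shows the out-of-fold mechanism:

```java
import java.util.Arrays;

// Illustrative sketch of cross-training (not Dagli's API): each training
// example is predicted by a model variant trained without that example's fold.
public class CrossTrainSketch {
  // Toy "model": predicts the mean of the labels it was trained on,
  // excluding the given fold from training.
  static double fit(double[] labels, int excludeFold, int numFolds) {
    double sum = 0;
    int count = 0;
    for (int i = 0; i < labels.length; i++) {
      if (i % numFolds != excludeFold) {
        sum += labels[i];
        count++;
      }
    }
    return sum / count;
  }

  // Out-of-fold predictions: example i is predicted by the model trained
  // without fold (i % numFolds), so no example influences its own prediction.
  static double[] outOfFoldPredictions(double[] labels, int numFolds) {
    double[] preds = new double[labels.length];
    for (int i = 0; i < labels.length; i++) {
      preds[i] = fit(labels, i % numFolds, numFolds);
    }
    return preds;
  }

  public static void main(String[] args) {
    double[] labels = {1, 2, 3, 4};
    // Each prediction is the mean of the *other* fold's labels.
    System.out.println(Arrays.toString(outOfFoldPredictions(labels, 2)));
    // prints [3.0, 2.0, 3.0, 2.0]
  }
}
```

A meta transformer that does this automatically spares the modeler from hand-managing fold bookkeeping when stacking models.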
Dagli offers a diverse set of transformers, including those for text (e.g., tokenization), bucketization, statistics (e.g., order statistics), lists (e.g., ngrams), feature vectorization, manipulating discrete distributions, and many others. The list of Dagli modules can provide a good starting point for finding the transformer you need.
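As one example of the kind of list transformation mentioned above, here is a self-contained sketch of an ngram computation over a token list (the method name and signature are ours, not necessarily those of Dagli's ngram transformer):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Self-contained illustration of an ngram transformation over tokens.
public class NgramExample {
  // Returns all contiguous n-token subsequences, joined with spaces.
  static List<String> ngrams(List<String> tokens, int n) {
    List<String> result = new ArrayList<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      result.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(ngrams(Arrays.asList("the", "quick", "brown", "fox"), 2));
    // prints [the quick, quick brown, brown fox]
  }
}
```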
Evaluation algorithms for several types of problems are included as transformers that can either be used independently or as part of a DAG. Input data can be provided to Dagli in any form (example values just need to be available from some arbitrary Iterable), but we do include classes for conveniently reading delimiter-separated value (DSV) and Avro files, or writing and reading back Kryo-serialized objects. @Structs provide an easy-to-use, highly bug-resistant way to represent examples. Finally, visualizers for rendering DAGs as ASCII art or Mermaid markdown are provided (clients are, of course, free to add others), which can be especially helpful when documenting and explaining your model.
Dagli also contains several “sublibraries” that are useful independently of Dagli:
- com.linkedin.dagli.tuple: Provides tuples, sequences of typed fields such as Tuple3&lt;String, Integer, Boolean&gt; (a triplet of a String, an Integer, and a Boolean).
- com.linkedin.dagli.util.function: Functional interfaces for a wide range of arities and all primitive return types, including support for creating “safely serializable” function objects from method references.
- com.linkedin.dagli.util.*: A collection of data structures (BigHashMap, LinkedNode, LazyMap…) and many other utility classes (Iterables, ArraysEx, ValueEqualityChecker…) too extensive to adequately document here.
- com.linkedin.dagli.math: Vectors, discrete distributions, and hashing.
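To give a feel for the first of these sublibraries, here is a minimal sketch of the idea behind a typed triplet; it is a simplification for illustration, and the real Tuple3 API may differ:

```java
// Simplified sketch of a typed triplet like com.linkedin.dagli.tuple's Tuple3
// (field names and constructor here are ours, not necessarily Dagli's).
public class TupleSketch {
  static final class Tuple3<A, B, C> {
    final A value0;
    final B value1;
    final C value2;

    Tuple3(A a, B b, C c) {
      value0 = a;
      value1 = b;
      value2 = c;
    }
  }

  public static void main(String[] args) {
    // Each field keeps its own static type; no casts are needed on access.
    Tuple3<String, Integer, Boolean> t = new Tuple3<>("answer", 42, true);
    System.out.println(t.value0 + " " + t.value1 + " " + t.value2);
    // prints "answer 42 true"
  }
}
```

The appeal of typed tuples in a DAG library is that intermediate multi-valued results can flow between transformers without defining a bespoke class for every combination of field types.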
With Dagli, we hope to make efficient, production-ready models easier to write, revise, and deploy, avoiding the technical debt and long-term maintenance challenges that so often accompany them. If you’re interested in using Dagli in your own projects, please learn more at our GitHub page, or jump straight to the list of extensively commented code examples.
Thanks to Romer Rosales and Dan Bikel for their support of this project, David Golland for contributing the Isotonic Regression model, Juan Bottaro for his tokenizer implementation, and Haowen Ning, Vita Markman, Diego Buthay, Andris Birkmanis, Mohit Wadhwa, Rajeev Kumar, Phaneendra Angara, Deirdre Hogan, and many others for their extensive feedback and suggestions.