In December, we attended the artificial intelligence and machine learning conference NeurIPS 2018 in Montreal, Canada. In this post, we share our personal observations from the event, explain the trends in artificial intelligence research, and provide an overview of specific hot topics in addressing the problems in online systems and web applications. Though we provide an overview of each problem for a broader audience, we will also dive into a couple of selected papers or ideas in more detail.
For those who may be unfamiliar, NeurIPS consists of three main sessions: Tutorials, Main Conference (including keynotes and publications), and 20+ parallel Workshops. The conference has grown every year in terms of both audience size and publication count. Both the conference and the workshops had several parallel sessions, so given the limited space and our bias towards consumer web applications, we likely missed several quality papers, research directions, and ideas in this writeup.
The topics addressed throughout the conference seemed to be more diversified compared to the previous few years:
Deep learning continued to be prevalent, but the conference was slightly more diverse regarding topics. Research areas such as variational inference, optimization techniques, Gaussian processes, and fundamental understanding of algorithms were also quite visible at the conference.
There was a massive focus on fairness, diversity, and AI for social good across keynotes, workshops, and discussions.
The uptrend in reinforcement learning continued.
It was clear that the ML community cares about error bounds and confidence intervals (again), not just estimates. Welcome back, Bayesian machine learning!
Workshops covered diverse application domains such as medical imaging, social good, and mobile devices, as well as computer architecture and databases.
Keynotes: Public policy, trustworthiness of algorithms, and reproducibility of research
Keynotes are among the most essential sessions at NeurIPS. They have always been of very high quality, addressing critical topics, providing a vision, and guiding the community on which areas to pay attention to. For this post, we’ll focus on three talks:
The topic of Public Policies for AI was discussed by Edward Felten of Princeton University, who is not only a well-known computer science researcher but has also served at the White House Office of Science and Technology Policy as Deputy U.S. Chief Technology Officer. His main discussion points centered on whether AI should be regulated or not. If yes, how? How can AI researchers better inform policy decisions? There are two key messages Felten delivered:
We should expect AI to be regulated. AI is revolutionizing the world, and as AI changes everything, this will include law and policies as well. The question is how AI researchers can provide the right guidance for this change.
For researchers to have the right impact in shaping public policy, their strategy should blend three aspects: 1) knowledge/facts, 2) the preferences of the legislators/decision makers, and 3) a thorough analysis of the decision space.
The second topic was the Reproducibility of AI Research, particularly in the context of reinforcement learning, which is one of the most popular research areas in machine learning these days. Reinforcement learning focuses on AI agents learning their behaviors from rewards provided in a constrained environment, such as games like chess or Go. Joelle Pineau of McGill University, who is also the head of the Facebook AI Research lab in Montreal, talked about reproducibility, reusability, and robustness of AI research in this direction. The main discussion point was whether we have a reproducibility crisis in reinforcement learning, and Prof. Pineau provided several important data points suggesting that even highly cited research papers’ results cannot be reproduced consistently. Common systematic problems include the way research has been implemented and published, the experimentation framework, the level of detail in the presentation of results, and the use of statistical tests. The data she provided was extensive. For example, baseline results for the same algorithm change from publication to publication, and results for the same algorithm change from implementation to implementation. Even for the same algorithm and implementation, results differ significantly from run to run. This is not an acceptable level of scientific rigor. We should be able to measure the strengths and weaknesses of various algorithms systematically. The keynote provided a thorough reproducibility checklist for the community to follow and demonstrated that if the checklist is completed, we can get significantly more stable results.
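To make the run-to-run variance concern concrete, here is a minimal sketch of reporting a result across several random seeds instead of a single run. This is our own illustration, not an item quoted from the keynote's checklist. The function name `summarize_runs` and the per-seed returns are hypothetical, and the interval uses a normal approximation, which is a simplification for so few runs.

```python
import math
import statistics

def summarize_runs(returns, z=1.96):
    """Summarize final returns from independent runs (one per random seed).

    Returns the mean, the sample standard deviation, and an approximate
    95% confidence interval for the mean (normal approximation).
    """
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(len(returns))
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical final returns of the same algorithm and implementation,
# differing only in the random seed.
returns_by_seed = [210.0, 180.0, 250.0, 160.0, 200.0]
mean, std, ci = summarize_runs(returns_by_seed)
print(f"mean={mean:.1f} std={std:.1f} ci=({ci[0]:.1f}, {ci[1]:.1f})")
```

Reporting the spread alongside the mean makes it visible when two algorithms differ by less than their own seed-to-seed noise.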
David Spiegelhalter of Cambridge University talked about the Trustworthiness of Algorithms. As AI becomes more mainstream and is used in applications such as self-driving cars and health care, its trustworthiness becomes critical. Currently, there is no systematic procedure for evaluating and deploying AI solutions. When we look at similar areas with a direct impact on human life, such as drug development, we observe that a researcher must follow several very strict phases before a drug is available to the public. We should expect a similarly disciplined procedure for AI research and development. For example: Phase 1: Digital testing; Phase 2: Lab testing; Phase 3: Field testing (including alternative designs for randomized controlled experiments).
Below is an overview of select problems, and deep-dives into a couple of related papers that were presented at the event. We then talk about the workshops and cherry-picked talks. We highlight the trends in machine learning innovation and the typical problems in online systems and web applications.
Adversarial training has become a foundational machine learning tool
The term “adversarial” became highly popular with the introduction of Generative Adversarial Nets (GANs) by Goodfellow et al. The term has been used in different contexts, including: 1) Synthesis/generation, as in GANs; 2) Adversarial attacks, which refers to hacking deployed AI solutions via adversarially-modified input data. One additional context that has appeared is data bias elimination. This trend is continuing in full force. For example, Gong et al. address the word frequency bias in text corpora using adversarial training. When certain words have a much higher frequency (which is almost always the case), learned representations are dominated by a confounding factor, word frequency, instead of semantic similarity. In such scenarios, two words being close to each other in Euclidean space signifies similarity in terms of frequency, not semantics. This is not a desirable outcome.
The authors use adversarial training to address the problem. A neural network that’s learning word representations is connected to both a predictor (each application in question has its own task as the predictor) and a discriminator that tries to distinguish high-frequency from low-frequency words based on their embeddings. The entire system is adversarially trained through a minimax objective. Once training has converged (i.e., the zero-sum game reaches a Nash equilibrium), the discriminator can do no better than guess the label at random, which demonstrates that the learned representations (the input to the discriminator) no longer carry confounding frequency information. This results in higher-quality semantic embeddings (i.e., two words being close to each other in Euclidean space will mean that they are semantically similar).
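One common way to write a minimax objective of this kind is sketched below. The notation is our assumption rather than the paper's exact formulation: $\theta$ denotes the embedding parameters, $\phi$ the predictor parameters, $\psi$ the discriminator parameters, $L_{\text{task}}$ the task loss, $L_{D}$ the discriminator's classification loss, and $\lambda > 0$ a trade-off weight.

```latex
\min_{\theta,\,\phi}\;\max_{\psi}\;\; L_{\text{task}}(\theta, \phi) \;-\; \lambda\, L_{D}(\theta, \psi)
```

The discriminator ($\psi$) tries to reduce its own loss $L_{D}$, while the embeddings ($\theta$) are pushed to do well on the task and, at the same time, to make $L_{D}$ large, that is, to hide frequency information from the discriminator.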
The authors evaluate on several different NLP tasks: 1) word similarity, 2) machine translation, and 3) language modeling. For each application, the baseline method is compared with the adversarially-trained version. The adversarially-trained representations consistently boosted performance on the prediction task by decoupling the confounding word frequency from the real problem. This matters beyond text corpora, because almost any discrete data generated by users in the online world (e.g., posts, engagements, likes, unstructured text via comments) follows a power-law distribution (Szabo et al.).
Another essential application of adversarial training is to enforce fairness in classifiers using a similar architectural pattern but with a different impact on the final prediction performance. Without adversarial training, a vanilla methodology for fairness is to use the explicit fairness input (e.g., gender) as a feature during training (to learn the bias factor) and then not take it into account during inference. This method is not sufficient because other correlated signals used during training can implicitly expose the gender information in question. Adversarial training can be a solution to this problem. For example, in the NeurIPS Relational Representation Learning workshop, Bose et al. use it to enforce fairness constraints (e.g., gender) on graph embeddings, even if such information is not explicitly available in the feature space. Once training has converged, the discriminator cannot predict the variable in question and the bias is eliminated. Since such signals can be important predictive attributes, this results in the ML algorithm losing predictive power (as expected).
In contrast, in Gong et al. the bias in the data deteriorates the performance of the predictive task. Hence, eliminating the bias helps to improve both the representations and the predictions. In Bose et al. (the fairness application), on the other hand, the attributes in question do help the predictive task, but we still want to eliminate them for non-technical reasons (e.g., we would like to provide equal opportunity). Even though the technical architectures seem similar in both applications, the way the underlying bias is correlated with the prediction task differs, resulting in improved prediction performance in Gong et al. and deteriorated performance in Bose et al.
Often, one needs to label data with tags drawn from a set with a cardinality of hundreds of thousands or millions. In classical multi-label classification, this is either infeasible or requires huge model sizes. So systems often constrain researchers to build a single global model for all the item-tag cross-pairs and score each tag candidate. Such a global model might scale from the system’s point of view, but precision on specific tags can be extremely poor. Two papers in the main conference stood out on this front.
Evron et al. consider error-correcting codes, which have been studied extensively in electrical engineering and communication theory. The main idea is to encode the classifier output with the appropriate lattice and to decode during scoring. Depending on the parameters of the lattice (e.g., depth vs. width), one can trade off the loss in decoding against the computation time. This makes it possible to balance precision against system load. The proposed algorithm has logarithmic time and space complexity.
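The encode-then-decode idea can be illustrated with a generic output-code sketch. This is not the graph-based method of Evron et al.; it only shows why the number of binary predictors can grow logarithmically with the number of labels. The helper names `make_codebook` and `decode` and the example scores are hypothetical, and the brute-force decoding loop here is linear in the number of labels, which is exactly the cost the paper's structure is designed to avoid.

```python
import math

def make_codebook(num_labels):
    """Assign every label a distinct binary codeword of ceil(log2(K)) bits."""
    n_bits = max(1, math.ceil(math.log2(num_labels)))
    # Bit b of the codeword is bit b of the label index (least significant first).
    return [[(label >> b) & 1 for b in range(n_bits)]
            for label in range(num_labels)]

def decode(bit_scores, codebook):
    """Loss-based decoding: pick the label whose codeword is closest
    (in squared error) to the scores produced by the binary predictors."""
    def loss(codeword):
        return sum((s - c) ** 2 for s, c in zip(bit_scores, codeword))
    return min(range(len(codebook)), key=lambda label: loss(codebook[label]))

codebook = make_codebook(6)       # 6 labels need only 3 binary predictors
noisy_scores = [0.9, 0.2, 0.8]    # hypothetical outputs of those predictors
predicted = decode(noisy_scores, codebook)
print(predicted)
```

In this toy setup the model only has to learn one binary predictor per codeword bit, rather than one scorer per label.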
Wydmuch et al. consider probabilistic label trees and demonstrate that they are a generalization of hierarchical softmax. Similarly, this architecture achieves logarithmic complexity using tree-like structures.
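A label-tree prediction step can be sketched as follows, under our own simplifying assumptions: each leaf is a tag, each node has an estimator for the probability that its subtree contains a relevant tag given that its parent's subtree does, and a tag's probability is the product of these estimates along its root-to-leaf path. The `Node` class, the `node_prob` callback, and the example probabilities are hypothetical stand-ins for trained per-node classifiers.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                     # a leaf's name doubles as its tag
    children: list = field(default_factory=list)  # an empty list marks a leaf

def predict_tags(root, node_prob, threshold):
    """Return {tag: probability} for leaves whose path probability
    is at least `threshold`, pruning low-probability subtrees."""
    found = {}
    stack = [(root, 1.0)]
    while stack:
        node, parent_p = stack.pop()
        p = parent_p * node_prob(node)  # chain rule along the path
        if p < threshold:
            continue                    # skip the whole subtree
        if not node.children:
            found[node.name] = p
        else:
            stack.extend((child, p) for child in node.children)
    return found

tech = Node("tech", [Node("python"), Node("java")])
biz = Node("biz", [Node("hiring"), Node("sales")])
root = Node("root", [tech, biz])

# Hypothetical per-node estimates for one input document.
fake_probs = {"root": 1.0, "tech": 0.9, "biz": 0.2,
              "python": 0.8, "java": 0.3, "hiring": 0.5, "sales": 0.5}
tags = predict_tags(root, lambda n: fake_probs[n.name], threshold=0.25)
print(tags)
```

Because subtrees below the threshold are never visited, only a few root-to-leaf paths are evaluated, and in a balanced tree each path has a length that is logarithmic in the number of tags.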
Text, NLP, and more
In addition to Gong et al.’s word frequency paper, there were several other word-embedding and NLP papers. In an analysis paper, Scott et al. study inductive transfer learning. They consider three lines of research, all of which try to take the information learned in an original domain (e.g., a text corpus of all research articles) and use it for a prediction problem in a target domain (e.g., a new conference/workshop on a rising topic) in which data is often limited:
Weight transfer: the learned coefficients of the initial model trained on the original domain are used as an initialization for the embeddings, and the full training is run on the target domain.
Deep metric learning: clusters of embeddings that capture the classes are learned in the original and target domains with the specified loss.
Few-shot learning: the amount of labeled data in the target domain is very limited, and one uses prototype embeddings to classify the samples.
Their main findings are that the weight transfer method performs the worst of the three approaches, and that the best methodology is to use adapted embeddings, in which the original domain weights are fine-tuned in the target domain via limited training (early stopping prevents overfitting the limited training data in the target domain).
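The prototype idea mentioned for few-shot learning can be sketched as a nearest-prototype classifier: average the embeddings of the few labeled examples per class, then assign a new sample to the closest average. This is a generic illustration, not the exact setup studied by Scott et al. The function names and the two-dimensional embeddings are hypothetical, and in practice the embeddings would come from a trained network.

```python
def class_prototypes(embeddings, labels):
    """Compute one prototype per class as the mean of its embeddings."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def classify(vec, prototypes):
    """Assign the class whose prototype is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(vec, prototypes[label]))

# Hypothetical embeddings of a handful of labeled target-domain samples.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["a", "a", "b", "b"]
prototypes = class_prototypes(support, support_labels)

print(classify([1.0, 1.0], prototypes))
print(classify([4.0, 5.0], prototypes))
```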
Finding similarities in text corpora is another widespread problem. Constructing content-to-content similarity graphs and jump-starting content-based “related” item recommendations are the two obvious applications. Deudon proposes a Siamese network-based distance learner in which two text samples are fed to a Siamese network and Wasserstein-based distances are computed over the embeddings, which are modeled as Gaussian distributions with diagonal covariance. The embeddings can be learned via any downstream network; in the presented experiment, LSTMs are used. The author provides experiments for Quora question similarities and also demonstrates that pre-training is critical for competitive performance.
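For Gaussians with diagonal covariance, the 2-Wasserstein distance has a simple closed form: the squared distance is the squared difference of the means plus the squared difference of the standard deviations, summed over dimensions. The sketch below assumes this particular distance; whether the paper uses exactly this form is an assumption on our part. The function name and the example values are hypothetical.

```python
import math

def wasserstein2_diag(mu_a, sigma_a, mu_b, sigma_b):
    """2-Wasserstein distance between two Gaussians with diagonal covariance.

    mu_* are mean vectors and sigma_* are per-dimension standard deviations.
    """
    mean_term = sum((m1 - m2) ** 2 for m1, m2 in zip(mu_a, mu_b))
    std_term = sum((s1 - s2) ** 2 for s1, s2 in zip(sigma_a, sigma_b))
    return math.sqrt(mean_term + std_term)

# Hypothetical embeddings of two text samples, each a (mean, std) pair.
distance = wasserstein2_diag([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 5.0])
print(distance)  # sqrt(9 + 16) = 5.0
```

Unlike a plain distance between mean vectors, this also penalizes a mismatch in how uncertain the two embeddings are.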
Sentence and phrase generation is a problem with many different applications, such as chatbots, smart-replies, auto-complete systems, grammar checking, and text quality optimization. Text synthesis from scratch is a difficult task; even state-of-the-art algorithms produce outliers, which makes it harder to build deployable product applications. Hashimoto et al. present an algorithm that edits examples from the training set and produces high-quality sentences. The key idea is to first retrieve a sample from the training set that is similar to the input and then edit it by predicting the final sentence, conditioned on the retrieved sample. The authors train the retriever and editor separately due to computational difficulties. For the retriever, an encoder-decoder network is trained to extract embeddings, and for the input text, the nearest neighbor over the embeddings from the training set is found. For the editor, a sequence-to-sequence model with attention is used.
The uptrend in reinforcement learning continues, but…
Almost a quarter of the main conference and four dedicated workshops were about reinforcement learning. From deep reinforcement learning to Bayesian reinforcement learning and policy-gradient methods, the community is attacking the problems from every angle. However, one issue that we observed (also raised in the keynote by Joelle Pineau) involves demonstrating the published research on real-world problems in a reproducible way. Though there are popular successes such as AlphaGo, most of the papers in the conference presented results on toy simulation platforms.
The last two days of NeurIPS were dedicated to workshops. Given that roughly 20 workshops occur simultaneously during each of these two days, it is nearly impossible to cover all of the talks. Notably, the Neural Information Processing Systems (NeurIPS) organization took major steps to add tracks focused on diversity and inclusion. These included, for example, Women in Machine Learning, Black in AI, LatinX in AI, and Queer in AI.
LatinX in AI
We had the opportunity to participate and present our work in the LatinX in AI full-day workshop. The workshop included a variety of quality contributions across a wide spectrum, from theory to applications, and saw a diversity of participants from all over Latin America and beyond. The keynote talk posing the question “Can AI Be Unfair?” was a good start to the program, bringing technology and societal/ethical concerns together. Among other aspects, we believe the main question asked in the talk is whether it is possible for AI to do its job well without having access to potentially discriminatory data such as age and gender. We look at this problem as maximizing performance subject to constraints on what information is used. Other topics included a discussion of the Reinforcement Learning (RL) Prototyping Framework (called Dopamine), which aims for flexibility, stability, and reproducibility of RL research. Two other interesting problems discussed were estimating causal relationships in a network (e.g., a social network) and automating machine learning when the underlying data distribution changes (e.g., in the presence of drift). The latter talk described and compared a few methods with preliminary but interesting results. Some of the talks, including the last two topics, can be seen in this video.
Related to the concept of AutoML (automating the application of machine learning to real-world problems), the workshop on continual learning (CL) was interesting. The workshop focused on the problem of how to learn continuously from a data stream. Two main subproblems are the ability to adapt to new data or a new environment and the ability to efficiently use what was learned in the past. Some of the properties that are “desired” in these systems include online learning, transfer of past learnings to new data, and resistance to “catastrophic forgetting” (a large deterioration in performance on past data after learning from new data), among others. The workshop discussed these topics, although some of them still seem subject to debate. Overall, the workshop focused more on the evaluation of CL and its connection to other areas of machine learning. There is a general lack of agreement on how to evaluate CL algorithms, and as a result there was a focus on formulating one or more metrics. Several talks stressed the importance of computational efficiency as a metric when evaluating CL. In addition to computational efficiency, knowledge transfer and memory overhead seemed the most sensible metrics. A combination of the above metrics was also proposed along with a few benchmark datasets.
Regarding CL’s connection with other areas of ML, one argument is that it is closer to reinforcement learning, where the value function needs to be re-estimated when the rewards change. Other areas that may have some connection, such as transfer learning and multi-task learning, on the other hand, are not continuous in nature. Some people refer to CL as “lifelong learning.”
One of the most interesting workshops was on causal learning. It included a solid set of speakers such as Bernhard Schölkopf, Isabelle Guyon, Pietro Perona, and David Blei. Pietro Perona addressed the overfitting problem and how it has created a substantial new adversarial attack challenge for computer vision tasks, even though accuracies on benchmark datasets have improved significantly over the last decade. An interesting example was the object detection problem. Current state-of-the-art solutions can be easily fooled by adversarially perturbed images (changed so slightly that the human eye cannot distinguish the difference). Top classifiers are still sensitive to variations in the object’s background (e.g., a cow on the beach vs. the same cow on the grass can be classified differently). This exposes the fact that modern learning techniques overfit to background signals in the data. It also suggests that even though accuracy levels have increased (even beating human performance on specific tasks), human learning is more causal and machine learning still has a long way to go.
David Blei presented Wang et al.’s “Deconfounder,” a method for causal inference from observational data. The application is to estimate the impact of the actors on a movie’s revenue. The proposed method consists of two main parts: 1) a factor model, and 2) predictive checking. The main idea is that if a factor model is learned appropriately, providing a low-dimensional representation for the actors, then the joint distribution of multiple causes (e.g., the entire set of actors) can be factorized given the latent representation. This implies that the causal effects of multiple actors can be isolated from each other. There can still be other per-actor confounders, but this is a weaker concern than confounding that acts jointly on multiple causes. What is unique about this work is that it provides a practical methodology for handling causal inference questions on any bipartite graph, such as determining which skills on a resume lead to the highest salary or increase the chance of being hired.
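The factorization described above can be written compactly. The notation is our assumption: $a_1, \dots, a_m$ are the causes (e.g., indicators for each actor appearing in a movie) and $z$ is the latent representation learned by the factor model.

```latex
p(a_1, \dots, a_m \mid z) \;=\; \prod_{j=1}^{m} p(a_j \mid z)
```

In words, once $z$ is known, the causes are conditionally independent of one another, so any dependence between them is captured by $z$.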
Machine Learning Systems (MLSys)
There is currently significant interest in research that’s at the intersection of systems and machine learning. In 2018, the first conference on systems and machine learning (SysML) was held and a second event for 2019 is in the works. At NeurIPS, this trend continued with the MLSys workshop. The main focus of this workshop was building large-scale learning systems that can scale to current and future applications. There was a huge emphasis on system design and particularly software engineering. As deep learning models become more and more complex and data hungry, finding scalable solutions that handle such complexity has become more vital, both for cost-of-service and for reasonable training times. Of note, the talks were mainly focused on offline training systems. We also found the lunch demo session useful, where companies like Amazon, Google, Facebook, etc. shared their open source ML solutions for research and industry use.
AI for social good
The AI for Social Good workshop focused on what AI can do to make a positive societal impact. In addition to a great program, the panel did a good job of highlighting areas that are controversial. As an example, the panel called out that while there are a lot of areas where AI is helping or intended to help (the creation of workshops like this and institutes for social good are encouraging), the public is concerned about AI’s potential to negatively impact people’s lives. This goes beyond the discussion on job displacement. It more generally pertains to how applications in industry may be using AI to optimize so fiercely for company profit that they often lose sight of how this may (indirectly or directly) be at the expense of social gains. We were glad to see that discussions and presentations on the role of AI in society were present in the conference well beyond this workshop. These included other workshops such as Ethical, Social, and Governance Issues in AI, tutorials such as Common Pitfalls for Studying the Human Side of Machine Learning (discussing misconceptions in the machine learning community when thinking about fairness, accountability, and transparency), and invited talks such as Machine Learning and Public Policy.
Relational representation learning
In online systems, graphs are everywhere, either explicitly as a social network or implicitly as a bipartite graph constructed from user-generated data. Incorporating relational information while solving problems over graphs, such as node classification or edge inference, can boost prediction performance or increase the quality of learned embeddings. In the relational representation learning workshop, there were several valuable talks about relational reinforcement learning, distance metrics, and bias elimination.
Nearest-neighbor search in high-dimensional space is a common sub-step or post-stage of relational learning algorithms. The time needed for inference is critical in web applications. Li et al. propose a significant improvement in run time using a custom data structure. The authors build on top of their previous work using dynamic continuous indexing. In the recent improvement, they provide a prioritized order of neighbor computation, which significantly reduces the run time during inference. Compared to locality-sensitive hashing, the proposed method reduces the memory requirement by 21x and the run time by 14-116x.
NeurIPS is a leading foundational AI conference whose machine learning research often gives birth to the next generation of AI technologies. The trends at NeurIPS reflect where the community is applying its focus and how the hardest AI problems should be approached. In this post, we presented our observations from the event in 2018; more specifically, we looked at the trends in machine learning research with a bias towards the problems in online systems and web applications.
Thanks to the NeurIPS researchers for their hard work continuing to redefine the frontier of AI research for another year.
Thanks to our peer reviewers Deepak Kumar and Mahesh Joshi, who significantly improved the quality of this blog post from early drafts. Thanks to the LinkedIn communication team and partners; their feedback was instrumental in shaping this blog post.