Spark Summit 2017: Research, Open Source, and Community

Next Tuesday marks the start of the Spark Summit Conference in San Francisco. This year, LinkedIn engineers and data scientists are presenting four separate talks at the conference, and we’ll be hosting a meetup at our San Francisco office on the final day. All of this reflects the significant impact that Apache Spark has had on the way people process and analyze data at LinkedIn.

We are excited to be able to give back to the community by sharing our experiences and learnings from the past year. The rest of this post provides quick summaries of the talks, along with links to the extended abstracts.

Spark Summit presentations

Random Walks on Large Scale Graphs with Apache Spark
Tuesday, June 6 from 12:20 p.m. to 12:50 p.m. in Room 2022

Min Shen, a member of LinkedIn’s Grid Platform team, will present a talk on several novel techniques for enumerating walks on very large-scale graphs, along with the specific properties of Apache Spark that were leveraged to achieve good performance and scalability. Random walks on graphs are a useful technique in machine learning, with applications in personalized PageRank, representation learning, and other areas. Min will also describe experiences at LinkedIn running Spark on production clusters.


Transforming B2B Sales with Spark-Powered Sales Intelligence
Tuesday, June 6 from 3:20 p.m. to 3:50 p.m. in Room 2016

In their talk, Songtao Guo and Wei Di from LinkedIn’s Business Analytics Data Mining team will discuss how they leverage Apache Spark to build sales intelligence products. They will also describe how they have used Spark-ML to construct applications for prospect prediction and prioritization, churn prediction, and model interpretation, as well as the challenges they encountered and lessons they learned along the way.


Multi-Label Graph Analysis and Computations Using GraphX
Tuesday, June 6 from 4:20 p.m. to 4:50 p.m. in Room 2022

Real-world applications often require analysis of graphs whose nodes and edges carry multiple labels. In this session, Qingbo Hu from LinkedIn’s Business Analytics Data Mining team and Qiang Zhu from Airbnb will describe a framework based on Apache Spark that supports analysis on multi-label graphs and can be reused and extended to design more complicated algorithms. The framework includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. The speakers will explain how LinkedIn leverages this tool to efficiently compute top LinkedIn feed influencers in different communities and by different actions, as well as how common graph algorithms, such as PageRank, can be efficiently implemented in parallel by reusing existing GraphX modules and algorithms, such as the Pregel API.


Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop
Wednesday, June 7 from 11:00 a.m. to 11:30 a.m. in Room 2006

Dr. Elephant helps improve Apache Spark and Apache Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.

In this session, Carl Steinbach, a member of LinkedIn’s Grid Platform team, and Simon King, an engineer at Pepperdata, will explore how Dr. Elephant works, the data it collects from Apache Spark environments, and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and greenlight applications for use on production clusters.
