Networking @Scale 2018 recap


Recently, we hosted more than 500 engineers at the Computer History Museum in Mountain View, California, for our fourth Networking @Scale conference. As with previous Networking @Scale conferences, we looked for operators of some of the largest networks in the world and asked them to share their on-the-ground experiences in designing, building, and operating those networks. This year, speakers from IBM, Microsoft, Amazon, Blizzard, and Facebook walked through the entire OSI networking stack, from layer 0 to layer 7.

For the first time, we also hosted a Powerful Women in Tech lunch, where we encouraged attendees to become allies and to help make workplaces more inclusive. The networking industry has many technical challenges ahead of it, so the community needs to make sure it includes everyone who is passionate about tackling them.

Finally, we strongly believe in the value of sharing and openness in the networking industry. To that end, the Facebook Network team announced and shared two projects: We open-sourced the underpinnings of our layer 4 load balancer and shared IPv6 adoption data that we’ve seen for countries around the world.

Thanks again to all the speakers and their companies for sharing their experiences with the community. If you have any questions about the talks or suggestions for future talks, please join our Networking @Scale Attendees Facebook group. For general information about our @Scale conferences, please join the community at https://www.facebook.com/atscaleevents.


Starting at layer 0, Alan Benner from IBM gave us an overview of how supercomputing has tackled high-speed optical networking over the years, something of great interest to many who are trying to build high-speed data center networks. Alan pointed out that many of the constraints, such as power and space, are similar between the two domains. Both also face the simple fact that it is extremely hard to string up tens of thousands of fibers without making any mistakes, so both need automation and tools to help deploy and debug these dense optical networks.

Moving up to the network layer, David Swafford from Facebook detailed a new system called Vending Machine and discussed how it has helped us automate complex, manual provisioning procedures. In the example of turning up a new Edge Point-of-Presence, there are many steps — configuration generation being just one. Keeping everyone up-to-date on the complex, long list of steps would be nearly impossible, so we built Vending Machine to provide that automation.

Also at the network layer, Elnaz Jalilipour Alishah from Microsoft described the system for modeling the design intent for their global backbone network, representing specific resilience and risk goals, and then automatically satisfying that intent using a machine learning platform. They constantly run Monte Carlo simulations to identify high-risk areas of their network and then run optimizations that find the minimum set of links that will reduce downtime for each potential failure.
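To make the Monte Carlo idea concrete, here is a minimal sketch of the general technique, not Microsoft's system. The four-site topology, the per-trial failure probabilities, and the names `LINKS`, `connected`, and `simulate` are all assumptions made up for illustration. The sketch fails links at random, checks whether the network stays connected, and counts which failed links show up most often in outages. It does not attempt the optimization step described in the talk.

```python
import random
from collections import defaultdict

# Hypothetical toy backbone: (site_a, site_b) -> assumed per-trial failure probability.
LINKS = {
    ("A", "B"): 0.05,
    ("B", "C"): 0.05,
    ("A", "C"): 0.05,
    ("C", "D"): 0.05,  # site D hangs off a single link
}


def connected(nodes, up_links):
    """Return True if every node is reachable from an arbitrary start node."""
    adjacency = defaultdict(set)
    for a, b in up_links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def simulate(links, trials=10000, seed=0):
    """Estimate the outage rate and count how often each failed link coincides with an outage."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    nodes = {site for link in links for site in link}
    outages = 0
    blame = defaultdict(int)
    for _ in range(trials):
        failed = [link for link, p in links.items() if rng.random() < p]
        up = [link for link in links if link not in failed]
        if not connected(nodes, up):
            outages += 1
            for link in failed:
                blame[link] += 1
    return outages / trials, dict(blame)


if __name__ == "__main__":
    rate, blame = simulate(LINKS)
    print(f"estimated outage rate: {rate:.3f}")
    for link, count in sorted(blame.items(), key=lambda item: -item[1]):
        print(link, count)
```

In this toy topology the single link to site D should dominate the blame counts, which is the kind of high-risk area such a simulation is meant to surface.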

Mark McKillop and Katharine Schmidtke from Facebook talked about challenges in at-scale optical networks, both in the backbone and inside the data centers. Mark covered how to bring the management of the optical backbone more in line with management of the IP backbone and data centers, with a much more automated level of topology discovery, statistics collection, and deployment. Katharine presented data collected from the optical network showing that the next generation of faster technology needs to use much less power and achieve a much higher yield during manufacturing.

Sarah Chen and Paul Saab from Facebook announced the availability of our country-specific IPv6 adoption data and walked through some of the highlights from this data, including the facts that Facebook’s IPv6 traffic from the U.S. just recently passed 50 percent and IPv6 mobile traffic is over 75 percent. Sarah also covered examples from other countries that show rapid IPv6 increases, especially tied to work by specific providers in those countries.

Alan Halachmi and Colm MacCarthaigh from Amazon went in-depth on HyperPlane, a fundamental system that underlies Amazon’s S3 Load Balancer, Elastic File System, VPC NAT Gateway, PrivateLink, and more. They highlighted a key principle to take away: We should always be operating in “repair mode,” where constant monitoring of and recovery from failures is the natural state.

Niky Riga from Facebook covered the design of Edge Fabric, Facebook’s system for managing our egress load back to the internet, as well as some recent developments and learnings. She shared results in measuring performance to destinations around the world. Niky also highlighted cases of sustained performance degradation across the primary/default path and how Edge Fabric is able to route around those as part of its control logic.
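As a rough illustration of routing around a degraded default path, the sketch below picks an egress path from recent latency samples. It is not Edge Fabric's actual control logic. The function name `choose_path`, the use of median round-trip time, the 20 ms threshold, and the minimum sample count are all hypothetical choices made for this example.

```python
from statistics import median


def choose_path(samples, primary, min_samples=5, threshold_ms=20.0):
    """Pick an egress path from recent RTT samples (milliseconds).

    samples: dict mapping path name -> list of recent RTT measurements.
    The primary path is kept unless an alternate is better by at least
    threshold_ms, measured by median RTT over enough samples.
    """
    primary_rtts = samples.get(primary, [])
    if len(primary_rtts) < min_samples:
        return primary  # not enough evidence to override the default
    primary_rtt = median(primary_rtts)
    best_path, best_rtt = primary, primary_rtt
    for path, rtts in samples.items():
        if path == primary or len(rtts) < min_samples:
            continue
        alt_rtt = median(rtts)
        # Override only for a clear, sustained improvement over the primary.
        if primary_rtt - alt_rtt >= threshold_ms and alt_rtt < best_rtt:
            best_path, best_rtt = path, alt_rtt
    return best_path


if __name__ == "__main__":
    degraded = {
        "peer": [90, 95, 110, 100, 98],       # hypothetical degraded default path
        "transit": [40, 42, 41, 39, 45],
    }
    print(choose_path(degraded, primary="peer"))
```

Using a median over several samples and a fixed margin is one simple way to avoid flapping on a single bad measurement. A production system would also need to account for capacity on the alternate path, which this sketch ignores.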

Nikita Shirokov from Facebook announced the open-sourcing of our new L4 load balancer, Katran, and covered the motivation for rewriting it: to increase scalability and flexibility. He showed how using recent kernel facilities such as XDP and eBPF results in a solution that is much faster and more CPU-efficient, and at the same time doesn’t require a kernel upgrade or restart.

Philip Orwig and Malachi Middlebrook from Blizzard wrapped up the conference by giving us the application perspective of the network. In the end, our networks exist not for themselves but for the applications and the people using them. Specifically, they covered how games like Overwatch, World of Warcraft, and Diablo work around the realities and restrictions of the network in order to provide a rapid multiplayer online game experience.


