Tag Archives: technical

Surviving switch failures in cloud datacenters

Rachee Singh, Muqeet Mukhtar, Ashay Krishna, Aniruddha Parkhi, Jitendra Padhye, David Maltz

Abstract

Switch failures can hamper access to client services, cause link congestion and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for over 3 months since deployment in production. However, there is significant heterogeneity in switch survival rates with respect to their hardware and software: the switches of one vendor are twice as likely to fail compared to the others. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.

Download from ACM

The case for model-driven interpretability of delay-based congestion control protocols

Muhammad Khan, Yasir Zaki, Shiva R. Iyer, Talal Ahamd, Thomas Poetsch, Jay Chen, Anirudh Sivaraman, Lakshmi Subramanian

Abstract

Analyzing and interpreting the exact behavior of new delay-based congestion control protocols with complex non-linear control loops is exceptionally difficult in highly variable networks such as cellular networks. This paper proposes a Model-Driven Interpretability (MDI) congestion control framework, which derives a model version of a delay-based protocol by simplifying a congestion control protocol’s response into a guided random walk over a two-dimensional Markov model. We demonstrate the case for the MDI framework by using MDI to analyze and interpret the behavior of two delay-based protocols over cellular channels: Verus and Copa. Our results show a successful approximation of throughput and delay characteristics of the protocols’ model versions across variable network conditions. The learned model of a protocol provides key insights into an algorithm’s convergence properties.

Download from ACM

Experience-driven research on programmable networks

Hyojoon Kim, Xiaoqi Chen, Jack T Brassil, Jennifer Rexford

Abstract

Many promising networking research ideas in programmable networks never see the light of day. Yet, deploying research prototypes in production networks can help validate research ideas, improve them with faster feedback, uncover new research questions, and also ease the subsequent transition to practice. In this paper, we show how researchers can run and validate their research ideas in their own backyards—on their production campus networks—and we have seen that such a demonstrator can expedite the deployment of a research idea in practice to solve real network operation problems. We present P4Campus, a proof-of-concept that encompasses tools, an infrastructure design, strategies, and best practices—both technical and non-technical—that can help researchers run experiments against their programmable network idea in their own network. We use network tapping devices, packet brokers, and commodity programmable switches to enable running experiments to evaluate research ideas on a production campus network. We present several compelling data-plane applications as use cases that run on our campus and solve production network problems. By sharing our experiences and open-sourcing our P4 apps [28], we hope to encourage similar efforts on other campuses.

Download from ACM

Distrinet: a Mininet Implementation for the Cloud

Giuseppe Di Lena, Andrea Tomassilli, Damien Saucez, Frédéric Giroire, Thierry Turletti, Chidung Lac

Abstract

Networks have become complex systems that combine various concepts, techniques, and technologies. As a consequence, modelling or simulating them now is extremely complicated and researchers massively resort to prototyping techniques. Mininet is the most popular tool when it comes to evaluate SDN propositions. Mininet allows to emulate SDN networks on a single computer but shows its limitations with resource intensive experiments as the emulating host may become overloaded. To tackle this issue, we propose Distrinet, a distributed implementation of Mininet over multiple hosts, based on LXD/LXC, Ansible, and VXLAN tunnels. Distrinet uses the same API than Mininet, meaning that it is compatible with Mininet programs. It is generic and can deploy experiments on Linux clusters (e.g., Grid’5000), as well as on the Amazon EC2 cloud platform.

Download from ACM