Programmable Networks for Distributed Deep Learning: Advances and Perspectives
Training large deep learning models is challenging due to the high communication overheads that data parallelism entails. Building on recent advances in programmable network devices, this talk describes our efforts to rein in distributed deep learning’s communication bottlenecks and offers an agenda for future work in this area. We demonstrate that an in-network aggregation primitive can accelerate distributed DL workloads and can be implemented on modern programmable switch hardware. We discuss designs for streaming aggregation and in-network data processing that lower switch memory requirements and exploit gradient sparsity to maximize effective bandwidth use. We also touch on gradient compression methods, which reduce communication volume and adapt to dynamic network conditions. Lastly, we consider the role that emerging programmable NICs may play in this space.
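To give a flavor of the streaming-aggregation idea mentioned above, the following is a minimal Python sketch (the names `Slot` and `stream_aggregate`, and all constants, are illustrative assumptions, not the talk's actual design): a switch-like aggregator keeps a small fixed pool of slots; each slot accumulates one gradient chunk from every worker, emits the element-wise sum, and is then reused for the next chunk, so aggregator memory stays bounded no matter how large the model is.

```python
# Hypothetical sketch of streaming in-network aggregation: a fixed
# pool of slots sums gradient chunks from all workers, so "switch"
# memory stays constant as model size grows.

NUM_WORKERS = 4   # number of training workers (illustrative)
POOL_SIZE = 2     # aggregation slots available on the "switch"
CHUNK_LEN = 8     # gradient elements per chunk

class Slot:
    """One reusable aggregation slot holding a partial chunk sum."""
    def __init__(self):
        self.values = [0.0] * CHUNK_LEN
        self.count = 0  # workers aggregated into this slot so far

def stream_aggregate(worker_grads):
    """worker_grads: one equal-length gradient vector per worker.
    Returns the element-wise sum across workers, computed chunk by
    chunk through a small reusable slot pool (mimicking the limited
    on-switch memory that streaming aggregation works around)."""
    length = len(worker_grads[0])
    result = [0.0] * length
    slots = [Slot() for _ in range(POOL_SIZE)]
    for chunk_start in range(0, length, CHUNK_LEN):
        slot = slots[(chunk_start // CHUNK_LEN) % POOL_SIZE]
        # Each worker "sends" its chunk; the slot accumulates it.
        for w in range(NUM_WORKERS):
            chunk = worker_grads[w][chunk_start:chunk_start + CHUNK_LEN]
            for i, v in enumerate(chunk):
                slot.values[i] += v
            slot.count += 1
        # All workers contributed: "broadcast" the sum, then reset
        # the slot so it can serve a later chunk.
        for i in range(min(CHUNK_LEN, length - chunk_start)):
            result[chunk_start + i] = slot.values[i]
        slot.values = [0.0] * CHUNK_LEN
        slot.count = 0
    return result
```

With four workers holding vectors of all 1.0s, 2.0s, 3.0s, and 4.0s, `stream_aggregate` returns a vector of 10.0s while only ever using `POOL_SIZE * CHUNK_LEN` slot entries, which is the memory-bounding property that makes on-switch aggregation feasible.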