Systematic Communication Acceleration for Distributed DNN Training
This talk discusses how to accelerate the communication in distributed DNN training, starting from the first principles of DNN model training. The objective of communication acceleration is to reduce training time given the same amount of GPU compute resources. We achieve this objective by systematically exploring the acceleration space across communication transport, scheduling, and topology. First, for communication transport, we use RDMA/RoCEv2 and address a set of engineering challenges to increase tensor transmission speed. Second, for communication scheduling, we find the optimal way to overlap communication with computation, so that communication time is hidden as much as possible. Third, for communication topology, we introduce a new framework that unifies Parameter Server (PS) and all-reduce as two special cases. Our framework achieves optimal use of network capacity, which neither PS nor all-reduce alone can. We integrate all of these findings into BytePS, which is open source.
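To give a concrete sense of how these ideas surface to users, below is a minimal sketch of a training loop using BytePS's PyTorch bindings, whose API mirrors Horovod's. The toy model, data, and hyperparameters are placeholder assumptions for illustration, not material from the talk.

```python
# Minimal sketch of training with byteps.torch (Horovod-like API).
# The model and data here are toy placeholders.
import torch
import byteps.torch as bps

bps.init()                                   # set up the BytePS communication backend
torch.cuda.set_device(bps.local_rank())      # one GPU per worker process

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * bps.size())

# Wrapping the optimizer lets BytePS schedule gradient push/pull and
# overlap it with backward computation.
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 1024).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                          # gradients are communicated as they become ready
    optimizer.step()                         # waits only for the tensors each update needs
```

In an actual deployment, one such process runs per GPU alongside separate server and scheduler processes, and BytePS decides how tensors are partitioned and routed between GPU workers and CPU machines to make full use of the available network capacity.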