With the rapid growth of dataset and model sizes, the training of deep neural networks (DNNs) is increasingly deployed in a distributed manner. In large-scale distributed training, the bottleneck has gradually shifted from computational resources to the communication process. Recent research adopts in-network aggregation (INA), which offloads gradient aggregation to programmable switches, thereby reducing network traffic and transmission latency. Unfortunately, due to bandwidth competition in shared training clusters, stragglers slow down the training efficiency of INA. To address this issue, we propose an Asynchronous Control based Aggregation Transport Protocol (AC-ATP), which makes full use of uncongested links to transmit gradients and of switch memory to cache gradients from fast workers, thereby accelerating gradient aggregation. Meanwhile, AC-ATP performs congestion control according to each worker's transmission progress and the remaining completion time of the job. Evaluation results from a real testbed and large-scale simulations show that AC-ATP reduces aggregation time by up to 68% and speeds up training on real-world benchmark models.
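To illustrate the idea of caching gradients from fast workers in switch memory until stragglers catch up, the following is a minimal, purely conceptual sketch; it is not the authors' implementation, and the names (`SwitchAggregator`, `on_gradient`) are hypothetical.

```python
# Toy sketch of in-switch gradient aggregation with caching for stragglers.
# Purely illustrative; names and behavior are assumptions, not the AC-ATP spec.
import numpy as np

class SwitchAggregator:
    """Caches partial sums in a switch-memory slot until all workers report."""

    def __init__(self, num_workers: int, grad_size: int):
        self.num_workers = num_workers
        self.buffer = np.zeros(grad_size, dtype=np.float32)  # switch memory slot
        self.seen = set()  # workers that have already contributed this round

    def on_gradient(self, worker_id: int, grad: np.ndarray):
        """Aggregate a fast worker's gradient immediately instead of waiting."""
        if worker_id in self.seen:
            return None  # duplicate (e.g. retransmission): ignore
        self.buffer += grad
        self.seen.add(worker_id)
        if len(self.seen) == self.num_workers:
            result = self.buffer.copy()
            # Free the slot for the next aggregation round.
            self.buffer[:] = 0.0
            self.seen.clear()
            return result  # aggregated gradient, ready to broadcast to workers
        return None  # still waiting for stragglers; partial sum stays cached


# Example: worker 2 is the straggler and arrives last.
agg = SwitchAggregator(num_workers=3, grad_size=4)
print(agg.on_gradient(0, np.ones(4)))       # None: cached
print(agg.on_gradient(1, np.ones(4) * 2))   # None: cached
print(agg.on_gradient(2, np.ones(4) * 3))   # [6. 6. 6. 6.]
```

In this toy model, fast workers' gradients are folded into the partial sum as soon as they arrive, so only the cached sum (not every individual gradient) occupies switch memory while waiting for the straggler.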