DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization

Lin Xiao  lin.xiao@microsoft.com
Microsoft Research AI
Redmond, WA 98052, USA

Adams Wei Yu  weiyu@cs.cmu.edu
Machine Learning Department, Carnegie Mellon University
Pittsburgh, PA 15213, USA

Qihang Lin  qihang-lin@uiowa.edu
Tippie College of Business, The University of Iowa
Iowa City, IA 52245, USA

Weizhu Chen  wzchen@microsoft.com
Microsoft AI and Research
Redmond, WA 98052, USA
October 13, 2017
Abstract
Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and to update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems, which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.
Keywords: asynchronous distributed optimization, parameter servers, randomized algorithms, saddle-point problems, primal-dual coordinate algorithms, empirical risk minimization
1. Introduction
Algorithms and systems for distributed optimization are critical for solving large-scale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form
$$\operatorname*{minimize}_{w \in \mathbb{R}^d}$$