The success of both these topics is partly due to a common denominator: the optimization problem at hand is
strongly structured, and this structure is harnessed to produce computationally efficient methods.
Secondly, distributed problems directly induce structure, as the global system state lives in the product
space corresponding to the replication of the master variable at the agents. Direct extensions
of optimization methods to this kind of setup then suffer from the burden of synchronization, since
all local computations have to be gathered before another round can begin. This bottleneck was
recently tackled both algorithmically and in terms of analysis from different viewpoints (stochastic
fixed-point theory [3, 19], direction aggregation, iterates aggregation). With the synchrony
constraint partially lifted, an important topic is to make these systems more robust and better adapted to
real-life issues that can arise on a network of hand-held devices. The study of the resilience to faulty or malicious
updates, and of the adaptation to evolving architectures, data, and loss functions, corresponds to Axis 2 of this project.
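To make the delay-tolerance idea concrete, here is a minimal sketch of an incremental aggregated gradient scheme on a toy split least-squares problem: the master keeps the last gradient received from each agent and always steps along their sum, so it never waits for all agents before updating. All names and the problem instance are illustrative; this is a simplified stand-in for the delay-tolerant proximal-gradient methods discussed above, not a specific published algorithm.

```python
import numpy as np

# Toy split least-squares problem: f(x) = sum_i 0.5 * ||A_i x - b_i||^2,
# with one term held by each of n_agents agents (illustrative instance).
rng = np.random.default_rng(0)
n_agents, dim = 4, 5
A = [rng.standard_normal((10, dim)) for _ in range(n_agents)]
b = [rng.standard_normal(10) for _ in range(n_agents)]

def local_grad(i, x):
    # Gradient of agent i's local term at the current master variable.
    return A[i].T @ (A[i] @ x - b[i])

# Step size from the full Lipschitz constant, shrunk to tolerate the
# staleness of the aggregated gradients.
AtA = sum(Ai.T @ Ai for Ai in A)
Atb = sum(Ai.T @ bi for Ai, bi in zip(A, b))
L = np.linalg.eigvalsh(AtA).max()
step = 1.0 / (8.0 * L)

# Delay-tolerant aggregation: only one randomly drawn agent reports per
# iteration; the stored gradients of the other agents stay stale, yet the
# master steps along the sum of all stored gradients without waiting.
x = np.zeros(dim)
last_grad = [local_grad(i, x) for i in range(n_agents)]
for _ in range(5000):
    i = rng.integers(n_agents)       # agent i sends a fresh gradient
    last_grad[i] = local_grad(i, x)  # the others remain outdated
    x = x - step * sum(last_grad)

x_star = np.linalg.solve(AtA, Atb)   # exact minimizer, for reference
```

Despite every step using mostly stale gradients, the iterates approach the exact least-squares solution, which is the point of this family of methods: asynchrony costs a smaller step size, not correctness.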
The two kinds of structure presented above may seem very different at first glance, but they actually lead to
similar theory and algorithms. A technical piece of evidence for this claim is the following: standard coordinate descent on
the proximal gradient algorithm imposes a separable regularizer; joining the coordinates is
the blocking point, and dealing jointly with messily updated coordinates is also what asynchronous
distributed methods are about. This structural link between coordinate descent methods and distributed
optimization is the central thread of this project. Exploiting this analogy, and harnessing the structure
brought by the regularization to reduce the cost of exchanges in distributed architectures, is Axis 3 of this project.
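The role of separability can be seen on the canonical example of the lasso: because the l1 regularizer splits coordinate-wise, each coordinate update of proximal coordinate descent has a closed form (soft-thresholding). The sketch below, with names of our choosing, illustrates this; it is the textbook coordinate descent for the lasso, not a method specific to this project.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * |.|; exists coordinate-wise precisely
    # because the l1 norm is separable.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cd_lasso(A, b, lam, n_epochs=100):
    """Coordinate descent for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Each coordinate is minimized exactly while the others are held
    fixed, using the closed-form soft-thresholding update."""
    n, d = A.shape
    x = np.zeros(d)
    col_sq = (A ** 2).sum(axis=0)  # per-coordinate Lipschitz constants
    r = b - A @ x                  # running residual
    for _ in range(n_epochs):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            old = x[j]
            z = old + (A[:, j] @ r) / col_sq[j]
            x[j] = soft_threshold(z, lam / col_sq[j])
            if x[j] != old:
                r -= A[:, j] * (x[j] - old)  # keep residual in sync
    return x
```

A non-separable regularizer would couple the coordinates and break this per-coordinate closed form, which is exactly the "blocking point" mentioned above.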
[1] H. H. Bauschke, R. Goebel, Y. Lucet, and X. Wang. The proximal average: basic theory. SIAM Journal on
Optimization, 19(2):766–785, 2008.
[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM
Journal on Imaging Sciences, 2(1):183–202, 2009.
[3] P. Bianchi, W. Hachem, and F. Iutzeler. A coordinate descent primal-dual algorithm and application to
distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.
[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for
non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654,
2014.
[5] J. Fadili, J. Malick, and G. Peyré. Sensitivity analysis for mirror-stratifiable convex functions.
[6] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian
sketching. arXiv:1805.02632, 2018.
[7] F. Hanzely, K. Mishchenko, and P. Richtárik. SEGA: Variance reduction via gradient sketching.
[8] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning:
Strategies for improving communication efficiency. arXiv:1610.05492, 2016.
[9] R. Leblond, F. Pedregosa, and S. Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Artificial Intelligence
and Statistics, pages 46–54, 2017.
[10] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306,
2017.
[11] J. Liang, J. Fadili, and G. Peyré. Local linear convergence of forward–backward under partial smoothness.
In Advances in Neural Information Processing Systems, pages 1970–1978, 2014.
[12] J. Mairal. Optimization with first-order surrogate functions. In International Conference on Machine Learning,
pages 783–791, 2013.
[13] M. Massias, J. Salmon, and A. Gramfort. Celer: a fast solver for the lasso with dual extrapolation. In
International Conference on Machine Learning, pages 3321–3330, 2018.
[14] K. Mishchenko, F. Iutzeler, and J. Malick. A distributed flexible delay-tolerant proximal gradient algorithm.
[15] K. Mishchenko, F. Iutzeler, J. Malick, and M.-R. Amini. A delay-tolerant proximal-gradient algorithm for
distributed learning. In International Conference on Machine Learning, pages 3584–3592, 2018.
[16] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
[17] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In
Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[18] J. Nutini, I. Laradji, and M. Schmidt. Let’s make block coordinate descent go fast: Faster greedy rules,
message-passing, active-set complexity, and superlinear convergence. arXiv:1712.08859, 2017.
[19] T. Sun, R. Hannah, and W. Yin. Asynchronous coordinate descent under more realistic assumptions. In
Advances in Neural Information Processing Systems, pages 6183–6191, 2017.
[20] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via
relaxation and inertia. Optimization Methods and Software, pages 1–23, 2017.
[21] F. Iutzeler and J. Malick. On the proximal gradient algorithm with alternated inertia. Journal of Optimization
Theory and Applications, 176(3):688–710, 2018.
[22] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B
(Methodological), pages 267–288, 1996.
[23] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules
for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 74(2):245–266, 2012.
[24] N. D. Vanli, M. Gurbuzbalaban, and A. Ozdaglar. Global convergence rate of proximal incremental aggregated
gradient methods. SIAM Journal on Optimization, 28(2):1282–1300, 2018.
[25] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed
optimization. arXiv:1710.09854, 2017.
[26] Y.-L. Yu. Better approximation and faster algorithm using the proximal average. In Advances in Neural
Information Processing Systems, pages 458–466, 2013.