Franck Iutzeler

ANR JCJC STROLL

Harnessing Structure in Optimization for Large-scale Learning

The growth and diversification of data collection techniques have led to tremendous changes in machine learning systems. Since the breakthrough of variance-reduced stochastic methods (SAGA, MISO [4, 12]), several research directions in optimization for learning have recently proven able to scale up to current challenges. Among them, let us focus on two promising trends:

Dimension Reduction. The premise of this topic is to identify pertinent directions in the variable search space and to concentrate most of the computational effort on these directions. A successful and typical example of such methods is the screening of variables for the lasso problem; for instance, one can mention the strong rules [23] used in the popular GLMNET R package or, more recently, the scikit-learn compatible package Celer (see https://mathurinm.github.io/celer/) [13]. A minimal code sketch illustrating this trend is given below.

Distributed Computing. Computing clusters have nowadays become easily available, for example through Amazon Web Services. Moreover, as handheld devices have become more and more powerful, learning with local computations performed directly on the devices is now a major trend, initiated by Google's federated learning framework (Google AI blog post: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html) [8].
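As a concrete illustration of the dimension reduction trend, the following minimal sketch (on synthetic data) fits a lasso model and inspects its sparse support using scikit-learn's Lasso estimator; the scikit-learn compatible estimator provided by Celer [13], which speeds up the solve with working sets and dual extrapolation, could be swapped in place of it.

    # Minimal sketch: a lasso fit on synthetic data; most coefficients end up exactly zero.
    # scikit-learn's Lasso is used here; Celer provides a compatible, faster estimator [13].
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, d = 200, 1000
    X = rng.standard_normal((n, d))
    w_true = np.zeros(d)
    w_true[:10] = rng.standard_normal(10)          # only 10 informative features
    y = X @ w_true + 0.01 * rng.standard_normal(n)

    clf = Lasso(alpha=0.1).fit(X, y)
    support = np.flatnonzero(clf.coef_)
    print(f"{support.size} of {d} coordinates are active")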

The success of both these topics is partly due to a common denominator: the optimization problem at hand is strongly structured and this structure is harnessed to produce computationally efficient methods.

Context

First, structure may be brought by regularization of the original learning problem, which typically enforces low-dimensional solutions. For instance, L1-norm regularization has been immensely popular since the success of the lasso problem [22], for which dimension reduction methods have proven highly profitable in practice, as mentioned above. In addition, it is well known that the iterates of most popular optimization methods actually become sparse in finite time for L1-norm regularized problems, in which case they are said to identify the sparsity pattern; from this point on, the convergence of the algorithm becomes faster in both theory and practice [11]. Combining identification with along-the-way dimension reduction, by adapting the coordinate descent probabilities to the identified important coordinates, was recently investigated and produced very promising results [18]. Identification results have moreover been extended to more general classes of regularizers (e.g. 1D total variation, trace norm; see [5] and references therein), for which no numerical dimension reduction methods are available yet. Efficiently harnessing these more general types of identification to improve the convergence of numerical optimization algorithms is Axis 1 of this project.
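As an illustration of identification, here is a minimal sketch (on synthetic data) of the proximal gradient method (ISTA) applied to a lasso problem: after a finite number of iterations, the set of non-zero coordinates of the iterates stops changing, and the algorithm effectively runs on a low-dimensional subspace.

    # Minimal sketch: ISTA on a synthetic lasso problem; the printed messages stop once
    # the sparsity pattern of the iterates is identified.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 300
    A = rng.standard_normal((n, d))
    x_true = np.zeros(d); x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(n)
    lam = 0.1 * np.max(np.abs(A.T @ b))            # regularization level
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(d)
    prev_support = None
    for k in range(500):
        x = soft_threshold(x - (1 / L) * A.T @ (A @ x - b), lam / L)   # proximal gradient step
        support = frozenset(np.flatnonzero(x))
        if support != prev_support:
            print(f"iteration {k}: support changed, now {len(support)} active coordinates")
            prev_support = support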

Secondly, distributed problems directly induce structure, as the global system state lives in the product space corresponding to the replication of the master variable at the agents. Direct extensions of optimization methods to this kind of setup then suffer from the burden of synchronization: all local computations have to be gathered before another round can begin. This bottleneck was recently tackled, both algorithmically and in terms of analysis, from different viewpoints (stochastic fixed point theory [3, 19], direction aggregation [24], iterate aggregation [15]). With the synchrony constraint partially lifted, an important topic is to make these systems more robust and better adapted to the real-life issues that arise on a network of hand-held devices. The study of the resilience to faulty/malicious updates and of the adaptation to evolving architectures/data/loss functions corresponds to Axis 2 of this project.

The two kinds of structure presented above may seem very different at first glance, but they actually lead to similar theory and algorithms. A piece of technical evidence for this claim is the following: standard coordinate descent on the proximal gradient algorithm requires a separable regularizer (see [7]); handling coordinates jointly is the blocking point, and dealing with messily updated coordinates is also what asynchronous distributed methods are about. This structural link between coordinate descent methods and distributed optimization is the central thread of this project. Exploiting this analogy, harnessing the structure brought by regularization to reduce the cost of exchanges in distributed architectures is Axis 3 of this project.
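The following minimal sketch (again on synthetic lasso data) makes the separability requirement concrete: randomized proximal coordinate descent works here only because the L1 norm is separable, so that each coordinate can be updated on its own by a scalar soft-thresholding.

    # Minimal sketch: randomized proximal coordinate descent on a synthetic lasso problem.
    # Separability of the L1 norm is what allows a single coordinate to be updated with a
    # partial derivative and a scalar soft-thresholding.
    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 100, 300
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    lam = 0.1 * np.max(np.abs(A.T @ b))
    Li = np.sum(A ** 2, axis=0)                    # coordinate-wise Lipschitz constants

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(d)
    residual = A @ x - b                           # maintained incrementally
    for k in range(20000):
        i = rng.integers(d)                        # pick one coordinate uniformly at random
        grad_i = A[:, i] @ residual                # partial derivative of the data-fit term
        new_xi = soft_threshold(x[i] - grad_i / Li[i], lam / Li[i])
        residual += (new_xi - x[i]) * A[:, i]      # cheap residual update
        x[i] = new_xi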

Objectives

The targets of this project are divided into three axes:

Axis 1: Faster optimization algorithms using identification.

Optimization methods may be computationally accelerated either by reducing the cost of each iteration or by performing fewer iterations, which splits this axis into two parts.

1a Reducing the iteration cost may be done by discarding variables or by limiting the update to a subset of the coordinates. This has proven very efficient in practice [13, 18]; however, to fully benefit from these methods, the problem at hand has to have some coordinate structure (i.e. natural sparsity, as brought by the L1 norm). For general regularizers, the question of how to efficiently reduce the iteration cost using more general structures is still open. A solution, which can be linked to sketching [6], is to update the iterate only along some subspace chosen adaptively by screening the identified subspaces, as sketched below.
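A minimal sketch of such an adaptive subspace update is given below, on a synthetic L1-regularized problem and with a simple heuristic selection rule (an illustration, not the project's algorithm): the proximal gradient step is restricted to the currently identified support plus a few randomly drawn exploration coordinates, so the per-iteration cost scales with the size of this subspace rather than with the full dimension.

    # Minimal sketch: proximal gradient restricted to an adaptively chosen subspace
    # (current support + random exploration coordinates) on a synthetic lasso problem.
    # Illustrative heuristic only.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 500
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    lam = 0.2 * np.max(np.abs(A.T @ b))
    L = np.linalg.norm(A, 2) ** 2

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(d)
    for k in range(1000):
        nz = np.flatnonzero(x)
        residual = A[:, nz] @ x[nz] - b                    # uses only the active columns of A
        explore = rng.choice(d, size=20, replace=False)    # random exploration coordinates
        S = np.union1d(nz, explore)
        grad_S = A[:, S].T @ residual                      # partial gradient on the subspace S only
        x[S] = soft_threshold(x[S] - grad_S / L, lam / L)  # prox step restricted to S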

1b Performing fewer iterations by incorporating predictive inertial steps has been immensely popular since Nesterov's fast gradient [17] and Beck and Teboulle's FISTA [2]. Efficient methods taking into account the local geometry of the problem have been proposed (see our recent contributions [20, 21] and references therein). However, inertial methods are still structure-blind and can actually slow down identification by making the iterates leave an otherwise stable subspace. We thus aim at developing structure-adapted inertial algorithms.
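For reference, here is a minimal sketch of FISTA-style inertia on a synthetic lasso problem: the extrapolated point y may leave a sparsity pattern that the plain proximal gradient iterates had already identified, which is precisely the behaviour a structure-adapted inertial scheme should avoid.

    # Minimal sketch: FISTA (inertial proximal gradient) on a synthetic lasso problem.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 300
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    lam = 0.1 * np.max(np.abs(A.T @ b))
    L = np.linalg.norm(A, 2) ** 2

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(d)
    y = x.copy()
    t = 1.0
    for k in range(300):
        x_new = soft_threshold(y - (1 / L) * A.T @ (A @ y - b), lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)   # inertial extrapolation
        x, t = x_new, t_new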

Axis 2: Resilient Distributed Learning.

Tackling massive learning problems over distributed architectures calls for a change in objective formulation, new mathematical tools, and new performance measures. This need stems from the dynamics and heterogeneity of modern distributed learning systems: networks of hand-held devices, scattered data servers, etc. In addition, the idea of extracting information globally, at scale, from local contributions naturally links with questions of data/model privacy and of their representation, so as to make the whole system less prone to attacks and robust to loosely connected agents. The target of this research axis is to build on our recent progress in distributed optimization [14, 15] to produce robust distributed learning systems.

2a First, an important issue is to correctly handle the arrival of new information in the system; this leads to the problem of distributed asynchronous online learning and of its robustness to adversarial players.

2b A second goal is to change the structure of the problem at hand: instead of minimizing the full loss as on a single machine, minimize some distributed loss adapted to the system. A direction to construct such a loss is to build on the connected ideas of i) the geometric median of the local minimizers of the local losses, i.e. the median of means (see e.g. [16] and the recent [10]), and ii) the proximal average ([1], with an application in learning in [26]), which can be seen as the function minimized if the agents performed a full minimization of some regularization of their loss and averaged the results. A common denominator of these techniques is that they focus on local rather than global minimization; furthermore, the median of means naturally leads to robust distributed systems.
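A minimal sketch of idea i), with hypothetical local models standing in for the agents' local minimizers: the aggregation uses their geometric median, computed with a basic Weiszfeld iteration, and is far less sensitive to a faulty agent than a plain average.

    # Minimal sketch: aggregating agents' local models by their geometric median
    # (Weiszfeld iterations); one faulty agent barely moves the aggregate.
    import numpy as np

    def geometric_median(points, iters=100, eps=1e-8):
        """Weiszfeld iterations for the geometric median of the rows of `points`."""
        z = points.mean(axis=0)
        for _ in range(iters):
            dist = np.linalg.norm(points - z, axis=1)
            w = 1.0 / np.maximum(dist, eps)
            z = (w[:, None] * points).sum(axis=0) / w.sum()
        return z

    rng = np.random.default_rng(0)
    local_models = 1.0 + 0.1 * rng.standard_normal((10, 5))     # 10 agents, honest models near all-ones
    local_models[0] = 100.0                                     # one faulty/adversarial agent
    print("average:         ", local_models.mean(axis=0))       # dragged away by the outlier
    print("geometric median:", geometric_median(local_models))  # stays near the honest models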

Axis 3: Communication-efficient distributed methods.

In large-scale learning, stochastic optimization algorithms are very popular, but their parallel counterparts (e.g. [9]) are highly iterative and thus require low-latency and high-throughput connections. The distributed systems considered in this project have connections with significantly higher latency, lower reliability, and lower throughput, which rehabilitates batch algorithms and makes communication the practical bottleneck of the learning process [25].

3a First, as part of the Ph.D. of Dmitry Grishchenko, we work on sparsification techniques for distributed asynchronous optimization. Direct randomized sparsification may however be excessively slow in terms of minimizing the objective; adapting the sparsification to the identified structure of the iterates gave very encouraging preliminary results (unpublished, presented at SMAI-MODE 2018 and ISMP 2018), as sketched below. Designing automatic dimension reduction methods that make distributed systems communication-efficient for a large class of regularized learning problems is still an exciting challenge, which can take full advantage of part 1a of the project.
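Here is a minimal sketch of this idea with a simulated master and synchronous workers (an illustration, not the asynchronous framework of [14, 15]): each worker sends only the non-zero coordinates of its local proximal gradient update, so the message size shrinks as the iterates identify a sparse pattern.

    # Minimal sketch: structure-adapted sparsified communication with a simulated master
    # and workers; only the non-zero coordinates of each local update are transmitted.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, workers = 50, 200, 4
    As = [rng.standard_normal((n, d)) for _ in range(workers)]
    bs = [rng.standard_normal(n) for _ in range(workers)]
    lam = 0.3 * max(np.max(np.abs(A.T @ b)) for A, b in zip(As, bs))
    L = max(np.linalg.norm(A, 2) ** 2 for A in As)

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(d)                                    # master's model
    for k in range(200):
        messages = []
        for A, b in zip(As, bs):
            local = soft_threshold(x - (1 / L) * A.T @ (A @ x - b), lam / L)
            idx = np.flatnonzero(local)                # only these coordinates are sent
            messages.append((idx, local[idx]))
        x = np.zeros(d)
        for idx, vals in messages:                     # master averages the sparse messages
            x[idx] += vals / workers
        if k % 50 == 0:
            sent = sum(len(idx) for idx, _ in messages)
            print(f"round {k}: {sent} coordinates sent (dense messages would total {workers * d})")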

3b Then, mirroring 2a and 2b, communications in the system can be drastically decreased by shifting from optimization-adapted communications to learning-adapted communications. Indeed, useful models can be obtained without a high-precision solution of the global risk and with very limited communication. For instance, if all the agents compute the solution of their local risk minimization problem and send it to a master machine, the master can compute the median-of-means global model and thus obtain robust global information from a single exchange with the agents.

References

[1] H. H. Bauschke, R. Goebel, Y. Lucet, and X. Wang. The proximal average: basic theory. SIAM Journal on Optimization, 19(2):766–785, 2008.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.

[3] P. Bianchi, W. Hachem, and F. Iutzeler. A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization. IEEE Transactions on Automatic Control, 61(10):2947–2957, 2016.

[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[5] J. Fadili, J. Malick, and G. Peyré. Sensitivity analysis for mirror-stratifiable convex functions. arXiv:1707.03194, 2017.

[6] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.

[7] F. Hanzely, K. Mishchenko, and P. Richtárik. SEGA: variance reduction via gradient sketching. arXiv:1809.03054, 2018.

[8] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: strategies for improving communication efficiency. arXiv:1610.05492, 2016.

[9] R. Leblond, F. Pedregosa, and S. Lacoste-Julien. ASAGA: asynchronous parallel SAGA. In Artificial Intelligence and Statistics, pages 46–54, 2017.

[10] G. Lecué and M. Lerasle. Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306, 2017.

[11] J. Liang, J. Fadili, and G. Peyré. Local linear convergence of forward–backward under partial smoothness. In Advances in Neural Information Processing Systems, pages 1970–1978, 2014.

[12] J. Mairal. Optimization with first-order surrogate functions. In International Conference on Machine Learning, pages 783–791, 2013.

[13] M. Massias, J. Salmon, and A. Gramfort. Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3321–3330, 2018.

[14] K. Mishchenko, F. Iutzeler, and J. Malick. A distributed flexible delay-tolerant proximal gradient algorithm. arXiv:1806.09429, 2018.

[15] K. Mishchenko, F. Iutzeler, J. Malick, and M.-R. Amini. A delay-tolerant proximal-gradient algorithm for distributed learning. In International Conference on Machine Learning, pages 3584–3592, 2018.

[16] A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. 1983.

[17] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

[18] J. Nutini, I. Laradji, and M. Schmidt. Let’s make block coordinate descent go fast: Faster greedy rules, message-passing, active-set complexity, and superlinear convergence. arXiv:1712.08859, 2017.

[19] T. Sun, R. Hannah, and W. Yin. Asynchronous coordinate descent under more realistic assumptions. In Advances in Neural Information Processing Systems, pages 6183–6191, 2017.

[20] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, pages 1–23, 2017.

[21] F. Iutzeler and J. Malick. On the proximal gradient algorithm with alternated inertia. Journal of Optimization Theory and Applications, 176(3):688–710, 2018.

[22] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[23] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.

[24] N. D. Vanli, M. Gurbuzbalaban, and A. Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. SIAM Journal on Optimization, 28(2):1282–1300, 2018.

[25] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv:1710.09854, 2017.

[26] Y.-L. Yu. Better approximation and faster algorithm using the proximal average. In Advances in neural information processing systems, pages 458–466, 2013.