Large-scale machine learning tasks are necessarily run on clusters of machines “in the cloud”. The by-now-classical Hadoop approach to distributed job processing on clusters has seen significant evolution aimed at better handling ML tasks. One notable solution is the Spark architecture, based on so-called “Resilient Distributed Datasets”, and its extensions, e.g. “GraphX” for graph algorithms, “Shark” for SQL queries, and “D-Streams” for streaming computations.
There remains however a huge scope for further improving the performance of ML job processing in the cloud. Such improvements will come from the platform architecture side, the ML algorithms side, and the interaction between the two.
In this project we intend to focus on the ML algorithms side, and develop methods that exploit the flexibility of the cloud architecture as well as hedge against its inherent unpredictability. We have identified two sets of objectives pertaining to supervised learning on the one hand and unsupervised learning on the other hand. They are both relevant to the training of deep neural networks.
1. Supervised learning in the cloud
The classical approaches based on convex optimization for learning a classifier from labelled data can be revisited to exploit the cloud platform. When data and computations are local to machines that communicate over a network, the usual model of computation, in which (potentially stochastic) gradients are accessed sequentially, is no longer appropriate. In this context, Bubeck and Lee propose to augment stochastic gradient algorithms by performing suitable computations while waiting for a new gradient estimate to be returned. These computations in turn allow the algorithm to better select the next point to move to, resulting in a significant speedup over state-of-the-art implementations.
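To make the delayed-gradient setting concrete, the following is a minimal single-process simulation of SGD with stale gradients on a least-squares problem. It is our own toy sketch (the function name `delayed_sgd` and all parameters are illustrative), not the Bubeck–Lee scheme itself: each gradient is evaluated at an iterate several steps old, mimicking cloud workers whose responses arrive with a lag.

```python
import numpy as np

def delayed_sgd(A, b, delay=2, steps=2000, lr=0.05):
    """SGD on 0.5*||Ax - b||^2 where each stochastic gradient is
    computed at an iterate `delay` steps old, mimicking workers
    that return stale gradients over a network."""
    n, d = A.shape
    x = np.zeros(d)
    history = [x.copy()]
    rng = np.random.default_rng(0)
    for _ in range(steps):
        # gradient evaluated at a stale iterate (delayed response)
        stale = history[max(0, len(history) - 1 - delay)]
        i = rng.integers(n)                  # sample one data point
        g = A[i] * (A[i] @ stale - b[i])     # stochastic gradient at the stale point
        x = x - lr * g
        history.append(x.copy())
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true
x_hat = delayed_sgd(A, b, delay=3)
print(np.linalg.norm(x_hat - x_true))  # small despite the stale gradients
```

With a small enough step size the iterates still converge despite the staleness; the interesting algorithmic question raised above is what useful work to do during the waiting time instead of idling.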
A systematic investigation will be undertaken to identify efficient algorithms, how they are affected by randomness in platform response times, and how they can adapt to it. Beyond the algorithms themselves, a by-product of this study will be a better understanding of the benefits of letting the platform give more control to the algorithm: for instance, the platform might provide estimates of response times, or allow the algorithm to change those response times by adjusting data replication or job scheduling among cluster nodes.
A more formal description of the objective is as follows. The goal is to find a model of computation which (a) is flexible enough to encompass common distributed architectures, and (b) leads to an analysis which is as clean as in the serial case, in particular in terms of adaptivity to the properties of the learning problem (e.g., smoothness or strong convexity) and, now, to the specificities of the hardware architecture (delays between machines, variations in processing speed, memory accesses, etc.).
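One concrete instance of such a model (illustrative notation of our own, not taken from a specific reference) lets each gradient be evaluated at a stale iterate, with a random staleness capturing delays between machines:

```latex
x_{t+1} \;=\; x_t \;-\; \eta_t \, g\!\left(x_{t-\tau_t}\right),
\qquad
\mathbb{E}\!\left[\, g(x) \mid x \,\right] \;=\; \nabla f(x),
```

where $\tau_t \ge 0$ is the (possibly random) delay of the worker returning a gradient at step $t$. A clean analysis in this model would express convergence rates jointly in terms of properties of $f$ (smoothness, strong convexity) and of the delay distribution.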
2. Unsupervised learning in the cloud
There are also many opportunities in this space. We plan to consider in particular recently developed community detection techniques, based either on non-standard spectral methods or on semi-definite programming. Fast approaches to solving the latter, which involve a gradient-descent primitive, have recently been proposed. We will develop algorithms that implement the above schemes efficiently on graph-analytics cloud platforms such as GraphX.
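As a baseline illustration of spectral community detection, the sketch below recovers two planted communities in a stochastic block model from the sign pattern of an adjacency eigenvector. This is the plain adjacency spectral method, not the non-standard spectral variants the project targets, and all function names here are our own:

```python
import numpy as np

def sbm(n, p_in, p_out, rng):
    """Adjacency matrix of a two-community stochastic block model."""
    labels = np.array([0] * (n // 2) + [1] * (n - n // 2))
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, 1)  # undirected, no self-loops
    return (upper + upper.T).astype(float), labels

def spectral_partition(A):
    """Partition by the sign of the eigenvector for the second-largest
    eigenvalue of the adjacency matrix."""
    vals, vecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    return (vecs[:, -2] > 0).astype(int)

rng = np.random.default_rng(0)
A, labels = sbm(200, p_in=0.5, p_out=0.05, rng=rng)
pred = spectral_partition(A)
# accuracy up to the global label swap
acc = max(np.mean(pred == labels), np.mean(pred != labels))
print(acc)
```

On this well-separated instance the partition is essentially exact; the harder regimes with weak separation are precisely where non-standard spectral methods and SDP relaxations become relevant.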
- Parallel combinatorial optimization: graph-cut techniques are major tools in image segmentation (e.g., in medical imaging), with widespread use of max-flow algorithms, which are hard to parallelize. Using reformulations of combinatorial optimization problems as convex problems (which are exact for submodular problems), we plan to apply new distributed convex-optimization techniques to obtain simple parallel algorithms, which will require adaptation to be competitive.
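For reference, the sequential max-flow primitive in question can be stated in a few dozen lines. The sketch below is a textbook Edmonds–Karp implementation on a toy capacity matrix (our own illustrative example, not the distributed convex reformulation we plan to develop); by max-flow/min-cut duality its value equals the minimum s-t cut used in graph-cut segmentation:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp max-flow on an adjacency-matrix capacity graph."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total          # no augmenting path left: flow is maximal
        # bottleneck capacity along the path
        bottleneck = float("inf")
        v = t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        # push the bottleneck along the path (reverse entries give residuals)
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

# toy 4-node graph: source 0, sink 3
cap = [
    [0, 3, 2, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
print(max_flow(cap, 0, 3))  # → 5
```

The augmenting-path loop is inherently sequential, which illustrates why the convex (Lovász-extension-style) reformulations mentioned above are attractive: gradient-type updates on the relaxation decompose across machines in a way that path augmentation does not.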