Organized by Emmanuel Abbe (Princeton University), Sébastien Bubeck (MSR), Marc Lelarge (Inria) and Laurent Massoulié (MSR-Inria Joint Centre).

The workshop will be devoted to questions of learning, information and complexity in the context of networks. Specific topics of interest include: inference of network structure, community detection, synthesis of efficient random forests and deep neural networks, approaches from statistical physics, and limit objects of graph sequences.

**Attendance is open and free. As the number of registrations has reached the amphitheater’s capacity, we cannot guarantee that there will be room to fit would-be attendants who have not registered.**

Program (current version, subject to adjustments):

Wednesday May 18 | Thursday May 19 | Friday May 20 |

9:20 participants welcome | ||

9:30 Léon Bottou
Beyond statistical machine learning (slides) |
9:30 Gérard Biau
Collaborative Inference (slides) |
9:30 Jennifer Chayes
Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks, Part I |

10:20 Yann Ollivier
Invariance principles for robust learning. An illustration with recurrent neural networks |
10:20 Devavrat Shah
Blind regression (slides) |
10:20 Christian Borgs
Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks, Part II |

11:10-11:20 Coffee break | 11:10-11:20 Coffee break | 11:10-11:20 Coffee break |

11:20 Ronen Eldan
The power of depth for feedforward neural networks |
11:20 Sylvain Arlot
Analysis of purely random forest bias (slides) |
11:20 Amin Coja-Oghlan
The k-core revisited (slides) |

12:10-2:00 Lunch break and poster session 1 | 12:10-2:00 Lunch break and poster session 2 | 12:10-2:00 Lunch break |

2:00 Bruce Hajek
Algorithms and the computational gap for recovering a single community in a network (slides) |
2:00 Cris Moore
Information-theoretic bounds and phase transitions in community detection and high-dimensional clustering |
2:00 Yuval Peres
Rigidity and Tolerance for Perturbed Lattices |

2:50 Ravi Kannan
The planted Gaussian problem (slides) |
2:50 Elchanan Mossel
Shotgun assembly of graphs (slides) |
2:50 François Baccelli
Dynamical Systems on Point Processes and Geometric Routing in Stochastic Networks |

3:40-4:00 Coffee break | 3:40-4:00 Coffee break | 3:40-4:00 Coffee break |

4:00 Riccardo Zecchina
Accessible dense regions of solutions in the weights space of neural networks: general algorithmic schemes from a non-Gibbs measure |
4:00 Lenka Zdeborová
Solvable model of unsupervised feature learning (slides) |
4:00 Florent Krzakala
Mutual information and Phase transitions in Low Rank matrix Factorizations (slides) |

4:50 Ulrike von Luxburg
Geometry of unweighted k-nearest neighbor graphs |
4:50 Andrea Montanari
Phase transitions in semi-definite programming |
4:50 Ruediger Urbanke
TBD |

5:50 Workshop close | ||

Poster presenters:

**Caterina de Bacco**, Santa Fe Institute:

Community detection on multilayer networks

**Jess Banks**, Santa Fe Institute:

Information-theoretic thresholds for community detection in sparse networks

**Laura Florescu**, New York University:

Spectral thresholds in the bipartite SBM

**Mohsen Bayati**, Stanford University:

Online Decision-Making with High-Dimensional Covariates

**Robin Lamarche-Perrin**, UPMC:

Information-theoretic Compression of Weighted Graphs (joint work with Lionel Tabourier and Fabien Tarissan)

**Frederik Mallmann**, ENS:

Distance in the Forest Fire Model: How far are you from Eve?

**Jiaming Xu**, Simons Institute, UC Berkeley:

TBD

Titles and abstracts:

**Sylvain Arlot (U. Paris Sud): Analysis of purely random forests bias**.

Abstract: Random forests (Breiman, 2001) are a very effective and commonly used statistical method, but their full theoretical analysis is still an open problem.

As a first step, simplified models such as purely random forests have been introduced, in order to shed light on the good performance of Breiman’s random forests.

In the regression framework, the quadratic risk of a purely random forest can be written as the sum of two terms, which can be understood as an approximation error and an estimation error. Robin Genuer (2010) studied how the estimation error decreases when the number of trees increases for some specific model. In this talk, we study the approximation error (the bias) of some purely random forest models in a regression framework, focusing in particular on the influence of the size of each tree and of the number of trees in the forest.

Under some regularity assumptions on the regression function, we show that the bias of an infinite forest decreases at a faster rate (with respect to the size of each tree) than a single tree. As a consequence, infinite forests attain a strictly better risk rate (with respect to the sample size) than single trees.

This talk is based on a joint work with Robin Genuer. http://arxiv.org/abs/1407.3939

**François Baccelli (INRIA- UT Austin): Dynamical Systems on Point Processes and Geometric Routing in Stochastic Networks**

Abstract: This talk is motivated by the study of geometric routing algorithms used for navigating stationary point processes. The mathematical abstraction for such a navigation is a class of non-measure preserving dynamical systems on counting measures called point-maps. The talk will focus on two objects associated with a point map $f$ acting on a stationary point process $\Phi$:

* The $f$-probabilities of $\Phi$, which can be interpreted as the stationary regimes of the routing algorithm $f$ on $\Phi$. These probabilities are defined from the compactification of the action of the semigroup of point-map translations on the space of Palm probabilities. The $f$-probabilities of $\Phi$ are not always Palm distributions.

* The $f$-foliation of $\Phi$, a partition of the support of $\Phi$ which is the discrete analogue of the stable manifold of $f$, i.e., the leaves of the foliation are the points of $\Phi$ with the same asymptotic fate for $f$. These leaves are not always stationary point processes. There always exists a point-map allowing one to navigate the leaves in a measure-preserving way.

**Gérard Biau (UPMC): Collaborative Inference**

(based on joint work with B. Cadre and K. Bleakley)

Abstract: The statistical analysis of massive and complex data sets will require the development of algorithms that depend on distributed computing and collaborative inference. Inspired by this, we propose a collaborative framework that aims to estimate the unknown mean $\theta$ of a random variable $X$. In the model we present, a certain number of calculation units, distributed across a communication network represented by a graph, participate in the estimation of $\theta$ by sequentially receiving independent data from $X$ while exchanging messages via a stochastic matrix $A$ defined over the graph. We give precise conditions on the matrix $A$ under which the statistical precision of the individual units is comparable to that of a (gold standard) virtual centralized estimate, even though each unit does not have access to all of the data. We show in particular the fundamental role played by both the non-trivial eigenvalues of $A$ and the Ramanujan class of expander graphs, which provide remarkable performance for moderate algorithmic cost.

**Christian Borgs (MSR): Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks – Part II**

Abstract: There are numerous examples of sparse massive networks, in particular the Internet, WWW and online social networks. How do we model and learn these networks? In contrast to conventional learning problems, where we have many independent samples, it is often the case for these networks that we can get only one independent sample. How do we use a single snapshot today to learn a model for the network, and therefore be able to predict a similar, but larger network in the future? In the case of relatively small or moderately sized networks, it’s appropriate to model the network parametrically, and attempt to learn these parameters. For massive networks, a non-parametric representation is more appropriate. In this talk, we first review the theory of graphons, developed over the last decade to describe limits of dense graphs, and the more the recent theory describing sparse graphs of unbounded average degree, including power-law graphs. We then show how to use these graphons as non-parametric models for sparse networks. Finally, we show how to get consistent estimators of these non-parametric models, and moreover how to do this in a way that protects the privacy of individuals on the network.

Part I of this talk reviews the theory of graph convergence for dense and sparse graphs. Part II uses the results of Part I to model and estimate massive networks.

**Léon Bottou (Facebook):** **Beyond statistical machine learning**

Abstract: The assumption that both training and test examples are sampled from the same distribution plays a central role in the theory of machine learning. This assumption used to be reasonable for many applications of machine learning. This presentation first provides evidence that the current practice is increasingly departing from this ideal. This is especially true in the deep learning community. Since this situation generates interesting conceptual challenges, the rest of the presentation describes my attempts to identify the fundamental causes of this departure, to describe the corresponding challenges and opportunities, and to predict the conceptual changes that are required to understand “post-statistical” machine learning and exploit the opportunities it offers.

** Jennifer Chayes (MSR): Graphons and Machine Learning: Modeling and Estimation of Sparse Massive Networks – Part I**

Abstract: There are numerous examples of sparse massive networks, in particular the Internet, WWW and online social networks. How do we model and learn these networks? In contrast to conventional learning problems, where we have many independent samples, it is often the case for these networks that we can get only one independent sample. How do we use a single snapshot today to learn a model for the network, and therefore be able to predict a similar, but larger network in the future? In the case of relatively small or moderately sized networks, it’s appropriate to model the network parametrically, and attempt to learn these parameters. For massive networks, a non-parametric representation is more appropriate. In this talk, we first review the theory of graphons, developed over the last decade to describe limits of dense graphs, and the more the recent theory describing sparse graphs of unbounded average degree, including power-law graphs. We then show how to use these graphons as non-parametric models for sparse networks. Finally, we show how to get consistent estimators of these non-parametric models, and moreover how to do this in a way that protects the privacy of individuals on the network.

Part I of this talk reviews the theory of graph convergence for dense and sparse graphs. Part II uses the results of Part I to model and estimate massive networks.

**Amin Coja-Oghlan (Mathematics Institute, Goethe University, Frankfurt): The k-core revisited**

Abstract: The k-core of a graph is the maximum subgraph of minimum degree k. In an important paper Pittel, Wormald and Spencer (JCTB 1996) identified the threshold for the emergence of a non-empty k-core in the random graph G(n,m) as well as the asymptotic size of the core beyond the threshold. This was followed up by a sequence of papers that determined, among other things, the asymptotic distribution of the number of vertices and edges in the k-core. In this talk, which is based on joint work with Oliver Cooley, Mihyun Kang and Kathrin Skubch, I will present a novel approach to the k-core problem that employs local weak convergence and the Warning Propagation message passing scheme.

** Ronen Eldan (University of Washington): The Power of Depth for Feedforward Neural Networks.**

Abstract: We construct simple functions on $\reals^d$, expressible by small 3-layer feedforward neural networks, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for a large class of activation functions, including rectified linear units and sigmoids, and is a formal demonstration that depth — even if increased by 1 — can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different. Joint work with Ohad Shamir.

**Bruce Hajek (UIUC): Algorithms and the Computational Gap for Recovering a Single Community in a Network**

Abstract: The stochastic block model for one community is considered. Out of n vertices a community consisting of K vertices is drawn uniformly at random; two vertices are connected by an edge with probability p if they both belong to the community and with probability q otherwise, where p > q > 0 and p/q is assumed to be bounded. A critical role is played by the effective signal-to-noise ratio K^2(p-q)^2/((n-K)q). The main focus of this work is on weak recovery, with o(K) misclassified vertices on average, in the sublinear regime K=o(n). It is explained that most results readily translate to the problem of exact recovery with high probability if a linear time voting procedure is used for cleanup. It is shown that weak recovery is provided by a belief propagation algorithm running in near linear time if the signal-to-noise ratio exceeds 1/e, and, conversely, if the signal-to-noise ratio is less than 1/e then no local algorithm can provide weak recovery. It is argued that the belief propagation algorithm outperforms spectral methods by a factor of e in signal-to-noise ratio. Similar results are obtained for the problem with Gaussian observations with an elevated mean for pairs of vertices in the same community. It is found that these iterative algorithms, and also a semidefinite programming (SDP) algorithm, fall short of the information theoretic limit for recovery for K=o(n/\log n) and achieve the information limit for K=\omega(n/\log n).

Joint work with Yihong Wu and Jiaming Xu.

**Ravi Kannan (MSR): The planted Gaussian problem**

Abstract: Suppose A is a n by n matrix whose random entries are each N(0,1) except in a k by k principal minor (called the planted part) where they are each N(\mu,\sigma^2). For what values of \mu,\sigma, can we recover the planted part even when k \in o(\sqrt n) ? For \mu =0, we will show that if \sigma^2 \geq 2, indeed, we can do this. Also, we show that if \mu=0 and \sigma^2 < 2, then no polynomial time statistical algorithm can do this. We consider questions with other values of \mu,\sigma^2 and the connection, if any, of the problem to the Planted Clique problem and the spiked vector model.

Joint work with Santosh Vempala

** Florent Krzakala (ENS): Mutual information and Phase transitions in Low Rank matrix Factorizations**

Abstract: We consider a probabilistic estimation of a low-rank matrix from non-linear element-wise measurements, a problem to which sparse PCA, sub-matrix localization, or the detection of communities hidden in a dense stochastic block model, can be mapped. Using the cavity method from statistical physics, we show hat the MMSE depends on the output channel only trough a single parameter: its Fisher information. We characterize the minimum mean squared error (MMSE) achievable information theoretically and with the message passing algorithm. Finally, we also show how to prove rigorously a large part of these results.

**Andrea Montanari (Stanford University): Phase transitions in semidefinite programming **

Abstract: Semidefinite programming (SDP) relaxations are among the most powerful algorithmic tools for solving high-dimensional statistical estimation problems. They are surprisingly well-suited for a broad range of tasks where data take the form of matrices or graphs. It has been observed several times that, when the ‘statistical noise’ is small enough, SDP relaxations correctly detect the underlying combinatorial structures.

I will present asymptotic predictions for several ‘detection thresholds,’ as well as for the estimation error above these thresholds. In particular, I will consider classical SDP relaxations for statistical problems motivated by graph synchronization and community detection in networks.

[Based on joint papers with Suhabrata Sen, and with Adel Javanmard and Federico Ricci-Tersenghi]

** Cris Moore (Santa Fe Institute): Information-theoretic bounds and phase transitions in community detection and high-dimensional clustering**

Over the past decade, a rich interaction has grown up between statistical physics and statistical inference. In particular, physics often predicts phase transitions where, if a data set becomes too sparse or too noisy, it suddenly becomes impossible to find its underlying structure, or even to distinguish it from a “null model” with no structure at all. For instance, in the community detection problem in networks, there is a transition below which the vertices cannot be labeled better than chance, and the network cannot be distinguished from an Erdos-Renyi random graph. Similarly, in the clustering problem in Gaussian mixture models, it becomes impossible to find the clusters, or even to tell whether clusters exist, i.e., to distinguish the data from a single Gaussian.

Many of the physics predictions have been made rigorous, but there is still a very lively interchange between the fields, both about these transitions and about asymptotically optimal algorithms. In particular, while efficient message-passing and spectral algorithms are known that succeed down to the so-called Kesten-Stigum bound, in a number of problems we believe that it is information-theoretically possible (but perhaps not computationally feasible) to go further. I will give a friendly introduction to this field, and present some new results proving this conjecture.

This is joint work with Jess Banks, Joe Neeman, Praneeth Netrapalli, and Jiaming Xu.

**Elchanan Mossel (Wharton University of Pennsylvania): Shotgun Assembly of Graphs**

Abstract: We will present some results and some open problems related to shotgun assembly of graphs for random generating models.

Shotgun assembly of graphs is the problem of recovering a random graph or a randomly labelled graphs from small pieces.

The question of shotgun assembly presents novel problems in random graphs, percolation, and random constraint satisfaction problems.

Based on joint works with Nathan Ross, Nike Sun and Uri Feige.

**Joe Neeman (UT Austin): Robust reconstruction and optimal detection in the stochastic block model**

Abstract: Consider a two-type branching process with Poisson offspring distribution. In the tree reconstruction problem, we get to observe the types of the current generation and we try to guess the type of their long-ago ancestor. In the robust tree reconstruction problem, we only get noisy observations of the current generation’s types. Remarkably, this extra noise doesn’t seem to affect the accuracy with which we can guess the ancestor’s type. We will discuss this phenomenon and an application to community detection in the stochastic block model.

**Yann Ollivier (CNRS): Invariance principles for robust learning. An illustration with recurrent neural networks.**

Abstract: The optimization methods used to learn models of data are often not invariant under simple changes in the representation of data or of intermediate variables. For instance, for neural networks, using neural activities in [0;1] or in [-1;1] can lead to very different final performance even though the two representations are isomorphic. Here we show how information theory, together with a Riemannian geometric viewpoint emphasizing independence from the details of data representation, leads to new, scalable algorithms for training models of sequential data, which detect more complex patterns and use fewer training samples.

For the talk, no familiarity will be assumed with Riemannian geometry, neural networks, information theory, or statistical learning.

**Yuval Peres (MSR): Rigidity and Tolerance for Perturbed Lattices**

Abstract: Consider a perturbed lattice {v+Y_v} obtained by adding IID d-dimensional Gaussian variables {Y_v} to the lattice points in Z^d.

Suppose that one point, say Y_0, is removed from this perturbed lattice; is it possible for an observer, who sees just the remaining points, to detect that a point is missing?

In one and two dimensions, the answer is positive: the two point processes (before and after Y_0 is removed) can be distinguished by counting points in a large ball and averaging over its radius (cf. Sodin-Tsireslon (2004) and Holroyd and Soo (2011) ). The situation in higher dimensions is more delicate, as this counting approach fails; our solution depends on a game-theoretic idea, in one direction, and on the unpredictable paths constructed by Benjamini, Pemantle and the speaker (1998), in the other. At the end of the talk I will discuss the (speculative) relevance of the solution to the planted clique problem. (Joint work with Allan Sly).

**Devavrat Shah (MIT): Blind Regression**

Abstract: We introduce the framework of *Blind Regression *motivated by the problem of *Matrix Completion *for recommendation systems: given n users and m movies, the goal is to predict the unknown rating of a user for a movie using known observations, i.e. completing the partially observed matrix. Following the framework of non-parametric statistics, we posit that user u and movie i have features x1 (u) and x2 (i) respectively, and their corresponding rating yui = f (x1 (u), x2 (i)) for some unknown function f. In the setting of classical regression, for each known rating yui, the features x = (x1 (u), x2 (i)) are observed and the goal is to estimate function f using this data so as to predict the unknown ratings. In the setting of Matrix Completion, we *do not *observe the features, thus termed *Blind Regression*. This makes it challenging to predict the rating for an unknown user-movie pair.

Using inspiration from the classical Taylor’s expansion for differentiable functions, we provide a prediction algorithm that is consistent for all Lipschitz continuous functions. We provide finite sample analysis that suggests that even when observing a vanishing fraction of the matrix, the algorithm produces accurate predictions. Connections to classical collaborative filtering algorithm as well as Tensor completion will be explained.

Joint work with Christina Lee, Yihua Li and Dogyoon Song (MIT)

**Ruediger Urbanke (EPFL): TBD**

**Ulrike von Luxburg (Universität Tübingen): Geometry of unweighted k-nearest neighbor graphs**

Consider an unweighted k-nearest neighbor graph that has been built on a random sample from some unknown density p on R^d. Assume we are given nothing but the unweighted (!) adjacency matrix of the graph: we know who are the k nearest neighbors of whom, but we do not know the point locations or any distances or similarity values between the points. Is it then possible to recover the original point configuration or estimate the underlying density p, just from the adjacency matrix of the unweighted graph? I will present the answer to the question in my talk, and well as relations to the problem of ordinal embedding.

**Lenka Zdeborova (CEA): Solvable model of unsupervised feature learning**

Abstract: We introduce factorization of certain randomly generated matrices as a probabilistic model

for unsupervised feature learning. Using the replica method we obtain an exact closed formula

for the minimum mean-squared error and sample complexity for reconstruction of the features

from data. An alternative way to derive this formula is via state evolution of the associated

approximate message passing. Analyzing the fixed points of this formula we identify a gap

between computationally tractable and information theoretically optimal inference, associated to a first order phase transition. The model can serve for benchmarking generic-purpose feature learning algorithms, its analysis provides insight into theoretical understanding of unsupervised learning from data.

**Riccardo Zecchina (Politecnico di Torino): TBD**