Session II.4 - Foundations of Data Science and Machine Learning
Talks
Thursday, June 15, 14:30 ~ 15:00
On structured linear measurements for tensor data recovery
Liza Rebrova
Princeton University, USA
Data-oblivious measurements form an important branch of low-rank data compression and recovery techniques, frequently used in streaming settings and within iterative algorithms. Typically, linear data-oblivious measurements involve some version of a random sketch that preserves the geometric properties of the data. When the data is tensorial, a special challenge is to create a sketch whose structure reflects the tensor structure: this way, it can work similarly to a dense random sketch matrix while requiring much less memory to store and allowing more efficient application. I will talk about recently proposed streaming sketch-based approaches for computing low-rank Tucker approximations of large tensors, both ours and others' (including those of Tropp, Udell et al. and De Lathauwer et al.). I will discuss our new generalized theoretical guarantees proving their accuracy, with high probability, on full-rank and noisy data for a wide range of measurements.
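As a point of reference, the following is a minimal numpy sketch of one structured, data-oblivious measurement scheme of this kind: each mode unfolding is hit with a Gaussian test matrix, and the resulting orthonormal bases define a Tucker approximation. It is a generic two-pass (HOSVD-style) variant, written for illustration only; the specific sketches and guarantees discussed in the talk differ.

    import numpy as np

    def unfold(X, mode):
        # Mode-k unfolding: rows indexed by the k-th mode.
        return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

    def sketched_tucker(X, ranks, seed=0):
        # Randomized Tucker approximation via mode-wise Gaussian sketches.
        rng = np.random.default_rng(seed)
        factors = []
        for k, r in enumerate(ranks):
            Xk = unfold(X, k)                       # (n_k, product of other dims)
            Omega = rng.standard_normal((Xk.shape[1], r))
            Q, _ = np.linalg.qr(Xk @ Omega)         # orthonormal range basis for mode k
            factors.append(Q)
        core = X
        for k, Q in enumerate(factors):             # contract X with each Q^T
            core = np.moveaxis(np.tensordot(Q.T, core, axes=(1, k)), 0, k)
        return core, factors                        # X is approximated by core multiplied by each Q_k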
Thursday, June 15, 15:00 ~ 15:30
Dimension-free limits of stochastic gradient descent for two-layer neural networks
Bruno Loureiro
École Normale Supérieure, France
Stochastic gradient descent and its variants are the workhorse of modern machine learning models. Despite its current (and successful) use for optimising non-convex problems associated with the training of heavily overparametrised models, most of our theoretical understanding is bound to the context of convex problems. In this talk, I will discuss some recent progress in understanding the SGD dynamics of perhaps one of the simplest non-convex problems: two-layer neural networks. In particular, I will discuss different regimes (classical, high-dimensional and overparametrised) where one can derive a set of low-dimensional “state evolution” equations describing the evolution of the sufficient statistics for the weights. Finally, I will discuss some interesting behaviour associated with each regime, and the connections to other descriptions such as the mean-field limit.
Thursday, June 15, 15:30 ~ 16:00
To split or not to split, that is the question: From cross validation to debiased machine learning
Morgane Austern
Harvard, United States
Data splitting is a ubiquitous method in statistics, with examples ranging from cross validation to cross-fitting. However, despite its prevalence, theoretical guidance regarding its use is still lacking. In this talk we will explore two examples and establish an asymptotic theory for it. In the first part of this talk, we study cross-validation, a ubiquitous method for risk estimation, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk, which enable us to compute asymptotically accurate confidence intervals. Using our results, we study the statistical speed-up offered by cross validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds. In the second part of this talk, we study the role of cross-fitting in the generalized method of moments with moments that also depend on some auxiliary functions. Recent lines of work show how one can use generic machine learning estimators for these auxiliary problems while maintaining asymptotic normality and root-n consistency of the target parameter of interest. The literature typically requires that these auxiliary problems be fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties, sample splitting is not required. This allows for sample re-use, which can be beneficial in moderately sized sample regimes.
Joint work with Wenda Zhou (NYU and Flatiron), Jane Chen (Harvard) and Vasilis Syrgkanis (Stanford).
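As a small illustration of the objects in the first part (not the paper's estimator or interval), here is a numpy/scikit-learn sketch that computes a K-fold cross-validated risk and a naive normal-approximation confidence interval around it; the ridge model, the squared loss, and the plug-in variance are placeholder choices.

    import numpy as np
    from scipy.stats import norm
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    def cv_risk_ci(X, y, n_folds=5, alpha=0.05, seed=0):
        # K-fold cross-validated risk with a naive CLT-style confidence interval.
        losses = np.empty(len(y))
        folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
        for train, test in folds.split(X):
            model = Ridge().fit(X[train], y[train])
            losses[test] = (y[test] - model.predict(X[test])) ** 2
        risk = losses.mean()
        half = norm.ppf(1 - alpha / 2) * losses.std(ddof=1) / np.sqrt(len(y))
        return risk, (risk - half, risk + half)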
Thursday, June 15, 16:30 ~ 17:00
Spectral methods for clustering signed and directed networks and heterogeneous group synchronization
Mihai Cucuringu
University of Oxford, United Kingdom
Graph clustering problems typically arise in settings where there exists a discrepancy in the edge density within different parts of the graph. In this work, we consider several problem instances where the underlying cluster structure arises as a consequence of a signal present on the edges or on the nodes of the graph, and is not driven by edge density.
We first consider the problem of clustering in two important families of networks: signed and directed, both relatively less well explored than their unsigned and undirected counterparts. The two problems share an important common feature: they can be solved by exploiting the spectrum of certain graph Laplacian matrices or matrices derived from them. In signed networks, the edge weights between the nodes may take either positive or negative values, encoding a measure of similarity or dissimilarity. We consider a generalized eigenvalue problem involving graph Laplacians, and provide performance guarantees under a Signed Stochastic Block Model, along with regularized versions to handle very sparse graphs (below the connectivity threshold), a regime where standard spectral methods are known to underperform. We also propose a spectral clustering algorithm for directed graphs based on a complex-valued representation of the adjacency matrix, which is able to capture the underlying cluster structures for which the information encoded in the direction of the edges is crucial. We evaluate the proposed algorithm in terms of a cut flow imbalance-based objective function which, for a given pair of clusters, captures the propensity of the edges to flow in a given direction. We analyze its theoretical performance on a Directed Stochastic Block Model for digraphs, in which the cluster structure is given not only by variations in edge densities but also by the direction of the edges.
Finally, we discuss an extension of the classical angular synchronization problem that aims to recover unknown angles $\theta_1,\dots,\theta_n$ from a collection of noisy pairwise measurements of the form $(\theta_i - \theta_j) \mod 2\pi$, for each $\{i,j\} \in E$. We consider a generalization to the heterogeneous setting where there exist $k$ unknown groups of angles, and the measurement graph has an unknown edge-disjoint decomposition $G = G_1 \cup G_2 \cup \ldots \cup G_k$, where the $G_i$'s denote the subgraphs of noisy edge measurements corresponding to each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm for which we provide a detailed theoretical analysis in terms of robustness against sampling sparsity and noise.
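To make the directed-graph part of the abstract concrete, here is a minimal sketch of spectral clustering with a complex-valued (Hermitian) representation of the adjacency matrix; the specific choice $H = i(A - A^\top)$, the number of eigenvectors, and the k-means step are common simplifications, not necessarily the exact construction analyzed in the talk.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_digraph(A, k, seed=0):
        # Hermitian representation: H has real eigenvalues, and its leading
        # eigenvectors encode flow (direction) information rather than edge density.
        H = 1j * (A - A.T)
        vals, vecs = np.linalg.eigh(H)
        top = np.argsort(-np.abs(vals))[:k]          # k leading eigenpairs by |eigenvalue|
        emb = vecs[:, top]
        feats = np.hstack([emb.real, emb.imag])      # embed C^k into R^{2k} for k-means
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(feats)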
Thursday, June 15, 17:00 ~ 17:30
Bilipschitz invariants
Dustin Mixon
The Ohio State University, USA
Machine learning algorithms are designed for data in Euclidean space. When naively representing data in a Euclidean space $V$, there is often a nontrivial group $G$ of isometries such that different members of a common $G$-orbit represent the same data point. To properly model such data, we want to map the set $V/G$ of orbits into Euclidean space in a way that is bilipschitz in the quotient metric. In this talk, we show that $G$ needs to be pretty special for there to exist a polynomial invariant that is bilipschitz, and so we need to move beyond classical invariant theory to solve our problem. We find optimal bilipschitz embeddings in various settings, and we conclude with several open problems.
Joint work with Jameson Cahill (University of North Carolina Wilmington) and Joseph W. Iverson (Iowa State University).
Thursday, June 15, 17:30 ~ 18:00
Near-Optimal Bounds for Generalized Orthogonal Procrustes Problem via Generalized Power Method
Shuyang Ling
New York University Shanghai, China
Given multiple point clouds, how do we find the rigid transforms (rotation, reflection, and shifting) such that these point clouds are well aligned? This problem, known as the generalized orthogonal Procrustes problem (GOPP), has found numerous applications in statistics, computer vision, and imaging science. While a commonly used approach is to compute the least squares estimator, obtaining it exactly is in general NP-hard due to the notorious nonconvexity. In this work, we apply the semidefinite programming (SDP) relaxation and the generalized power method to solve this generalized orthogonal Procrustes problem. In particular, we assume the data are generated from a signal-plus-noise model: each observed point cloud is a noisy copy of the same unknown point cloud transformed by an unknown orthogonal matrix and corrupted by additive Gaussian noise. We show that the generalized power method (equivalently, an alternating minimization algorithm) with spectral initialization converges to the unique global optimum of the SDP relaxation, provided that the signal-to-noise ratio is high. Moreover, this limiting point is exactly the least squares estimator and also the maximum likelihood estimator. Our theoretical bound is near-optimal in terms of the information-theoretic limit (loose only by a factor of the dimension and a log factor). Our results significantly improve the state-of-the-art results on the tightness of the SDP relaxation for the generalized orthogonal Procrustes problem, an open problem posed by Bandeira, Khoo, and Singer in 2014.
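For orientation, the sketch below spells out a generalized power method / alternating minimization loop for this model in numpy: each point cloud A_i (a d x n array) is assumed to satisfy A_i = O_i A + noise, and the orthogonal updates use the classical Procrustes (polar factor) solution. The identity initialization stands in for the spectral initialization analyzed in the talk, so this is illustrative only.

    import numpy as np

    def polar(M):
        # Projection onto the orthogonal group (polar factor via SVD).
        U, _, Vt = np.linalg.svd(M)
        return U @ Vt

    def generalized_power_method(clouds, n_iter=50):
        # clouds: list of d x n arrays, each a rotated/reflected noisy copy of a common cloud A.
        d = clouds[0].shape[0]
        O = [np.eye(d) for _ in clouds]              # placeholder initialization
        for _ in range(n_iter):
            A_hat = np.mean([Oi.T @ Ai for Oi, Ai in zip(O, clouds)], axis=0)
            O = [polar(Ai @ A_hat.T) for Ai in clouds]   # per-cloud Procrustes step
        return O, A_hat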
Friday, June 16, 14:30 ~ 15:00
The Monge Gap, A Regularizer to Learn Optimal Transport Maps
Marco Cuturi
Apple / CREST-ENSAE, France
Optimal transport (OT) theory has been used in machine learning to study and characterize maps that can efficiently push forward one probability measure onto another. Recent works have drawn inspiration from Brenier's theorem, which states that when the ground cost is the squared-Euclidean distance, the "best" map to morph a continuous measure in $\mathcal{P}(\mathbb{R}^d)$ into another must be the gradient of a convex function. To exploit that result, Makkuva et al. 2020 and Korotin et al. 2020 consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos et al. 2017, and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$, the need to approximate the conjugate of $f_\theta$, and the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier's result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e.g., the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$, and show how our simple pipeline significantly outperforms other baselines in practice.
https://arxiv.org/abs/2302.04953
Joint work with Théo Uscidda (CREST-ENSAE).
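For reference, and up to notation and normalization in the paper above, the Monge gap of a map $T$ with respect to a cost $c$ and a reference measure $\rho$ can be written as $$\mathcal{M}^c_{\rho}(T) = \int c\big(x, T(x)\big)\, d\rho(x) - \mathrm{OT}_c\big(\rho,\, T\sharp\rho\big),$$ a nonnegative quantity that vanishes exactly when $T$ moves $\rho$ onto $T\sharp\rho$ at optimal cost. The estimation pipeline then schematically minimizes $\mathrm{S}_\varepsilon(T\sharp\mu, \nu) + \lambda\, \mathcal{M}^c_\rho(T)$ over unconstrained maps $T$, where $\mathrm{S}_\varepsilon$ denotes a Sinkhorn divergence and $\lambda > 0$ a regularization weight.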
Friday, June 16, 15:00 ~ 15:30
Shifted divergences for sampling, privacy, and beyond
Jason Altschuler
NYU / UPenn, USA
Shifted divergences provide a principled way of making information theoretic divergences (e.g. KL) geometrically aware via optimal transport smoothing. In this talk, I will argue that shifted divergences provide a powerful approach towards unifying optimization, sampling, differential privacy, and beyond. For concreteness, I will demonstrate these connections via three recent highlights. (1) The fastest high-accuracy algorithm for sampling from log-concave distributions. (2) Resolving the mixing time of the Langevin Algorithm to its stationary distribution for log-concave sampling. (3) Resolving the differential privacy of Noisy-SGD, the standard algorithm for private optimization in both theory and practice. A recurring theme is a certain notion of algorithmic stability, and the central technique for establishing this is shifted divergences.
Joint work with Kunal Talwar (Apple) and Sinho Chewi (MIT).
Friday, June 16, 15:30 ~ 16:00
Bilevel optimization for machine learning
Pierre Ablin
Apple, France
Bilevel optimization, the problem of minimizing a value function that involves the arg-minimum of another function, appears in many practically important areas of machine learning, such as hyperparameter search or implicit deep learning.
This talk aims to describe machine learning problems in which bilevel optimization naturally appears, explain the differences with classical single-level optimization, and give an overview of recent algorithmic advances to solve the problem at scale.
Joint work with Mathieu Dagréou (Inria), Thomas Moreau (Inria), Samuel Vaiter (CNRS), Zaccharie Ramzi (ENS) and Gabriel Peyré (CNRS, ENS).
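In symbols, the problem in question reads $$\min_{x} \; F\big(x, z^\star(x)\big) \quad \text{where} \quad z^\star(x) \in \operatorname*{arg\,min}_{z} \; G(x, z),$$ with hyperparameter search as a standard instance: $x$ collects the hyperparameters, $G$ is the (regularized) training loss and $F$ the validation loss evaluated at the trained weights $z^\star(x)$. The notation here is generic, kept only to fix ideas, rather than taken from a specific paper.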
Friday, June 16, 16:30 ~ 17:30
Information theory through kernel methods
Francis Bach
Inria - Ecole Normale Supérieure, France
Estimating and computing entropies of probability distributions are key computational tasks throughout data science. In many situations, the underlying distributions are only known through the expectation of some feature vectors, which has led to a series of works within kernel methods. In this talk, I will explore the particular situation where the feature vector is a rank-one positive definite matrix, and show how the associated expectations (a covariance matrix) can be used with information divergences from quantum information theory to draw direct links with the classical notions of Shannon entropies.
Saturday, June 17, 15:00 ~ 15:30
Neural Networks as Sparse Local Lipschitz Functions: Robustness and Generalization
Jeremias Sulam
Johns Hopkins University, United States
The theoretical underpinnings of deep neural networks continue to be poorly understood, despite their overwhelming role in top-performing algorithms in a plethora of applications. In this talk we will explore the benefits of studying deep ReLU networks from the perspective of sparse local Lipschitz functions. We will first show how characterizing the local neighborhoods where these functions are locally Lipschitz, while preserving a certain degree of sparsity, allows us to develop tighter robustness certificates against input perturbations. We will then show how to use similar tools to provide tighter non-uniform, and non-vacuous, generalization bounds that scale better with width.
Joint work with Ramchandran Muthukumar (Johns Hopkins University).
Saturday, June 17, 15:30 ~ 16:00
Understanding Deep Representation Learning via Neural Collapse
Qing Qu
University of Michigan, United States
Recently, an intriguing phenomenon in the final stages of network training has been discovered and has attracted great interest: the last-layer features and classifiers collapse to simple but elegant mathematical structures, in which all training inputs are mapped to class-specific points in feature space and the last-layer classifier converges to the dual of the features' class means while attaining the maximum possible margin. This phenomenon, dubbed Neural Collapse, persists across a variety of different network architectures, datasets, and even data domains. Moreover, a progressive neural collapse occurs from shallow to deep layers. This talk leverages the symmetry and geometry of Neural Collapse and develops a rigorous mathematical theory to explain when and why it happens under the so-called unconstrained feature model. Based upon this, we show how it can be used to provide guidelines to understand and improve transferability with more efficient fine-tuning.
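In its standard formulation, and roughly speaking, the structure in question is the following: writing $\bar h_k$ for the globally centered class-mean feature of class $k$ out of $K$, within-class variability vanishes and the normalized class means form a simplex equiangular tight frame, $$\frac{\langle \bar h_k, \bar h_{k'}\rangle}{\|\bar h_k\|\,\|\bar h_{k'}\|} \longrightarrow \begin{cases} 1, & k = k',\\ -\frac{1}{K-1}, & k \neq k', \end{cases}$$ with the rows of the last-layer classifier aligning, up to scaling, with these class means.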
Saturday, June 17, 16:30 ~ 17:30
Robust regression revisited
Po-Ling Loh
Cambridge, United Kingdom
This talk will discuss two projects on robust estimation in the presence of contaminated data, bringing new ideas beyond the framework of traditional M-estimation.
In the first part of the talk, we study the problem of linear regression in a setting where both the covariates and responses may be heavy-tailed and/or adversarially contaminated. We show how to modify the Huber regression estimator by first applying an appropriate "filtering" procedure to the data based on the covariates, and show that in low-dimensional settings, the filtered Huber regression estimator achieves near-optimal error rates. We further show that the commonly used least trimmed squares and least absolute deviation estimators may similarly be made robust to contaminated covariates via the same covariate filtering step.
In the second part of the talk, we study a variant of Newton's method for robust empirical risk minimization, where at each iteration of the optimization algorithm, we replace the gradient and Hessian of the objective function by robust estimators taken from the literature on robust mean estimation for multivariate data. After proving a general theorem about the convergence of successive iterates to a small ball around the population-level minimizer, we study consequences of our theory in generalized linear models, when data are generated from Huber's epsilon-contamination model and/or heavy-tailed distributions. We also propose an algorithm for obtaining robust Newton directions based on the conjugate gradient method, which may be more appropriate for high-dimensional settings, and provide conjectures about the convergence of the resulting algorithm. Our algorithm enjoys the fast rates of convergence for successive iterates often achieved by second-order algorithms for convex problems, i.e., quadratic convergence in a neighborhood of the optimum, with a stepsize which may be chosen adaptively via backtracking line search.
Joint work with Ankit Pensia (IBM Research), Varun Jog (Cambridge), Eirini Ioannou (Edinburgh) and Muni Sreenivas Pydi (Paris Dauphine - PSL).
Posters
Naive imputation implicitly regularizes high-dimensional linear models
Alexis Ayme
LPSM, Sorbonne Université, France
Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the $d\gg \sqrt n$ regime. Experiments illustrate our findings.
Joint work with Claire Boyer (LPSM, Sorbonne Université), Aymeric Dieuleveut (CMAP, Ecole Polytechnique), Erwan Scornet (CMAP, Ecole Polytechnique).
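The pipeline studied in the paper can be sketched in a few lines of numpy; the step size, the single pass, and the least-squares loss below are illustrative choices, not the exact setting of the analysis.

    import numpy as np

    def zero_impute_asgd(X, y, mask, lr=0.01, seed=0):
        # mask[i, j] = True when X[i, j] is observed; missing entries are set
        # to zero (naive imputation), then averaged SGD is run on least squares.
        rng = np.random.default_rng(seed)
        Xz = np.where(mask, X, 0.0)
        n, d = Xz.shape
        w, w_bar = np.zeros(d), np.zeros(d)
        for t, i in enumerate(rng.permutation(n), start=1):
            grad = (Xz[i] @ w - y[i]) * Xz[i]        # stochastic least-squares gradient
            w -= lr * grad
            w_bar += (w - w_bar) / t                 # Polyak-Ruppert averaging
        return w_bar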
Exact Generalization Guarantees for (Regularized) Wasserstein Distributionally Robust Optimization
Waïss Azizian
LJK, Université Grenoble Alpes, France
Wasserstein distributionally robust estimators have emerged as powerful models for decision-making under uncertainty. These estimators provide a new kind of generalization guarantee: the robust objective w.r.t. the training distribution is, with high probability, an exact upper bound on the true risk. However, existing guarantees suffer from the curse of dimensionality in their dependence on the number of samples, are restricted to specific settings, or require a vanishing uncertainty radius and lead to spurious error terms. We show that the generalization bounds still hold for general classes of models with the correct dependence on the dimension, and furthermore allow for a non-vanishing radius to cover distribution shifts at testing. We also prove that these guarantees carry over to the newly introduced regularized versions of Wasserstein distributionally robust problems.
Joint work with Franck Iutzeler (LJK, Université Grenoble Alpes) and Jérôme Malick (LJK, CNRS, Université Grenoble Alpes).
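Schematically, the robust objective referred to above is $$\sup_{Q \,:\, W_c(Q,\, \widehat P_n) \le \rho} \; \mathbb{E}_{\xi \sim Q}\big[\ell(\theta, \xi)\big],$$ where $\widehat P_n$ is the empirical (training) distribution, $W_c$ a Wasserstein distance and $\rho$ the uncertainty radius; the generalization guarantee states that, with high probability, this quantity upper-bounds the true risk $\mathbb{E}_{\xi \sim P}[\ell(\theta, \xi)]$. The notation is generic and kept only to fix ideas.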
Some theoretical properties of physics-informed neural networks
Nathan Doumèche
Sorbonne Université, France
Physics-informed neural networks (PINNs) combine the expressiveness of neural networks with the interpretability of physical modeling. Their good practical performance has been demonstrated both in the context of solving partial differential equations and in the context of hybrid modeling, which consists of combining an imperfect physical model with noisy observations. However, most of their theoretical properties remain to be established. We show that classical training of these networks suffers from systematic overfitting. We then show that adding a ridge regularization to their empirical risk ensures the consistency of the resulting estimator. However, the convergence of PINNs to a solution that verifies the regularized physical constraint requires a more involved analysis, which draws on tools from functional analysis.
Joint work with Gérard Biau (Sorbonne Université) and Claire Boyer (Sorbonne Université).
Accelerating Stochastic Gradient Descent
Kanan Gupta
Texas A&M University, United States of America
Momentum-based gradient descent methods use information gained along the trajectory, in addition to the local information from the gradient, in order to achieve an accelerated rate of convergence. These methods have been well studied for convex optimization. Computing the gradient is often too expensive, and in practice it is approximated using stochastic gradient estimates. However, there is a lack of theoretical analysis of accelerated methods in the setting of stochastic gradient descent, even for the simple case of convex functions. We address this gap with a novel descent algorithm that provably achieves the optimal convergence rate for convex optimization. While the objective functions in deep learning training are non-convex, they share many properties with convex functions. Empirical results show that our algorithm outperforms the existing variants of stochastic gradient descent with momentum for training neural networks.
Joint work with Jonathan Siegel (Texas A&M University) and Stephan Wojtowytsch (Texas A&M University).
Langevin Quasi-Monte Carlo
Sifan Liu
Stanford University, USA
Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions, which are common in data science and machine learning applications. To sample from the distribution with density $\pi(x)\propto \exp(-U(x))$, LMC generates the next sample by taking a step in the gradient direction $\nabla U$ with a Gaussian perturbation. Expectations w.r.t. the target distribution $\pi$ are estimated by averaging over LMC samples. In ordinary Monte Carlo, it is well known that the estimation error can be substantially reduced by replacing independent random samples with quasi-random samples such as low-discrepancy sequences. In this work, we show that the estimation error of LMC can also be reduced by using quasi-random samples. Specifically, we propose to use completely uniformly distributed sequences with a certain low-discrepancy property to generate the Gaussian perturbations (and stochastic gradients). Under smoothness and convexity conditions, we prove that LMC with quasi-random samples achieves smaller errors than standard LMC. We provide rigorous theoretical analysis supported by compelling numerical experiments to demonstrate the effectiveness of our approach.
Joint work with Art Owen (Stanford University).
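For concreteness, here is a minimal numpy sketch of the (unadjusted) Langevin Monte Carlo iteration the abstract starts from; the quasi-random modification studied above amounts to replacing the pseudorandom Gaussian stream z below by Gaussians driven by a completely uniformly distributed, low-discrepancy sequence.

    import numpy as np

    def langevin_monte_carlo(grad_U, x0, step, n_steps, seed=0):
        # Unadjusted Langevin algorithm targeting pi(x) proportional to exp(-U(x)):
        #   x_{k+1} = x_k - step * grad_U(x_k) + sqrt(2 * step) * z_k,  z_k ~ N(0, I).
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float).copy()
        samples = np.empty((n_steps, x.size))
        for k in range(n_steps):
            z = rng.standard_normal(x.shape)
            x = x - step * grad_U(x) + np.sqrt(2.0 * step) * z
            samples[k] = x
        return samples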
Two-timescale regime for global convergence of neural networks
Pierre Marion
Sorbonne Université, France
We study the training dynamics of shallow neural networks in a two-timescale regime in which the stepsizes for the inner layer are much smaller than those for the outer layer. In this regime, we prove global convergence of the gradient flow to the optimum of the non-convex optimization problem in a simple univariate setting. The number of neurons need not be asymptotically large for our result to hold, which distinguishes it from popular recent approaches such as the neural tangent kernel or mean-field regimes. Experimental illustration is provided, showing that stochastic gradient descent behaves according to our description of the gradient flow and thus converges to the global minimum in the two-timescale regime, but can fail outside of this regime.
Joint work with Raphaël Berthier (EPFL, Switzerland).
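A minimal numpy sketch of the two-timescale setting, for a univariate shallow network $x \mapsto \sum_j a_j \tanh(w_j x)$; the ratio eps between the inner and outer stepsizes, the activation, and the network size are illustrative choices.

    import numpy as np

    def two_timescale_sgd(xs, ys, n_neurons=20, lr=0.1, eps=1e-2,
                          n_steps=10_000, seed=0):
        # Inner-layer weights w move with stepsize eps * lr (slow timescale),
        # outer-layer weights a with stepsize lr (fast timescale).
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(n_neurons)
        a = rng.standard_normal(n_neurons)
        for _ in range(n_steps):
            i = rng.integers(len(xs))
            h = np.tanh(w * xs[i])
            err = a @ h - ys[i]
            a -= lr * err * h                                # fast (outer-layer) update
            w -= eps * lr * err * a * (1 - h ** 2) * xs[i]   # slow (inner-layer) update
        return w, a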
Near-optimal learning of Banach-valued, high-dimensional functions via deep neural networks
Sebastian Moraga Scheuermann
Simon Fraser University, Canada
The past decade has seen increasing interest in applying Deep Learning (DL) to Computational Science and Engineering (CSE). Driven by impressive results in applications such as computer vision, Uncertainty Quantification (UQ), genetics, simulations and image processing, DL is increasingly supplanting classical algorithms, and seems poised to revolutionize scientific computing. However, DL is not yet well understood from the standpoint of numerical analysis. Little is known about the efficiency and reliability of DL from the perspectives of stability, robustness, accuracy, and sample complexity. In particular, approximating solutions to parametric PDEs is an objective of UQ for CSE. Training data for such problems is often scarce and corrupted by errors. Moreover, the target function is a possibly infinite-dimensional smooth function taking values in the PDE solution space, generally an infinite-dimensional Banach space. This work provides arguments for Deep Neural Network (DNN) approximation of such functions, with both known and unknown parametric dependence, that overcome the curse of dimensionality. We establish "practical existence theorems" that describe classes of DNNs with dimension-independent architecture size and training procedures, based on minimizing the (regularized) $\ell^2$-loss, which achieve near-optimal algebraic rates of convergence. These results involve key extensions of compressed sensing for Banach-valued recovery and polynomial emulation with DNNs. When approximating solutions of parametric PDEs, our results account for all sources of error, i.e., sampling, optimization, approximation and physical discretization, and allow for training high-fidelity DNN approximations from coarse-grained sample data. Our theoretical results fall into the category of non-intrusive methods, providing a theoretical alternative to classical methods for high-dimensional approximation.
Joint work with Ben Adcock (Simon Fraser University), Simone Brugiapaglia (Concordia University) and Nick Dexter (Florida State University).
Trimmed sample means for robust uniform mean estimation and regression
Lucas Resende
IMPA (Instituto de Matemática Pura e Aplicada), Brazil
It is well-known that trimmed sample means are robust against heavy tails and data contamination. This poster presents the results in [1], where Oliveira and I analyzed the performance of trimmed means and related methods in two novel contexts. The first one consists of estimating expectations of functions in a given family, with uniform error bounds; this is closely related to the problem of estimating the mean of a random vector under a general norm. The second problem considered is that of regression with quadratic loss. In both cases, trimmed-mean-based estimators are the first to obtain optimal dependence on the (adversarial) contamination level. Moreover, they also match or improve upon the state of the art in terms of heavy tails. Experiments with synthetic data show that a natural "trimmed mean linear regression" method often performs better than both ordinary least squares and alternative methods based on median-of-means.
[1] Oliveira, Roberto I. and Lucas Resende. “Trimmed sample means for robust uniform mean estimation and regression.” (2023).
Joint work with Roberto Imbuzeiro Oliveira (IMPA).
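For reference, the basic one-dimensional building block is easy to state in code; the uniform (family-wise) bounds, the vector-valued setting and the regression estimator from [1] require substantially more than this sketch, and the trimming level eps below is a free parameter.

    import numpy as np

    def trimmed_mean(x, eps):
        # Drop the eps-fraction of smallest and largest observations, average the rest.
        x = np.sort(np.asarray(x, dtype=float))
        k = int(np.ceil(eps * len(x)))
        return x[k:len(x) - k].mean() if len(x) > 2 * k else float(np.median(x))

Uniform mean estimation then amounts to applying such an estimator to $f(X_1),\dots,f(X_n)$ simultaneously over all $f$ in the family, which is where the uniform error bounds of [1] enter.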
Convergence rates for Lipschitz learning on very sparse graphs
Tim Roith
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
This poster is devoted to proving continuum limits of Lipschitz learning, a graph-based semi-supervised learning method. Given a partial labeling on a finite data set equipped with a graph structure, the goal is to extend the labels to the whole data set without increasing their Lipschitz constant. In particular, we are interested in the continuum limit of this problem when the number of data points tends to infinity. This is first done via $\Gamma$-convergence of the graph Lipschitz constant functional to the functional $u\mapsto\|\nabla u\|_\infty$. The corresponding minimization problem is a Lipschitz extension task which does not admit a unique solution. We then consider absolutely minimizing Lipschitz extensions (AMLEs), which are the unique solution of the infinity Laplace equation. We demonstrate that quantitative convergence rates of AMLEs on graphs to those in the continuum can be derived from a certain ratio convergence property of the graph distance function. This is then applied to arbitrary geometric graphs as well as to homogeneous Poisson point processes. In both cases, our techniques yield convergence rates for very sparsely connected graphs.
Joint work with Leon Bungert (Technische Universität Berlin, Germany) and Jeff Calder (University of Minnesota, United States).
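To fix ideas, on an unweighted graph the AMLE can be computed by the classical max-min averaging iteration for the graph infinity-Laplace equation, as in the sketch below; weighted graphs, the ratio-convergence condition and the rates discussed in the poster go beyond this simplification.

    import numpy as np

    def amle_unweighted(neighbors, labels, n_iter=10_000, tol=1e-8):
        # neighbors: list of integer arrays (adjacency lists);
        # labels: dict {node: value} giving the partial labeling.
        n = len(neighbors)
        u = np.zeros(n)
        for i, v in labels.items():
            u[i] = v
        free = [i for i in range(n) if i not in labels]
        for _ in range(n_iter):
            change = 0.0
            for i in free:
                nb = u[neighbors[i]]
                new = 0.5 * (nb.max() + nb.min())    # discrete infinity-harmonic update
                change = max(change, abs(new - u[i]))
                u[i] = new
            if change < tol:
                break
        return u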
Regularization properties of dropout
Anna Shalova
Eindhoven University of Technology, Netherlands
Generalization is a crucial aspect of training algorithms in machine learning. Dropout training has been empirically shown to improve generalization of different models including neural networks and generalized linear models. In this work, we give a theoretical explanation of this phenomenon. We introduce a time-continuous analog of dropout gradient descent called Ornstein-Uhlenbeck dropout and study its behavior in the small noise limit. We obtain an effective limit model, in which the regularization term induced by dropout is explicit.
Joint work with Mark Peletier (Eindhoven University of Technology, Netherlands) and André Schlichting (WWU Münster, Germany).
Identifiability of Gaussian Mixtures from their sixth moments
Alexander Taveira Blomenhofer
CWI, Amsterdam, The Netherlands
I give a partial answer to an identifiability question from algebraic statistics: when do the truncated moments of a Gaussian mixture distribution uniquely determine the parameters (means and covariance matrices)? We will see that, generically, a mixture of at most $m=\Theta(n^4)$ Gaussian distributions in $n$ variables is uniquely determined by its moments of degree 6. The proof relies on recent advances in both the theory of secant varieties and Fröberg's conjecture, which jointly allow us to reduce identifiability to a simple combinatorial problem. I show that the resulting Gaussian moment variety of degree 6 is identifiable up to rank $m=\Theta(n^4)$, which asymptotically matches the bound obtained from counting parameters, with the hidden constant being optimal.
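For context, the parameter count behind the last sentence is roughly the following: a mixture of $m$ Gaussians in $n$ variables has on the order of $m\big(n + \binom{n+1}{2}\big) \sim m n^2/2$ parameters (weights, means and covariances), while the degree-6 moments form a symmetric tensor with $\binom{n+5}{6} \sim n^6/720$ entries, so identifiability from these moments can hold at most up to $m = O(n^4)$; the result above shows that this order is generically attained.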
Conformal Prediction with Missing Values
Margaux Zaffran
INRIA and Ecole Polytechnique, France
Conformal prediction is a theoretically grounded framework for constructing predictive intervals. We study conformal prediction with missing values in the covariates -- a setting that brings new challenges to uncertainty quantification. We first show that the marginal coverage guarantee of conformal prediction holds on imputed data for any missingness distribution and almost all imputation functions. However, we emphasize that the average coverage varies depending on the pattern of missing values: conformal methods tend to construct prediction intervals that under-cover the response conditionally on some missing patterns. This motivates our novel generalized conformalized quantile regression framework, missing data augmentation, which yields prediction intervals that are valid conditionally on the pattern of missing values, despite their exponential number. We then show that a universally consistent quantile regression algorithm trained on the imputed data is Bayes optimal for the pinball risk, thus achieving valid coverage conditionally on any given data point. Moreover, we examine the case of a linear model, which demonstrates the importance of our proposal in overcoming the heteroskedasticity induced by missing values. Using synthetic and real data from critical care, we corroborate our theory and report improved performance of our methods.
Joint work with Aymeric Dieuleveut (Ecole Polytechnique, France), Julie Josse (INRIA, France) and Yaniv Romano (Technion - Israel Institute of Technology, Israel).
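As a baseline illustration of the first result (marginal coverage on imputed data), here is a minimal split-conformal sketch in numpy/scikit-learn; it uses mean imputation and absolute residual scores rather than the conformalized quantile regression and missing data augmentation developed in the paper, and the regressor is a placeholder.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def split_conformal_imputed(X, y, mask, X_new, mask_new, alpha=0.1, seed=0):
        # mask[i, j] = True when the entry is observed; missing entries are
        # replaced by the training column means before fitting and calibrating.
        rng = np.random.default_rng(seed)
        col_means = np.nanmean(np.where(mask, X, np.nan), axis=0)

        def impute(Z, M):
            return np.where(M, Z, col_means)

        Xi, Xn = impute(X, mask), impute(X_new, mask_new)
        idx = rng.permutation(len(y))
        fit, cal = idx[: len(y) // 2], idx[len(y) // 2:]
        model = GradientBoostingRegressor(random_state=seed).fit(Xi[fit], y[fit])
        scores = np.abs(y[cal] - model.predict(Xi[cal]))
        level = min(1.0, np.ceil((1 - alpha) * (len(cal) + 1)) / len(cal))
        q = np.quantile(scores, level)
        pred = model.predict(Xn)
        return pred - q, pred + q                   # marginal (1 - alpha) coverage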