Robustness of machine learning models to various adversarial and non-adversarial corruptions continues to be of interest. In this paper, we introduce the notion of the boundary thickness of a classifier, and we describe its connection with and usefulness for model robustness. Thick decision boundaries lead to improved performance, while thin decision boundaries lead to overfitting (e.g., measured by the robust generalization gap between training and testing) and lower robustness. We show that a thicker boundary helps improve robustness against adversarial examples (e.g., improving the robust test accuracy of adversarial training), as well as so-called out-of-distribution (OOD) transforms, and we show that many commonly-used regularization and data augmentation procedures can increase boundary thickness. On the theoretical side, we establish that maximizing more…

**Authors:** Yaoqing Yang, Rajiv Khanna, Yaodong Yu, Amir Gholaminejad, Kurt Keutzer, Joseph Gonzalez, Kannan Ramchandran, Michael Mahoney

Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hes- sian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably more…

**Authors:** Zhewei Yao, Amir Gholaminejad, Michael Mahoney

We release new benchmarks in the use of deep reinforcement learning (RL) to create controllers for mixed-autonomy traffic, where connected and autonomous vehicles (CAVs) interact with human drivers and infrastructure. Benchmarks, such as Mujoco or the Arcade Learning Environment, have spurred new research by enabling researchers to effectively compare their results so that they can focus on algorithmic improvements and control techniques rather than system design. To promote similar advances in traffic control via RL, we propose four benchmarks, based on three new traffic scenarios, illustrating distinct reinforcement learning problems with applications to mixed-autonomy traffic. We provide an introduction to each control problem, an overview of their MDP structures, and preliminary performance results from commonly used RL algorithms. For the purpose more…

**Authors:** Eugene Vinitsky, Aboudy Kriedieh, Luc Le Flem, Nishant Kheterpal, Kathy Jang, Cathy Wu, Richard Liaw, Eric Liang, Alexandre Bayen

Parametrized Hierarchical Procedures (PHP) represent a program as a hierarchy of procedures that call each other, each implemented by a neural network. We develop an algorithm for training PHPs from a set of supervisor demonstrations, only few of which are annotated with the internal call structure.

**Authors:** Roy Fox, Eui Chul (Richard) Shin, Sanjay Krishnan, Kenneth Goldberg, Dawn song, Ion Stoica

DDCO learns form demonstrations continuous control skills parametrized by deep networks, their termination condition, and high-level policies to use them.

**Authors:** Sanjay Krishnan, Roy Fox, Ion Stoica, Kenneth Goldberg

In this paper we study principal components analysis in the regime of high dimensionality and high noise. Our model of the problem is a rank-one deformation of a Wigner matrix where the signal-to-noise ratio (SNR) is of constant order, and we are interested in the fundamental limits of detection of the spike. Our main goal is to gain a fine understanding of the asymptotics for the log-likelihood ratio process, also known as the free energy, as a function of the SNR. Our main results are twofold. We first prove that the free energy has a finite-size correction to its limit—the replica-symmetric formula—which we explicitly compute. This provides a formula for the Kullback-Leibler divergence between the planted and null models. Second, more…

**Authors:** Ahmed El Alaoui, Florent Krzakala, Michael Jordan

Standard methods for multi-variate time series analysis are hampered by sampling at random timestamps, long range dependencies , and the scale of the data. In this paper we present a novel estimator for cross-covariance of randomly observed time series which identifies the dynamics of an unobserved stochastic process. We analyze the statistical properties of our estimator without the assumption that observation timestamps are independent from the process of interest and show that our solution does not suffer from the corresponding issues affecting standard estimators for cross-covariance. We implement and evaluate our statistically sound and scalable approach in the distributed setting using Apache Spark and demonstrate its ability to identify interactions between processes on simulations and financial data with tens of millions of samples. Pdf: Aistats_camera_ready

**Authors:** 66, Joseph Gonzalez, Evan Sparks, Alexandre M. Bayen

We consider the problem of decoding a discrete signal of categorical variables from the observation of several histograms of pooled subsets of it. We present an Approximate Message Passing (AMP) algorithm for recovering the signal in the random dense setting where each observed histogram involves a random subset of entries of size proportional to n. We characterize the performance of the algorithm in the asymptotic regime where the number of observations m tends to infinity proportionally to n, by deriving the corresponding State Evolution (SE) equations and studying their dynamics. We initiate the analysis of the multi-dimensional SE dynamics by proving their convergence to a fixed point, along with some further properties of the iterates. The analysis reveals sharp phase more…

**Authors:** Aaditya Ramdas, Ahmed El Alaoui, Michael Jordan, Florent Krzakala, Lenka Zdeborova

Consider a population consisting of n individuals, each of whom has one of d types (e.g. their blood type, in which case d=4). We are allowed to query this database by specifying a subset of the population, and in response we observe a noiseless histogram (a d-dimensional vector of counts) of types of the pooled individuals. This measurement model arises in practical situations such as pooling of genetic data and may also be motivated by privacy considerations. We are interested in the number of queries one needs to unambiguously determine the type of each individual. In this paper, we study this information-theoretic question under the random, dense setting where in each query, a random subset of individuals of size proportional more…

**Authors:** Ahmed El Alaoui, Aaditya Ramdas, Michael Jordan, Florent Krzakala, Lenka Zdeborova

Kernel methods provide an attractive framework for aggregating and learning from ranking data, and so understanding the fundamental properties of kernels over permutations is a question of broad interest. We provide a detailed analysis of the Fourier spectra of the standard Kendall and Mallows kernels, and a new class of polynomial-type kernels. We prove that the Kendall kernel has exactly two irreducible representations at which the Fourier transform is non-zero, and moreover, the associated matrices are rank one. This implies that the Kendall kernel is nearly degenerate, with limited expressive and discriminative power. In sharp contrast, we prove that the Fourier transform of the Mallows kernel is a strictly positive definite matrix at all irreducible representations. This property guarantees that more…

**Authors:** Horia Mania, Aaditya Ramdas, Martin J. Wainwright, Michael Jordan, Benjamin Recht