### Monday, May 1st, 2017

How can we combine the complementary strengths of probabilistic graphical models and neural networks? We compose latent graphical models with neural network observation likelihoods. For inference, we use recognition networks to produce local evidence potentials, which are then combined using efficient message-passing algorithms. All components are trained simultaneously with a single stochastic variational inference objective. We use this framework to automatically segment and categorize mouse behavior from raw depth video.
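This inference pattern can be sketched concretely for a discrete latent chain. The snippet below is an illustrative stand-in, not the actual model from the talk: random per-frame log-potentials play the role of recognition-network outputs, and a forward (filtering) recursion combines them with an HMM prior via message passing.

```python
import numpy as np

def forward_messages(log_pi, log_A, log_evidence):
    """Forward (filtering) pass of an HMM: combine a latent Markov
    prior (log_pi, log_A) with arbitrary per-step evidence potentials,
    e.g. those produced by a recognition network."""
    T, K = log_evidence.shape
    log_alpha = np.zeros((T, K))
    log_alpha[0] = log_pi + log_evidence[0]
    for t in range(1, T):
        # log-sum-exp over the previous state, for numerical stability
        m = log_alpha[t - 1][:, None] + log_A
        log_alpha[t] = np.logaddexp.reduce(m, axis=0) + log_evidence[t]
    return log_alpha

rng = np.random.default_rng(0)
K, T = 3, 50
log_pi = np.log(np.full(K, 1.0 / K))
A = rng.dirichlet(np.ones(K), size=K)      # row-stochastic transition matrix
log_evidence = rng.normal(size=(T, K))     # stand-in for recognition-net outputs
log_alpha = forward_messages(log_pi, np.log(A), log_evidence)
posterior_last = np.exp(log_alpha[-1] - np.logaddexp.reduce(log_alpha[-1]))
print(posterior_last)                      # filtering distribution at the last step
```

Because the local potentials enter the recursion additively in log space, the same message-passing code works whether they come from a likelihood model or from a learned recognition network.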

For scientific applications involving large and complex data, it is useful to develop probability models and inference methods, but computational speed and stability become significant concerns. In such settings accurate uncertainty quantification is critical, and common fast approximations often do a poor job of UQ. In this talk, I discuss recent developments in scaling up sampling algorithms to big problems, considering broad classes of approximate and parallelizable MCMC algorithms with corresponding theoretical guarantees. A particular focus will be on data augmentation approaches. A variety of applications will be provided as motivation.
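As one concrete instance of a parallelizable sampling scheme (a sketch of the general idea, not necessarily the speaker's method), an embarrassingly parallel "consensus" strategy runs a sampler on each shard of the data and combines the subposterior draws by precision weighting. For a Gaussian mean with a flat prior the combination is exact:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
data = rng.normal(loc=3.0, scale=sigma, size=10_000)

# Split the data across "machines"; each samples its own subposterior.
shards = np.array_split(data, 10)
draws, weights = [], []
for shard in shards:
    n = len(shard)
    # Subposterior for a Gaussian mean with flat prior: N(shard mean, sigma^2/n)
    sub = rng.normal(shard.mean(), sigma / np.sqrt(n), size=5_000)
    draws.append(sub)
    weights.append(n / sigma**2)           # precision weight for this shard

# Consensus combination: precision-weighted average of subposterior draws.
draws = np.stack(draws)                    # shape (shards, samples)
w = np.array(weights)[:, None]
combined = (w * draws).sum(axis=0) / w.sum()
print(combined.mean(), combined.std())     # matches the full posterior N(x_bar, sigma^2/N)
```

For non-Gaussian subposteriors the weighted average is only an approximation, which is one reason accurate uncertainty quantification in such schemes requires care.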

Many empirical questions about machine learning systems are best answered through the eigenvalues of associated matrices. For example, the Hessian of the loss function for a large deep network gives us insight into the difficulty of optimization and sensitivity to parameters. The spectra of adjacency and Laplacian matrices of large graphs help us understand the global structure between vertices. Unfortunately, in the most interesting situations, these matrices are too large to be explicitly instantiated, to say nothing of diagonalizing them directly; rather they are implicit matrices, which can only be interrogated via matrix-vector products. In this work I will discuss how several different randomized estimation tricks can be assembled to construct unbiased estimators of the spectra of large implicit matrices via generalized traces.

### Tuesday, May 2nd, 2017

### Wednesday, May 3rd, 2017

We present a model for clustering which combines two criteria: given a collection of objects with a pairwise similarity measure, the problem is to find a cluster that is as dissimilar as possible from the complement, while having as much similarity as possible within the cluster. The two objectives are combined either as a ratio or with linear weights. The ratio problem, and its linear weighted version, are solved by a combinatorial algorithm within the complexity of a single minimum s,t-cut algorithm. This problem (HNC) is closely related to the NP-hard problem of normalized cut, which is often used in image segmentation and for which heuristic solutions are generated with the eigenvector technique (spectral method).
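To make the ratio objective concrete, here is a brute-force illustration on a toy similarity matrix. The point of the talk is precisely that this exponential search can be replaced by a single minimum s,t-cut; the exhaustive version below is only for exposition:

```python
import numpy as np
from itertools import combinations

def hnc_ratio(W, S):
    """Ratio objective: similarity crossing the cut divided by
    similarity inside the cluster (smaller is better)."""
    S = set(S)
    n = len(W)
    cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
    within = sum(W[i][j] for i in S for j in S if i < j)
    return cut / within if within > 0 else float("inf")

def best_cluster(W):
    """Exhaustive search over nonempty proper subsets -- exponential,
    whereas the talk's combinatorial algorithm needs one min s,t-cut."""
    n = len(W)
    best, best_S = float("inf"), None
    for k in range(1, n):
        for S in combinations(range(n), k):
            r = hnc_ratio(W, S)
            if r < best:
                best, best_S = r, S
    return best_S, best

# Two dense blocks, weakly connected to each other.
W = np.array([
    [0, 5, 5, 1, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 0, 1, 0],
    [1, 0, 0, 0, 5, 5],
    [0, 0, 1, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
])
S, ratio = best_cluster(W)
print(S, ratio)  # recovers one of the two dense blocks
```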

HNC has been used as both a supervised and an unsupervised machine learning technique. In an extensive comparative study, HNC was compared to leading machine learning techniques on benchmark datasets. The study indicated that HNC and other similarity-based algorithms provide robust performance and improved accuracy as compared to other techniques. Furthermore, the technique of "sparse computation" enables the scalability of HNC and similarity-based algorithms to large-scale data sets.

The development of algorithms for hierarchical clustering has been hampered by a shortage of precise objective functions. To help address this situation, we introduce a simple cost function on hierarchies over a set of points, given pairwise similarities between those points. We show that this criterion behaves sensibly in canonical instances and that it admits a top-down construction procedure with a provably good approximation ratio.
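The cost function in question charges every similar pair the size of the smallest cluster in the hierarchy that contains both, so a good hierarchy separates dissimilar points early and keeps similar points together until deep in the tree. A small sketch (the nested-tuple tree representation and helper names are my own):

```python
from itertools import combinations

def leaves(tree):
    """Leaves of a hierarchy given as nested tuples with integer leaves."""
    return [tree] if not isinstance(tree, tuple) else [
        x for child in tree for x in leaves(child)]

def cost(tree, w):
    """Each similar pair pays the number of leaves under the node where
    it is first split, so similar pairs should be split low in the tree."""
    if not isinstance(tree, tuple):
        return 0
    total = sum(cost(child, w) for child in tree)
    size = len(leaves(tree))
    # pairs split at this node: one leaf from each pair of child subtrees
    for a, b in combinations(tree, 2):
        for i in leaves(a):
            for j in leaves(b):
                total += w.get((min(i, j), max(i, j)), 0) * size
    return total

# Similarities: {0,1} and {2,3} are the tight pairs.
w = {(0, 1): 10, (2, 3): 10, (0, 2): 1, (1, 3): 1}
good = ((0, 1), (2, 3))       # merges the tight pairs first
bad = ((0, 2), (1, 3))        # splits them at the leaves
print(cost(good, w), cost(bad, w))  # the good hierarchy is cheaper
```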

We show, moreover, that this procedure lends itself naturally to an interactive setting in which the user is repeatedly shown snapshots of the hierarchy and makes corrections to these.

Structured prediction is the task of jointly predicting correlated outputs; the core computational challenge is how to do so in time that is not exponential in the number of outputs without making assumptions on the correlation structure (such as linear chain assumptions). I will describe recent work we have done in developing novel algorithms based on an imitation learning view of the structured prediction problem, whose core goal is efficiency. I'll provide a unified analysis of a large family of approaches. I'll conclude with some recent work connecting these ideas to sequential models in deep learning, such as recurrent neural networks, and how to modify these approaches for a bandit feedback setting.

In learning reductions, you reduce to a simpler machine learning problem, apply a solver for that problem, and then use the solution to the simpler problem to solve the more complex one. Learning reductions have been used to create exponentially more efficient solutions to multiclass classification, contextual bandit learning and exploration, and active learning. I will discuss several families of learning reductions, their applications, and their limits.
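The simplest example of a learning reduction is one-against-all, which turns a k-class problem into k binary problems; the exponentially more efficient reductions mentioned in the abstract use tree-structured or error-correcting decompositions instead. A minimal sketch with a perceptron standing in for the binary oracle (any binary learner could be plugged in):

```python
import numpy as np

class Perceptron:
    """Stand-in binary oracle with labels in {-1, +1}."""
    def fit(self, X, y, epochs=20):
        self.w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ self.w) <= 0:   # mistake-driven update
                    self.w += yi * xi
        return self
    def score(self, X):
        return X @ self.w

def one_vs_all(X, y, num_classes, oracle=Perceptron):
    """Reduction: binary learner k separates class k from the rest;
    at test time, predict the class whose learner scores highest."""
    models = [oracle().fit(X, np.where(y == k, 1.0, -1.0))
              for k in range(num_classes)]
    return lambda Xq: np.argmax(
        np.stack([m.score(Xq) for m in models]), axis=0)

rng = np.random.default_rng(0)
centers = np.array([[4, 0], [0, 4], [-4, -4]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))
X = np.hstack([X, np.ones((300, 1))])      # constant bias feature
predict = one_vs_all(X, y, 3)
print((predict(X) == y).mean())            # high accuracy on separable data
```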

Taking place at Banatao Auditorium, Sutardja Dai Hall

Can machines think? Philosophy and science have long explored this question. Throughout the 20th century, attempts were made to link this question to the latest discoveries -- Goedel's theorem, Quantum Mechanics, undecidability, computational complexity, cryptography etc. Starting in the 1980s, a long body of work led to the conclusion that many interesting approaches—even modest ones—towards achieving AI were computationally intractable, meaning NP-hard or similar. One could interpret this body of work as a "complexity argument against AI."

But in recent years, empirical discoveries have undermined this argument, as computational tasks hitherto considered intractable turn out to be easily solvable on very large-scale instances. Deep learning is perhaps the most famous example.

This talk revisits the above-mentioned complexity argument against AI and explains why it may not be an obstacle in reality. We survey methods used in recent years to design provably efficient (polynomial-time) algorithms for a host of intractable machine learning problems under realistic assumptions on the input. Some of these can be seen as algorithms to extract semantics or meaning out of data.

Light refreshments will be served before the lecture at 3:30 p.m.

### Thursday, May 4th, 2017

Many big data analytics problems are intrinsically complex and hard, making the design of effective and scalable algorithms very challenging. Domain experts need to perform extensive research and experiment through much trial and error in order to craft approximation or heuristic schemes that meet the dual goals of effectiveness and scalability. Very often, restricted assumptions about the data, which are likely to be violated in the real world, are made in order for the algorithms to work and attain performance guarantees. Furthermore, previous algorithm design paradigms seldom systematically exploit a common trait of real-world problems: instances of the same type of problem are solved repeatedly on a regular basis, differing only in their data. Is there a better way to design effective and scalable algorithms for big data analytics?

I will introduce a framework for addressing this challenge based on the idea of embedding algorithm steps into nonlinear spaces, and learning these embedded algorithms from problem instances via either direct supervision or reinforcement learning. In contrast to traditional algorithm design, where every step in an algorithm is prescribed by experts, the embedding design delegates some difficult algorithm choices to nonlinear learning models so as to avoid large memory requirements, restricted assumptions on the data, or limited design-space exploration. I will illustrate the benefit of this new design framework using large scale real world data, including a materials discovery problem, a recommendation problem over dynamic information networks, and a problem of learning combinatorial algorithms over graphs. The learned algorithms can reduce memory usage and runtime by orders of magnitude, and sometimes result in drastic improvement in predictive performance.

We consider the fundamental statistical problem of robustly estimating the mean and covariance of a distribution from i.i.d. samples in n-dimensional space, in the presence of arbitrary (malicious) noise, with no assumption on the noise. Classical solutions (Tukey point, geometric median) are either intractable to compute efficiently or produce estimates whose error scales as a power of the dimension. In this talk, we present efficient and practical algorithms to compute estimates for the mean, covariance, and operator norm of the covariance matrix, for a wide class of distributions. We show that the error of the algorithm's estimate exceeds the information-theoretic lower bound by a factor of at most the square root of the logarithm of the dimension. This gives polynomial-time solutions to some basic questions studied in robust statistics. One of the applications is agnostic SVD.
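One widely used idea in this line of work is spectral filtering: coordinated outliers inflate the top eigenvalue of the empirical covariance, so one can repeatedly prune the points that deviate most along the top eigenvector. The sketch below illustrates that general idea, not the specific algorithm of the talk; the threshold `tol` is tuned for this toy data:

```python
import numpy as np

def filtered_mean(X, eps, tol=1.5):
    """Spectral-filtering sketch for robust mean estimation: while the
    empirical covariance has an abnormally large top eigenvalue (a
    signature of coordinated outliers), discard the point deviating
    most along the corresponding eigenvector."""
    X = X.copy()
    max_removals = int(2 * eps * len(X))
    for _ in range(max_removals):
        mu = X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(X.T))
        if vals[-1] <= tol:                # spectrum looks clean; stop
            break
        proj = (X - mu) @ vecs[:, -1]
        X = np.delete(X, np.argmax(np.abs(proj)), axis=0)
    return X.mean(axis=0)

rng = np.random.default_rng(0)
d, n, eps = 20, 2000, 0.1
inliers = rng.normal(size=(n, d))                   # true mean = 0
outliers = rng.normal(loc=2.0, size=(int(eps * n), d))
X = np.vstack([inliers, outliers])
naive = np.linalg.norm(X.mean(axis=0))
robust = np.linalg.norm(filtered_mean(X, eps))
print(naive, robust)                                # robust error is much smaller
```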

No abstract available.

I will talk about the polytope learning problem: given random samples from a polytope in n-dimensional space with poly(n) vertices, can we efficiently find an approximation to the polytope? (Information-theoretically, poly(n) samples suffice.) I will discuss some approaches and related conjectures for this problem, and also its relation to other problems in learning theory.

### Friday, May 5th, 2017

We consider the problems of estimation, learning, and optimization over a large dataset of which a subset consists of points drawn from the distribution of interest, and we make no assumptions on the remaining points---they could be well-behaved, extremely biased, or adversarially generated. We investigate this question via two models for studying robust estimation, learning, and optimization. One of these models, which we term the "semi-verified" model, assumes access to a second much smaller (typically constant-sized) set of "verified" or trusted data points that have been drawn from the distribution of interest. The key question in this model is how a tiny, but trusted dataset can allow for the accurate extraction of the information contained in the large, but untrusted dataset. The second model, "list-decodable learning", considers the task of returning a small list of proposed answers. Underlying this model is the question of whether the structure of "good" datapoints can be overwhelmed by the remaining data; surprisingly, the answer is often "no". We present several strong algorithmic results for these models, for a large class of mathematically clean and practically relevant robust estimation and learning tasks.
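A toy version of the semi-verified pipeline: cluster the untrusted data into a short list of candidate means (standing in for the list-decoding step, here approximated by k-means with farthest-point initialization), then use a handful of trusted points to pick from the list. All numbers and helper names are illustrative, not from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
true_mean = np.full(d, 1.0)

# Untrusted data: only 30% is drawn from the distribution of interest.
good = rng.normal(true_mean, 1.0, size=(300, d))
bad1 = rng.normal(-6.0, 1.0, size=(400, d))
bad2 = rng.normal(6.0, 1.0, size=(300, d))
untrusted = np.vstack([good, bad1, bad2])

# A handful of verified points from the true distribution.
verified = rng.normal(true_mean, 1.0, size=(5, d))

def candidate_list(X, k=4, iters=20):
    """List-decoding stand-in: k-means with farthest-point initialization
    returns a short list of candidate means; for well-separated clusters,
    one candidate lands near each cluster mean."""
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):                 # Lloyd's iterations
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

candidates = candidate_list(untrusted)
# The tiny trusted set disambiguates the candidate list.
pick = candidates[np.argmin(np.linalg.norm(candidates - verified.mean(axis=0), axis=1))]
print(np.linalg.norm(pick - true_mean))                    # small
print(np.linalg.norm(untrusted.mean(axis=0) - true_mean))  # large
```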

The talk is based on several joint works with Jacob Steinhardt and Moses Charikar, and with Michela Meister.

In the age of big data, communication is sometimes more of a bottleneck than computation. The main model of distributed computation used for machine learning on big data is the map-reduce model, which gives the programmer a transparent way to scale up computation to a large number of processors. One drawback of the map-reduce model is its reliance on boundary synchronization and on shuffling. Both of these mechanisms create a communication bottleneck that becomes more severe as more computers are added to the cluster. In this talk I propose a new model of computation which I call “tell me something new” (TMSN). TMSN is an asynchronous model of computation which optimizes communication. Rather than communicating at times dictated by a common clock, each processor broadcasts information when it has something important to say. I show how TMSN can be used to distribute boosting, K-means++, and stochastic gradient descent. I will also show that TMSN is a natural fit for adaptive sensor networks.
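A minimal simulation of this broadcast rule (my own toy instantiation, not the talk's protocol): each worker maintains a running mean of its local stream and re-broadcasts only when the value has moved more than a threshold since its last message. The global estimate then stays within the threshold of the exact answer while most messages are suppressed:

```python
import numpy as np

rng = np.random.default_rng(0)

class Worker:
    """Broadcasts its local statistic only when it has 'something new':
    the value moved more than `thresh` since its last broadcast."""
    def __init__(self, stream, thresh):
        self.stream = stream
        self.thresh = thresh
        self.count = 0
        self.local = 0.0
        self.last_sent = None
        self.messages = 0
    def step(self, t):
        self.count += 1
        self.local += (self.stream[t] - self.local) / self.count  # running mean
        if self.last_sent is None or abs(self.local - self.last_sent) > self.thresh:
            self.last_sent = self.local
            self.messages += 1
        return self.last_sent

T, K = 2000, 8
streams = rng.normal(loc=rng.normal(size=K)[:, None], size=(K, T))
workers = [Worker(s, thresh=0.05) for s in streams]

for t in range(T):
    broadcasts = [w.step(t) for w in workers]

global_est = np.mean(broadcasts)           # built only from broadcast values
exact = streams.mean()                     # what full communication would give
total_msgs = sum(w.messages for w in workers)
print(abs(global_est - exact), total_msgs, "of", T * K)
```

Because each worker's running mean settles as 1/t, broadcasts become rare over time, so total communication is a small fraction of the synchronous T*K messages.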

### Monday, September 11th, 2017

Dependent rounding, originating from the "pipage rounding" work of Ageev and Sviridenko, is now a fundamental algorithmic approach in combinatorial optimization. We survey this area, with an emphasis on the types of correlation inequalities that are achievable/desirable, and applications thereof. Topics covered include connections with matroids, matroid intersection, and scheduling, as well as how to handle situations where some amount of positive correlation is inevitable: the last part is recent joint work with David Harris, Thomas Pensyl, and Khoa Trinh, and arises in facility-location problems.
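The simplest instance of dependent rounding, in the pipage style, works on a fractional vector: repeatedly pick two fractional coordinates and shift mass between them with probabilities chosen so each marginal is preserved in expectation, while the total sum is preserved deterministically. A sketch (for coordinates in [0, 1]):

```python
import random

def dependent_round(x, rng=None):
    """Pipage-style dependent rounding: repeatedly pick two fractional
    coordinates and shift mass between them at random so that at least
    one becomes integral. The sum is preserved exactly, and each
    marginal in expectation: E[X_i] = x_i."""
    rng = rng or random.Random()
    x = list(x)
    eps = 1e-9
    while True:
        frac = [i for i, v in enumerate(x) if eps < v % 1 < 1 - eps]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        a = min(1 - x[i] % 1, x[j] % 1)    # move: x_i up, x_j down
        b = min(x[i] % 1, 1 - x[j] % 1)    # move: x_i down, x_j up
        if rng.random() < b / (a + b):     # probabilities chosen so E[X_i] = x_i
            x[i] += a; x[j] -= a
        else:
            x[i] -= b; x[j] += b
    return [round(v) for v in x]

x = [0.3, 0.5, 0.7, 0.5]                   # fractional solution, sum = 2
rng = random.Random(1)
trials = 20000
counts = [0.0] * len(x)
for _ in range(trials):
    y = dependent_round(x, rng)
    assert sum(y) == 2                     # the cardinality is preserved every time
    for i, v in enumerate(y):
        counts[i] += v
print([c / trials for c in counts])        # close to [0.3, 0.5, 0.7, 0.5]
```

The hard cardinality constraint is what makes the rounding "dependent": the coordinates are negatively correlated, which is exactly the kind of correlation property the talk surveys.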