## August 28, 2019

**Special Guest Lecture:** *Interpretable Machine Learning: Optimal Decision Trees and Optimal Scoring Systems*

**Location:** SAMSI Classroom

**Speaker:** Cynthia Rudin, Prof. of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University, and Assoc. Director, SAMSI

## Bio

Cynthia Rudin is a professor of computer science, electrical and computer engineering, and statistical science at Duke University, where she directs the Prediction Analysis Lab, whose main focus is interpretable machine learning. Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She holds an undergraduate degree from the University at Buffalo and a PhD from Princeton University. She is a three-time winner of the INFORMS Innovative Applications in Analytics Award, was named one of the “Top 40 Under 40” by Poets & Quants in 2015, and was named by Business Insider as one of the 12 most impressive professors at MIT in 2015. She is past chair of both the INFORMS Data Mining Section and the Statistical Learning and Data Science Section of the American Statistical Association. She has also served on committees for DARPA, the National Institute of Justice, and AAAI. She has served on three committees for the National Academies of Sciences, Engineering, and Medicine: the Committee on Applied and Theoretical Statistics, the Committee on Law and Justice, and the Committee on Analytic Research Foundations for the Next-Generation Electric Grid. She is a fellow of the American Statistical Association and a fellow of the Institute of Mathematical Statistics. She will be the Thomas Langford Lecturer at Duke University during the 2019-2020 academic year.

## Abstract

How do patients and doctors know that they can trust predictions from a model that they cannot understand? Transparency in machine learning models is critical in high-stakes decisions, like those made every day in healthcare. My lab creates machine learning algorithms for predictive models that are interpretable to human experts. I will focus on two historically hard optimization problems whose solutions are important in practice:

(1) Optimal sparse decision trees and optimal sparse rule list models. Our algorithms are highly customized branch-and-bound procedures, and are an alternative to CART and other greedy decision tree methods. The solutions are globally optimal with respect to accuracy, regularized by the number of leaves (sparsity). This problem is NP-hard, with no polynomial-time approximation. I will present the first practical algorithms for this problem.
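
As a concrete reading of the objective in (1), here is a minimal Python sketch (my illustration, not the speaker's algorithm): it fits greedy CART trees with scikit-learn and scores each with accuracy regularized by the number of leaves. The dataset and the penalty value `lam` are illustrative assumptions; the branch-and-bound methods in the talk find the global optimum of this objective rather than a greedy approximation.

```python
# A minimal sketch: the sparsity-regularized objective from the abstract,
# evaluated on greedy CART trees for contrast with the optimal methods.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

lam = 0.01  # per-leaf sparsity penalty (hypothetical value)

def regularized_objective(tree, X, y, lam):
    """Misclassification error plus lambda times the number of leaves."""
    error = 1.0 - tree.score(X, y)
    return error + lam * tree.get_n_leaves()

# Greedy baseline: sweep depth limits and report the objective for each.
for depth in (1, 2, 3, 4, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(depth, tree.get_n_leaves(), regularized_objective(tree, X, y, lam))
```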

(2) Optimal scoring systems. Scoring systems are sparse linear models with integer coefficients. Traditionally, scoring systems have been designed using manual feature elimination on logistic regression models, followed by a post-processing step in which the coefficients are rounded. However, this process does not produce optimal solutions. I will present a novel cutting plane method for producing scoring systems from data. The solutions are globally optimal with respect to the logistic loss, regularized by the number of terms (sparsity), with coefficients constrained to be integers.
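
To make the contrast in (2) concrete, the following sketch implements the *traditional* pipeline the abstract critiques: sparse logistic regression followed by rounding the coefficients to integer "points." The dataset, the L1 penalty standing in for manual feature elimination, and the `scale` multiplier are illustrative assumptions; the cutting plane method in the talk optimizes the integer-coefficient model directly.

```python
# A minimal sketch of the traditional (suboptimal) scoring-system pipeline:
# fit a sparse logistic regression, then round coefficients to integers.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 penalty stands in for the manual feature elimination step.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

scale = 2.0  # hypothetical multiplier chosen before rounding
points = np.round(scale * model.coef_[0]).astype(int)  # integer "points"
print({i: p for i, p in enumerate(points) if p != 0})
```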

These algorithms have been used for many medical applications and criminal justice applications.

*Work with Margo Seltzer and Berk Ustun, as well as Elaine Angelino, Nicolas Larus-Stone, Daniel Alabi, Sean Hu, and Jimmy Lin.*

## References

*Certifiably Optimal Rule Lists*

https://arxiv.org/pdf/1704.01701.pdf

JMLR 2018 & KDD 2017

*Learning Risk Scores from Large-Scale Datasets*

http://web.mit.edu/ustunb/www/docs/OptimizedRiskScores.pdf

JMLR 2019 (accepted) & KDD 2017

*Optimal Sparse Decision Trees*

https://arxiv.org/abs/1904.12847

**Presentation Slides**

**Video**

## September 4, 2019

**Special Guest Lecture:** *There is a Kernel Method for That*

**Location:** SAMSI Classroom

**Speaker:** Ernest Fokoué, Professor of Statistics, Rochester Institute of Technology

## Bio

Ernest Fokoué is Professor of Statistics in the School of Mathematical Sciences at Rochester Institute of Technology. He enjoys the honor of being the firstborn of the SAMSI postdoctoral fellows who got the institute going with the Data Mining and Machine Learning (DMML) program in 2003. He is one of the co-leaders of the 2019-2020 SAMSI Games, Decisions, Risk and Reliability (GDRR) program and will be spending his whole sabbatical year contributing to its activities. His areas of research and teaching interest are Bayesian statistics, statistical machine learning, computational statistics, epistemology, theology, and linguistics.

## Abstract

In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to general-purpose kernel methods such as Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers, as the hype around deep neural network architectures seems to suggest, or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity” (“Entia non sunt multiplicanda praeter necessitatem”)? I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.
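
As a small illustration of the "shallow kernel machine" theme, here is a hedged sketch (my example, not the speaker's) fitting a Gaussian process with an RBF kernel to a smooth one-dimensional function using scikit-learn; the data and hyperparameter values are arbitrary assumptions.

```python
# A minimal sketch: a "shallow" kernel machine -- Gaussian process
# regression with an RBF kernel -- approximating a well-behaved function
# without any deep architecture.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(40)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05**2)
gp.fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
print(np.c_[X_test.ravel(), mean, std])  # posterior mean and uncertainty
```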

**Presentation Slides**

**Video**

## September 11, 2019

**Special Guest Lecture:** *Attacking the Curse of Dimensionality using Sums of Separable Functions*

**Location:** SAMSI Classroom

**Speaker:** Martin Mohlenkamp, Assoc. Professor, Dept. of Mathematics, Ohio University

## Abstract

Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach is to approximate and use the tensor in the (so-called) canonical tensor format. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.
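
A minimal sketch of the separable format itself (illustrating storage and pointwise evaluation only, not the speaker's algorithms for computing within it): a function of d discretized variables stored as a sum of r products of univariate factors needs d·n·r numbers instead of n^d. All sizes below are arbitrary.

```python
# Canonical (separable) format: f(i_1,...,i_d) = sum_l prod_j g_l^(j)[i_j].
# Storage grows like d*n*r instead of n**d.
import numpy as np

d, n, r = 6, 20, 3  # variables, grid points per variable, separation rank

# Random univariate factors sampled on an n-point grid: d arrays, each r-by-n.
factors = [np.random.rand(r, n) for _ in range(d)]

def evaluate(idx):
    """Evaluate the separable representation at one multi-index."""
    terms = np.ones(r)
    for j, i in enumerate(idx):
        terms *= factors[j][:, i]  # pick the i-th grid value of each factor
    return terms.sum()

print(evaluate((0, 5, 3, 19, 7, 2)))
print("separable storage:", d * n * r, "vs full grid:", n**d)
```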

**Presentation Slides**

**Video**

## September 18, 2019

*Lecture: Multifidelity Computer Model Emulation with High-Dimensional Output*

**Location:** SAMSI Classroom

**Speaker:** Pulong Ma, Second-Year SAMSI Postdoctoral Fellow

## Abstract

Hurricane-driven storm surge is one of the most deadly and costly natural disasters, making precise quantification of the surge hazard of great importance. Physics-based computer models of storm surge can be implemented at a wide range of fidelities, and the danger posed by surge makes greater fidelity highly desirable. However, such models and their high-dimensional outputs tend to come at great computational cost, which can make highly detailed studies prohibitive. These needs motivate the development of an emulator that combines high-dimensional output from multiple complex computer models at different fidelity levels. We propose a parallel partial autoregressive cokriging model that addresses these issues. Building on a data-augmentation technique, model parameters are estimated via a Monte Carlo expectation-maximization algorithm, and prediction is made in a computationally efficient way even when input designs across fidelity levels are not nested. With this methodology, high-fidelity storm surges can be generated much more quickly in coastal flood studies, facilitating the risk assessment of storm surge hazards.
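
For orientation (the abstract does not state the model equations, so this is an assumption about the model class), autoregressive cokriging emulators in this literature are typically built on the Kennedy-O'Hagan relation between successive fidelity levels:

```latex
% Kennedy--O'Hagan autoregressive structure (assumed, not from the abstract):
f_t(\mathbf{x}) = \rho_{t-1}\, f_{t-1}(\mathbf{x}) + \delta_t(\mathbf{x}),
\qquad t = 2, \dots, T
```

where $f_t$ is the output of the level-$t$ model, $\rho_{t-1}$ an autoregressive coefficient, and $\delta_t$ an independent Gaussian-process discrepancy term; roughly, a "parallel partial" formulation shares correlation parameters across output coordinates so that the emulator scales to high-dimensional output.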

## References

No references provided at this time

## September 25, 2019

*Lecture: Analyzing Collective Motion with Machine Learning and Topology*

**Location:** SAMSI Classroom

**Speaker:** John Nardini, Second-Year SAMSI Postdoctoral Fellow

## Abstract

We use topological data analysis and machine learning to study a seminal model of collective motion in biology. This model describes agents interacting nonlinearly via attractive-repulsive social forces and gives rise to collective behaviors such as flocking and milling. To classify the emergent collective motion in a large library of numerical simulations and to recover model parameters from the simulation data, we apply machine learning techniques to two different types of input. First, we input time series of order parameters traditionally used in studies of collective motion. Second, we input measures based in topology that summarize the time-varying persistent homology of simulation data over multiple scales. This topological approach does not require prior knowledge of the expected patterns. For both unsupervised and supervised machine learning methods, the topological approach outperforms the traditional one.
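
A minimal sketch of the two input types being compared, under assumptions (random stand-in data in place of simulation output, and the `ripser` package for persistent homology): a classical polarization order parameter and a persistence diagram of agent positions at one time step.

```python
# Two summaries of a snapshot of collective motion:
# (1) a classical order parameter, (2) a topological one.
import numpy as np
from ripser import ripser  # pip install ripser

rng = np.random.default_rng(1)
positions = rng.random((100, 2))          # stand-in for simulated agents
velocities = rng.standard_normal((100, 2))

# (1) Polarization: the norm of the mean unit heading (1 = perfect flocking).
headings = velocities / np.linalg.norm(velocities, axis=1, keepdims=True)
polarization = np.linalg.norm(headings.mean(axis=0))

# (2) Persistence diagrams (H0 and H1) of the positions; a long-lived H1
# feature would flag a ring-like "milling" configuration.
dgms = ripser(positions, maxdim=1)["dgms"]
print("polarization:", polarization)
print("H1 features:", len(dgms[1]))
```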

## References

No references provided at this time

## October 2, 2019

*Lecture: On the Trilogy of Nonparametric Methods: Models, Inference, Misspecification*

**Location:** SAMSI Classroom

**Speaker:** Wenjia Wang, Second-Year SAMSI Postdoctoral Fellow

## Abstract

Nonparametric methods provide a flexible framework for understanding the relationship between the inputs and outputs of complex systems. Despite their successful application, many nonparametric methods lack a clear theoretical characterization of their behavior. My talk has two parts. In Part I, I will present recent work on error bounds for Gaussian process modeling, a standard nonparametric method used in computer experiments. In Part II, I will present a multi-resolution functional ANOVA model that can handle large-scale, many-input problems.

## References

No references provided at this time

## October 9, 2019

*Lecture: Representation Learning via Disentangled Variational Autoencoders*

**Location:** SAMSI Classroom

**Speaker:** Matthias Sachs, SAMSI Postdoctoral Fellow and Duke Researcher

## Abstract

In this talk I will present some insights into representation learning in the context of the variational autoencoder (VAE) framework, gained during a SAMSI-industry collaboration with a biostatistics group at the Bayer corporation. I will briefly explain the general idea behind a VAE, motivating its construction by the problem of parameterizing a complex generative model, and then discuss VAE approaches that aim at a disentangled code representation (i.e., a code representation whose components correspond to independent factors of the parameterized generative model). I will close the talk by presenting results obtained by applying the discussed techniques to patients’ monitor data in the context of the industry collaboration.
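
As a concrete reference point, here is a hedged sketch of one standard route to a disentangled code, a beta-weighted VAE objective (beta-VAE). This is an illustration only, not the model used in the collaboration; all layer sizes, the data, and the value of `beta` are assumptions.

```python
# A minimal beta-VAE sketch: reparameterized encoder, decoder, and an ELBO
# with a beta weight on the KL term (beta > 1 pressures disentanglement).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=50, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparam.
        return self.dec(z), mu, logvar

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()           # Gaussian recon.
    kl = -0.5 * (1 + logvar - mu**2 - logvar.exp()).sum(dim=1).mean()
    return recon + beta * kl

x = torch.randn(32, 50)  # stand-in for a batch of monitor data
model = VAE()
x_hat, mu, logvar = model(x)
print(beta_vae_loss(x, x_hat, mu, logvar).item())
```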

## References

No references provided at this time

## October 16, 2019

*Lecture: (1) Drivers Learn City-scale Dynamic Equilibrium (2) Probability Estimation on Manifolds via Diffusion*

**Location:** SAMSI Classroom

**Speaker:** Ruda Zhang, First-Year SAMSI Postdoctoral Fellow

## Abstract

The first part prepares for a talk at INFORMS 2019 next week. This paper studies the taxi industry as a game of multi-market competition among firms of equal capacity, in which taxi drivers allocate service time across the street network to maximize income. We prove that the game has a Nash equilibrium that is symmetric, essentially unique, and globally asymptotically stable under both a gradient adjustment process and imitative learning. Using 2009-2013 trip records of New York City yellow cabs, we validate that taxi drivers’ behavior conforms to our prediction and that drivers learn the equilibrium strategy over time.

The second part proposes a method for approximating probability distributions on manifolds. Recent advances in statistics and machine learning have exploited the low-dimensional manifold structure of high-dimensional data sets. Being able to estimate probability distributions on such manifolds enables generative modeling (i.e., sampling) and statistical inference. With heat diffusion as a unifying concept for density estimation on Euclidean spaces and non-trivial manifolds, our paper solves this task using an approximate Neumann heat operator.
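
In the Euclidean special case the abstract starts from, density estimation by heat diffusion reduces to Gaussian kernel smoothing of the empirical measure, as in the minimal sketch below (squared bandwidth playing the role of diffusion time); the manifold construction via the Neumann heat operator is not reproduced here, and the mixture data are an arbitrary example.

```python
# Heat-diffusion density estimation on the line = Gaussian KDE:
# convolving the empirical measure with the heat kernel at time t ~ bw**2.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

kde = gaussian_kde(samples, bw_method=0.2)  # bandwidth ~ sqrt(diffusion time)
grid = np.linspace(-4, 4, 9)
print(np.c_[grid, kde(grid)])               # estimated density on the grid
```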

## References

No references provided at this time

## October 23, 2019

*Lecture: Network Analysis for Microbiome Data*

**Location:** SAMSI Classroom

**Speaker:** Xinyi Li, Second-Year SAMSI Postdoctoral Fellow

## Abstract

In this talk, I will first give a brief introduction to microbiomes, which play a central role in many biological processes. The interaction networks among species of microorganisms attract a lot of attention from researchers. I will then give an overview of the methods widely used to infer such networks. In addition, I will introduce our proposed method, which accounts for the characteristics of microbiome data, and evaluate its performance in simulation studies.

## References

No references provided at this time

## October 30, 2019

*Lecture: To be determined*

**Location:** SAMSI Classroom

**Speaker:** Bianca Dumitrascu, First-Year SAMSI Postdoctoral Fellow

## Abstract

To be determined

## References

No references provided at this time

## November 6, 2019

*Lecture: To be determined*

**Location:** SAMSI Classroom

**Speaker:** Deborshee Sen, First-Year SAMSI Postdoctoral Fellow

## Abstract

To be determined

## References

No references provided at this time

## November 13, 2019

*Lecture: To be determined*

**Location:** SAMSI Classroom

**Speaker:** Jaffer Zaidi, First-Year SAMSI Postdoctoral Fellow

## Abstract

To be determined

## References

No references provided at this time

## November 20, 2019

*Lecture: To be determined*

**Location:** SAMSI Classroom

**Speaker:** Maggie Mao, First-Year SAMSI Postdoctoral Fellow

## Abstract

To be determined

## References

No references provided at this time

## November 27, 2019

**NO LECTURE – Thanksgiving Break**

## December 4, 2019

*Lecture: To be determined*

**Location:** SAMSI Classroom

**Speaker:** Jason Poulos, First-Year SAMSI Postdoctoral Fellow

## Abstract

To be determined

## References

No references provided at this time