Postdoctoral Fellow Seminars: Fall 2019

August 28, 2019

Special Guest Lecture: Interpretable Machine Learning: Optimal Decision Trees and Optimal Scoring Systems

Location: SAMSI Classroom
Speaker: Cynthia Rudin, Prof. of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University and Assoc. Director, SAMSI

Bio

Cynthia Rudin is a professor of computer science, electrical and computer engineering, and statistical science at Duke University, where she directs the Prediction Analysis Lab, whose main focus is interpretable machine learning. Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She holds an undergraduate degree from the University at Buffalo and a PhD from Princeton University. She is a three-time winner of the INFORMS Innovative Applications in Analytics Award, was named one of the “Top 40 Under 40” by Poets and Quants in 2015, and was named by Business Insider as one of the 12 most impressive professors at MIT in 2015. She is past chair of both the INFORMS Data Mining Section and the Statistical Learning and Data Science Section of the American Statistical Association. She has also served on committees for DARPA, the National Institute of Justice, and AAAI, as well as on three committees for the National Academies of Sciences, Engineering, and Medicine: the Committee on Applied and Theoretical Statistics, the Committee on Law and Justice, and the Committee on Analytic Research Foundations for the Next-Generation Electric Grid. She is a fellow of the American Statistical Association and a fellow of the Institute of Mathematical Statistics. She will be the Thomas Langford Lecturer at Duke University during the 2019-2020 academic year.

Abstract

How do patients and doctors know that they can trust predictions from a model they cannot understand? Transparency in machine learning models is critical in high-stakes decisions, like those made every day in healthcare. My lab creates machine learning algorithms for predictive models that are interpretable to human experts. I will focus on two historically hard optimization problems whose solutions are important in practice:

(1) Optimal sparse decision trees and optimal sparse rule list models. Our algorithms are highly customized branch-and-bound procedures and provide an alternative to CART and other greedy decision tree methods. The solutions are globally optimal according to accuracy, regularized by the number of leaves (sparsity). This problem is NP-hard with no polynomial-time approximation. I will present the first practical algorithms for this problem.

(2) Optimal scoring systems. Scoring systems are sparse linear models with integer coefficients. Traditionally, scoring systems have been designed by manual feature elimination on logistic regression models, followed by a post-processing step in which the coefficients are rounded. However, this process does not produce optimal solutions. I will present a novel cutting-plane method for producing scoring systems from data. The solutions are globally optimal according to the logistic loss, regularized by the number of terms (sparsity), with coefficients constrained to be integers.
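To make the contrast concrete, here is a minimal sketch of the traditional fit-then-round pipeline that the abstract argues against (not the speaker's cutting-plane method). The data, seed, and optimization settings are invented for illustration.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)          # gradient of logistic loss
    return w

# Synthetic data with a known coefficient pattern
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

w = fit_logistic(X, y)
# Naive post-hoc rounding to integer "points": simple, but generally
# suboptimal for the constrained problem the talk solves exactly.
scores = np.round(w).astype(int)
```

Rounding after fitting ignores the integrality constraint during optimization, which is exactly why it can land far from the globally optimal integer-coefficient model.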

These algorithms have been used for many medical applications and criminal justice applications.

This is joint work with Margo Seltzer and Berk Ustun, as well as Elaine Angelino, Nicolas Larus-Stone, Daniel Alabi, Sean Hu, and Jimmy Lin.

References

Presentation Slides

Video


September 4, 2019

Special Guest Lecture: There is a Kernel Method for That

Location: SAMSI Classroom
Speaker: Ernest Fokoue, Professor of Statistics, Rochester Institute of Technology

Bio

Ernest Fokoué is Professor of Statistics in the School of Mathematical Sciences at Rochester Institute of Technology. He enjoys the honor of being the firstborn of the SAMSI postdoctoral fellows who got the institute going with the Data Mining and Machine Learning (DMML) program in 2003. He is one of the co-leaders of the 2019-2020 SAMSI Games, Decisions, Risk and Reliability (GDRR) program and will be spending his whole sabbatical year contributing to its activities. His research and teaching interests are Bayesian statistics, statistical machine learning, computational statistics, epistemology, theology, and linguistics.

Abstract

In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to some general-purpose kernel methods such as Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers, as the hype around deep neural network architectures seems to suggest, or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity” (“Entia non sunt multiplicanda praeter necessitatem”)? I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.
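As a small illustration of the universal-approximation flavor of shallow kernel methods, the sketch below fits a radial basis function (Gaussian kernel) regressor to a smooth target with a single "layer" of kernel evaluations. The kernel width, ridge term, and data are illustrative choices, not from the lecture.

```python
import numpy as np

def rbf_kernel(A, B, gamma=50.0):
    """Gaussian (RBF) kernel matrix between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# A handful of noiseless samples of a smooth function
x_train = np.linspace(0.0, 1.0, 20)[:, None]
y_train = np.sin(2 * np.pi * x_train[:, 0])

# Kernel ridge regression: one shallow layer of basis functions
K = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(K + 1e-8 * np.eye(len(K)), y_train)

# Evaluate on a fine grid and measure the worst-case error
x_test = np.linspace(0.0, 1.0, 101)[:, None]
y_pred = rbf_kernel(x_test, x_train) @ alpha
err = np.max(np.abs(y_pred - np.sin(2 * np.pi * x_test[:, 0])))
```

A shallow RBF model with only 20 centers already approximates this well-behaved function closely, which is the point of the provocation in the abstract.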

Presentation Slides

Video


September 11, 2019

Special Guest Lecture: Attacking the Curse of Dimensionality using Sums of Separable Functions

Location: SAMSI Classroom
Speaker: Martin Mohlenkamp, Assoc. Professor, Dept. of Mathematics, Ohio University

Abstract

Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and to compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach amounts to approximating the tensor in the (so-called) canonical tensor format and computing with it directly. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.
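The core idea can be sketched in a few lines: store a d-way tensor as a sum of rank-1 (separable) terms and compute with the factors directly, never forming the n^d entries. The sizes and rank below are toy choices for illustration.

```python
import numpy as np

# T[i,j,k,l] = sum_r a_r[i] * b_r[j] * c_r[k] * d_r[l]
# Storage: rank * 4 * n numbers instead of n**4.
n, rank = 10, 2
rng = np.random.default_rng(1)
factors = [rng.normal(size=(rank, n)) for _ in range(4)]  # one matrix per mode

def full_tensor(factors):
    """Expand the canonical format (feasible only for tiny n and d)."""
    T = 0
    for r in range(factors[0].shape[0]):
        T = T + np.einsum('i,j,k,l->ijkl', *(f[r] for f in factors))
    return T

def inner_product(facs1, facs2):
    """<T1, T2> in O(rank^2 * d * n) work, entirely in the separable format."""
    G = np.ones((facs1[0].shape[0], facs2[0].shape[0]))
    for f1, f2 in zip(facs1, facs2):
        G *= f1 @ f2.T          # Gram matrix of the mode-wise factors
    return G.sum()

T = full_tensor(factors)                       # only to verify the shortcut
ip = inner_product(factors, factors)
ref = (T * T).sum()
```

The inner product in the factored format costs O(r² d n), versus O(n^d) for the expanded tensor, which is the sense in which the format attacks the curse of dimensionality.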

Presentation Slides

Video


September 18, 2019

Lecture: Multifidelity Computer Model Emulation with High-Dimensional Output

Location: SAMSI Classroom
Speaker: Pulong Ma, Second-Year SAMSI Postdoctoral Fellow

Abstract

Hurricane-driven storm surge is one of the most deadly and costly natural disasters, making precise quantification of the surge hazard of great importance. Physics-based computer models of storm surge can be implemented at a wide range of fidelities due to the nature of the system, and the danger posed by surge makes greater fidelity highly desirable. However, such models and their high-dimensional outputs tend to come at great computational cost, which can make highly detailed studies prohibitive. These needs make it important to develop an emulator that combines high-dimensional output from multiple complex computer models at different fidelity levels. We propose a parallel partial autoregressive cokriging model that addresses these issues. Using a data-augmentation technique, model parameters are estimated via a Monte Carlo expectation-maximization algorithm, and prediction is made in a computationally efficient way even when the input designs across fidelity levels are not nested. With this methodology, high-fidelity storm surges can be generated much more quickly in coastal flood studies, facilitating the risk assessment of storm surge hazards.
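As background (not the speaker's model), the first-order autoregressive link used in classical multifidelity cokriging can be illustrated in a toy setting: the high-fidelity output is a scaled version of the low-fidelity output plus a discrepancy, and the scale factor is estimated from runs at shared inputs. All functions and numbers here are invented.

```python
import numpy as np

# Toy autoregressive link: y_high(x) = rho * y_low(x) + delta(x)
x = np.linspace(0.0, 1.0, 50)
y_low = np.sin(2 * np.pi * x)                         # cheap, coarse simulator
rho_true = 1.5
discrepancy = 0.2 * np.cos(4 * np.pi * x)             # what the low model misses
y_high = rho_true * y_low + discrepancy               # expensive simulator

# Estimate the scale factor by least squares at inputs run at both levels
rho_hat = (y_low @ y_high) / (y_low @ y_low)

# The residual discrepancy is what a cokriging model would then fit with
# its own Gaussian process, giving predictions of y_high at new inputs.
delta_hat = y_high - rho_hat * y_low
```

In the actual methodology, rho and the discrepancy process are inferred jointly (here via Monte Carlo EM with data augmentation), which is what allows non-nested designs across fidelity levels.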

References

No references provided at this time


September 25, 2019

Lecture: Analyzing Collective Motion with Machine Learning and Topology

Location: SAMSI Classroom
Speaker: John Nardini, Second-Year SAMSI Postdoctoral Fellow

Abstract

We use topological data analysis and machine learning to study a seminal model of collective motion in biology. This model describes agents interacting nonlinearly via attractive-repulsive social forces and gives rise to collective behaviors such as flocking and milling. To classify the emergent collective motion in a large library of numerical simulations and to recover model parameters from the simulation data, we apply machine learning techniques to two different types of input. First, we input time series of order parameters traditionally used in studies of collective motion. Second, we input measures based in topology that summarize the time-varying persistent homology of simulation data over multiple scales. This topological approach does not require prior knowledge of the expected patterns. For both unsupervised and supervised machine learning methods, the topological approach outperforms the traditional one.
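One of the traditional order parameters mentioned above is polarization, which measures how aligned the agents' headings are; it is near 1 for flocking and near 0 for disordered motion. The sketch below is a generic illustration, not the speakers' code, and the synthetic velocity data are invented.

```python
import numpy as np

def polarization(velocities):
    """Alignment order parameter: norm of the mean unit heading vector.
    ~1 when all agents move the same way (flocking), ~0 when disordered."""
    headings = velocities / np.linalg.norm(velocities, axis=1, keepdims=True)
    return np.linalg.norm(headings.mean(axis=0))

rng = np.random.default_rng(3)
# 100 agents nearly aligned along the x-axis (flocking-like state)
aligned = np.tile([1.0, 0.0], (100, 1)) + 0.05 * rng.normal(size=(100, 2))
# 100 agents with random headings (disordered state)
disordered = rng.normal(size=(100, 2))

p_flock = polarization(aligned)
p_disorder = polarization(disordered)
```

A time series of such scalars is one classifier input; the topological alternative instead summarizes the persistent homology of the agent configurations over time and scale.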

References

No references provided at this time


October 2, 2019

Lecture: On the Trilogy of Nonparametric Methods: Models, Inference, Misspecification

Location: SAMSI Classroom
Speaker: Wenjia Wang, Second-Year SAMSI Postdoctoral Fellow

Abstract

Nonparametric methods provide a flexible framework for understanding the relationship between the inputs and outputs of complex systems. Despite their successful application, many nonparametric methods lack a clear theoretical characterization of their behavior. My talk has two parts. In Part I, I will present recent work on error bounds for Gaussian process modeling, a standard nonparametric method used in computer experiments. In Part II, I will present a multi-resolution functional ANOVA model that can handle large-scale, many-input problems.

References

No references provided at this time


October 9, 2019

Lecture: Representation Learning via Disentangled Variational Autoencoders

Location: SAMSI Classroom
Speaker: Matthias Sachs, SAMSI Postdoctoral Fellow and Duke Researcher

Abstract

In this talk I will present some insights into representation learning in the context of the variational autoencoder framework, gained during a SAMSI-industry collaboration with a biostatistics group at Bayer. I will briefly explain the general idea behind a variational autoencoder (VAE), motivating its construction by the problem of parameterizing a complex generative model, and then discuss VAE approaches that aim at a disentangled code representation (i.e., a code representation whose components correspond to independent factors of the parameterized generative model). I will close the talk by presenting results obtained by applying these techniques to patients’ monitor data in the context of the industry collaboration.
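A key mechanical ingredient of any VAE is the reparameterization trick: the latent code is sampled as a deterministic function of the encoder outputs plus exogenous noise, so gradients can flow through the sampling step. The sketch below shows just that piece in NumPy; the dimensions and values are illustrative.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
    so z ~ N(mu, diag(sigma^2)) but is differentiable in (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(4)
mu = np.array([0.0, 2.0])         # encoder mean for one input
log_var = np.array([0.0, -2.0])   # encoder log-variance for one input

# Draw many samples to check the implied latent distribution
z = np.stack([sample_latent(mu, log_var, rng) for _ in range(5000)])
```

Disentanglement methods (e.g., reweighting the KL term) then push the components of z toward independent, interpretable factors of the generative model.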

References

No references provided at this time


October 16, 2019

Lecture: (1) Drivers Learn City-scale Dynamic Equilibrium (2) Probability Estimation on Manifolds via Diffusion

Location: SAMSI Classroom
Speaker: Ruda Zhang, First-Year SAMSI Postdoctoral Fellow

Abstract

The first part is to prepare for a talk at INFORMS 2020 next week. This paper studies the taxi industry as a game of multi-market competition among firms of equal capacity, in which taxi drivers allocate service time across the street network to maximize income. We prove that the game has a Nash equilibrium that is symmetric, essentially unique, and globally asymptotically stable under a gradient adjustment process and under imitative learning. Using 2009-2013 trip records of New York City yellow cabs, we validate that taxi drivers’ behavior conforms to our prediction and that drivers learn the equilibrium strategy over time.

The second part proposes a method for approximating probability distributions on manifolds. Recent advances in statistics and machine learning have exploited the low-dimensional manifold structure of high-dimensional data sets. Being able to estimate probability distributions on such manifolds enables generative models (i.e., sampling) and statistical inference. With heat diffusion as a unifying concept for density estimation on Euclidean spaces and on non-trivial manifolds, our paper solves this task using an approximate Neumann heat operator.
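The Euclidean end of that unifying view is familiar: applying the heat operator for time t to the empirical measure of the samples is exactly Gaussian kernel density estimation with bandwidth determined by t. The 1-D sketch below illustrates this special case only (the manifold setting with the Neumann heat operator is the paper's contribution); the diffusion time and data are illustrative.

```python
import numpy as np

def heat_kernel_density(x_eval, samples, t=0.05):
    """Apply the 1-D Euclidean heat kernel for time t to the empirical
    measure of the samples: this is Gaussian KDE with bandwidth sqrt(2t)."""
    d2 = (x_eval[:, None] - samples[None, :]) ** 2
    return np.exp(-d2 / (4.0 * t)).mean(axis=1) / np.sqrt(4.0 * np.pi * t)

rng = np.random.default_rng(5)
samples = rng.normal(size=2000)          # draws from a standard normal
grid = np.linspace(-3.0, 3.0, 61)
dens = heat_kernel_density(grid, samples)

# Sanity checks: the estimate should integrate to ~1 and peak near 0
dx = grid[1] - grid[0]
mass = dens.sum() * dx
mode = grid[np.argmax(dens)]
```

On a manifold the Gaussian kernel is no longer available in closed form, which is why an approximate heat operator (with Neumann boundary conditions) takes its place.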

References

No references provided at this time


October 23, 2019

Lecture: Network Analysis for Microbiome Data

Location: SAMSI Classroom
Speaker: Xinyi Li, Second-Year SAMSI Postdoctoral Fellow

Abstract

In this talk, I will first give a brief introduction to microbiomes, which play a central role in many biological processes. The interaction networks among species of microorganisms often attract considerable attention from researchers. I will then give an overview of the methods widely used to infer such networks. In addition, I will introduce our proposed method, which accounts for the characteristics of microbiome data, and evaluate its performance in simulation studies.

References

No references provided at this time


October 30, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Bianca Dumitrascu, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time


November 6, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Deborshee Sen, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time


November 13, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Jaffer Zaidi, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time


November 20, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Maggie Mao, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time


November 27, 2019

** NO LECTURE – Thanksgiving Break **


December 4, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Jason Poulos, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time