Postdoctoral Fellow Seminars: Fall 2019

August 28, 2019

Special Guest Lecture: Interpretable Machine Learning: Optimal Decision Trees and Optimal Scoring Systems

Location: SAMSI Classroom
Speaker: Cynthia Rudin, Prof. of Computer Science, Electrical and Computer Engineering, and Statistical Science, Duke University and Assoc. Director, SAMSI

Bio

Cynthia Rudin is a professor of computer science, electrical and computer engineering, and statistical science at Duke University, and directs the Prediction Analysis Lab, whose main focus is interpretable machine learning. Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She holds an undergraduate degree from the University at Buffalo and a PhD from Princeton University. She is a three-time winner of the INFORMS Innovative Applications in Analytics Award, was named one of the “Top 40 Under 40” by Poets and Quants in 2015, and was named by Businessinsider.com as one of the 12 most impressive professors at MIT in 2015. She is past chair of both the INFORMS Data Mining Section and the Statistical Learning and Data Science section of the American Statistical Association. She has also served on committees for DARPA, the National Institute of Justice, and AAAI. She has served on three committees for the National Academies of Sciences, Engineering, and Medicine: the Committee on Applied and Theoretical Statistics, the Committee on Law and Justice, and the Committee on Analytic Research Foundations for the Next-Generation Electric Grid. She is a fellow of the American Statistical Association and a fellow of the Institute of Mathematical Statistics. She will be the Thomas Langford Lecturer at Duke University during the 2019-2020 academic year.

Abstract

How do patients and doctors know that they can trust predictions from a model that they cannot understand? Transparency in machine learning models is critical in high-stakes decisions, like those made every day in healthcare. My lab creates machine learning algorithms for predictive models that are interpretable to human experts. I will focus on two historically hard optimization problems whose solutions are important in practice:

(1) Optimal sparse decision trees and optimal sparse rule list models. Our algorithms are highly customized branch-and-bound procedures. These are an alternative to CART and other greedy decision tree methods. The solutions are globally optimal according to accuracy, regularized by the number of leaves (sparsity). This problem is NP-hard with no polynomial-time approximation. I will present the first practical algorithms for this problem.

(2) Optimal scoring systems. Scoring systems are sparse linear models with integer coefficients. Traditionally, scoring systems have been designed using manual feature elimination on logistic regression models, followed by a post-processing step in which coefficients are rounded. However, this process does not produce optimal solutions. I will present a novel cutting-plane method for producing scoring systems from data. The solutions are globally optimal according to the logistic loss, regularized by the number of terms (sparsity), with coefficients constrained to be integers.

These algorithms have been used in many medical and criminal justice applications.

Work with Margo Seltzer and Berk Ustun, as well as Elaine Angelino, Nicolas Larus-Stone, Daniel Alabi, Sean Hu, and Jimmy Lin.

References

Presentation Slides

Video


September 4, 2019

Special Guest Lecture: There is a Kernel Method for That

Location: SAMSI Classroom
Speaker: Ernest Fokoué, Professor of Statistics, Rochester Institute of Technology

Bio

Ernest Fokoué is Professor of Statistics in the School of Mathematical Sciences at Rochester Institute of Technology. He enjoys the honor of being the primogenito of SAMSI postdoctoral fellows that got the institute going with the Data Mining and Machine Learning (DMML) program in 2003. He is one of the co-leaders of the 2019-2020 SAMSI Games, Decisions, Risk and Reliability (GDRR) program, and will be spending his whole sabbatical year contributing to its activities. His areas of research and teaching interests are Bayesian Statistics, Statistical Machine Learning, Computational Statistics, Epistemology, Theology and Linguistics.

Abstract

In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to some general-purpose kernel methods like Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers, as the hype around Deep Neural Network architectures seems to suggest, or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity” (“Entia non sunt multiplicanda praeter necessitatem”)? I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.

Presentation Slides

Video


September 11, 2019

Special Guest Lecture: Attacking the Curse of Dimensionality using Sums of Separable Functions

Location: SAMSI Classroom
Speaker: Martin Mohlenkamp, Assoc. Professor, Dept. of Mathematics, Ohio University

Abstract

Naive computations involving a function of many variables suffer from the curse of dimensionality: the computational cost grows exponentially with the number of variables. One approach to bypassing the curse is to approximate the function as a sum of products of functions of one variable and compute in this format. When the variables are indices, a function of many variables is called a tensor, and this approach amounts to approximating and using the tensor in the (so-called) canonical tensor format. In this talk I will describe how such approximations can be used in numerical analysis and in machine learning.

Presentation Slides

Video


September 18, 2019

Lecture: Multifidelity Computer Model Emulation with High-Dimensional Output

Location: SAMSI Classroom
Speaker: Pulong Ma, Second-Year SAMSI Postdoctoral Fellow

Abstract

Hurricane-driven storm surge is one of the deadliest and costliest natural disasters, making precise quantification of the surge hazard of great importance. Physics-based computer models of storm surge can be implemented with a wide range of fidelity due to the nature of the system, though the danger posed by surge makes greater fidelity highly desirable. However, such models and their high-dimensional outputs tend to come at great computational cost, which can make highly detailed studies prohibitive. These needs motivate the development of an emulator that combines high-dimensional output from multiple complex computer models with different fidelity levels. We propose a parallel partial autoregressive cokriging model that is able to address these issues. Based upon the data-augmentation technique, model parameters are estimated via a Monte Carlo expectation-maximization algorithm, and prediction is made in a computationally efficient way when input designs across different fidelity levels are not nested. With this methodology, high-fidelity storm surges can be generated much more quickly in coastal flood studies, which facilitates the risk assessment of storm surge hazards.

References

No references provided at this time


September 25, 2019

Lecture: Analyzing Collective Motion with Machine Learning and Topology

Location: SAMSI Classroom
Speaker: John Nardini, Second-Year SAMSI Postdoctoral Fellow

Abstract

We use topological data analysis and machine learning to study a seminal model of collective motion in biology. This model describes agents interacting nonlinearly via attractive-repulsive social forces and gives rise to collective behaviors such as flocking and milling. To classify the emergent collective motion in a large library of numerical simulations and to recover model parameters from the simulation data, we apply machine learning techniques to two different types of input. First, we input time series of order parameters traditionally used in studies of collective motion. Second, we input measures based on topology that summarize the time-varying persistent homology of simulation data over multiple scales. This topological approach does not require prior knowledge of the expected patterns. For both unsupervised and supervised machine learning methods, the topological approach outperforms the traditional one.

References

No references provided at this time


October 2, 2019

Lecture: On the Trilogy of Nonparametric Methods: Models, Inference, Misspecification

Location: SAMSI Classroom
Speaker: Wenjia Wang, Second-Year SAMSI Postdoctoral Fellow

Abstract

Nonparametric methods provide a flexible framework for understanding the relationship between the input and output of complex systems. Despite the successful application of nonparametric methods, many of them lack a clear theoretical characterization of their behavior. My talk has two parts. In Part I, I will present recent work on error bounds for Gaussian process modeling, which is a standard nonparametric method used in computer experiments. In Part II, I will present a multi-resolution functional ANOVA model that can handle large-scale, many-input problems.

References

No references provided at this time


October 9, 2019

Lecture: Representation Learning via Disentangled Variational Autoencoders

Location: SAMSI Classroom
Speaker: Matthias Sachs, SAMSI Postdoctoral Fellow and Duke Researcher

Abstract

In this talk I will present some insights into representation learning in the context of the variational autoencoder framework, which I gained during a SAMSI-industry collaboration with a biostatistics group at the Bayer corporation. I will briefly explain the general idea behind a Variational Autoencoder (VAE), motivating its construction by the problem of parameterizing a complex generative model, and then discuss VAE approaches that aim at a disentangled code representation (i.e., a code representation whose components correspond to independent factors of the parameterized generative model). I will close the talk by presenting results obtained by applying the discussed techniques to patients’ monitor data in the context of the industry collaboration.

References

No references provided at this time


October 16, 2019

Lecture: (1) Drivers Learn City-scale Dynamic Equilibrium (2) Probability Estimation on Manifolds via Diffusion

Location: SAMSI Classroom
Speaker: Ruda Zhang, First-Year SAMSI Postdoctoral Fellow

Abstract

The first part is preparation for a talk at INFORMS 2019 next week. This paper studies the taxi industry as a game of multi-market competition among firms of equal capacity, where taxi drivers allocate service time across the street network to maximize income. We prove that the game has a Nash equilibrium, which is symmetric, essentially unique, and globally asymptotically stable under the gradient adjustment process and imitative learning. With 2009-2013 trip records of New York City yellow cabs, we validate that taxi drivers’ behavior conforms to our prediction, and that drivers learn the equilibrium strategy over time.

The second part proposes a method for approximating probability distributions on manifolds. Recent advances in statistics and machine learning have exploited the low-dimensional manifold structure of high-dimensional data sets. Being able to estimate probability distributions on such manifolds allows for generative models (i.e., sampling) and statistical inference. With heat diffusion as a unifying concept for density estimation on Euclidean spaces and non-trivial manifolds, our paper solves this task using an approximate Neumann heat operator.

References

No references provided at this time


October 23, 2019

Lecture: Network Analysis for Microbiome Data

Location: SAMSI Classroom
Speaker: Xinyi Li, Second-Year SAMSI Postdoctoral Fellow

Abstract

In this talk, I will first give a brief introduction to microbiomes, which play a central role in many biological processes. The interaction networks between the species of microorganisms often attract a lot of attention from researchers. I will then give an overview of the methods that are widely used to infer these networks. In addition, I will introduce our proposed method, which accounts for the characteristics of microbiome data, and evaluate its performance in simulation studies.

References

No references provided at this time


October 30, 2019

Lecture: Experimental Design in Genomics

Location: SAMSI Classroom
Speaker: Bianca Dumitrascu, First-Year SAMSI Postdoctoral Fellow

Abstract

The traditional biological research pipeline consists of three steps: hypothesis generation, data collection, and data analysis. Data analysis is sometimes followed by a readjustment in hypothesis assessment, allowing for an iterative approach to scientific inquiry. With the decreasing costs of data collection in high-throughput genomics, and with the increasing number of groups pursuing interconnected questions, several experimental design challenges emerge. In this work, we address three experimental challenges motivated by advances in single-cell RNA-seq (scRNA-seq) technologies: budget allocation, marker selection, and multi-modal data aggregation. First, we develop a novel heuristic for contextual bandit problems with logistic rewards, and we show a new, bandit-inspired application to iterative experimental design in multi-tissue scRNA-seq data. We present two algorithms with near-optimal performance: a Good-Toulmin-like estimator via Thompson sampling and a Pitman-Yor prior-based approach. Given a budget and a model of cell type information across tissues, both estimate how many cells should be sampled from each tissue, with the goal of maximizing cell type discovery across samples from multiple iterations. Second, we consider the problem of marker selection in the context of multi-modal data collection. Single-cell data analysis allows for the clustering of cells according to their genomic functionality as represented by their gene expression profiles. Such clustering can be achieved using a variety of methods and an active collaboration between experimentalists and computational groups. However, gene expression provides only one facet in depicting cell identity. Motivated by emerging imaging technologies, we present methods for selecting subsets of marker genes that preserve clusters and cluster hierarchies and that can optimize the imaging of populations of cells.

References

No references provided at this time


November 6, 2019

Lecture: Bayesian Inferences on Uncertain Ranks and Orderings

Location: SAMSI Classroom
Speaker: Deborshee Sen, First-Year SAMSI Postdoctoral Fellow

Abstract

It is common to be interested in rankings or order relationships among entities. In complex settings where one does not directly measure a univariate statistic upon which to base ranks, such inferences typically rely on statistical models having entity-specific parameters. These can be treated as random effects in hierarchical models characterizing variation among the entities. The current literature struggles to present summaries of order relationships which appropriately account for uncertainty. A single estimated ranking can be highly misleading, particularly as it is common that the entities do not vary widely in the trait being measured, leading to large uncertainty and instability in ranking a moderate to large number of them. We observed such problems in attempting to rank player abilities based on data from the National Basketball Association (NBA). Motivated by this, we propose a general strategy for characterizing uncertainty in inferences on order relationships among parameters. Our approach adapts to scenarios in which uncertainty in ordering is high by producing more conservative results that improve interpretability. This is achieved through a reward function within a decision-theoretic framework. We show that our method is theoretically sound and illustrate its utility using simulations and an application to NBA player ability data.

References

No references provided at this time


November 13, 2019

Lecture: Detecting Individual Level ‘Always Survivor’ Causal Effects Under ‘Truncation by Death’ and Censoring Through Time

Location: SAMSI Classroom
Speaker: Jaffer Zaidi, First-Year SAMSI Postdoctoral Fellow

Abstract

The analysis of causal effects when the outcome of interest is possibly truncated by death has a long history in statistics and causal inference. The survivor average causal effect is commonly identified with more assumptions than those guaranteed by the design of a randomized clinical trial or using sensitivity analysis. This paper demonstrates that individual level causal effects in the ‘always survivor’ principal stratum can be identified with no stronger identification assumptions than randomization. We illustrate the practical utility of our methods using data from a clinical trial on patients with prostate cancer. Our methodology is the first and, as of yet, only proposed procedure that enables detecting causal effects in the presence of truncation by death using only the assumptions that are guaranteed by design of the randomized clinical trial.

References

No references provided at this time


November 20, 2019

Lecture: Decision-Adjusted Modeling for Telematics-based Driver Risk Assessment

Location: SAMSI Classroom
Speaker: Maggie Mao, First-Year SAMSI Postdoctoral Fellow

Abstract

Accurate assessment of driver risk is challenging due to the rarity of crashes and high variability among individual drivers. The emerging connected and automated vehicle technology provides rich in-situ telematics driving data that could provide personalized driving behavior information for risk assessment. This talk focuses on developing the optimal decision-driven telematics-based driver risk assessment model. Specifically, we propose a decision-adjusted predictive modeling approach to identify the optimal model setup according to the specific decision, and each setup is optimized using a hybrid of space-filling design and central composite design. The Second Strategic Highway Research Program naturalistic driving study (SHRP2 NDS), the largest NDS with more than 3400 participants, was used for model development and calibration. The results indicate that the proposed decision-adjusted framework outperforms general model selection rules such as area under the curve (AUC), especially for rare-event data. The study also shows that using telematics information can improve individual driver risk assessment and that the optimal thresholds vary according to the decision rules. The proposed method can be extended to other risk assessment applications in finance, cybersecurity, and health care.

References

No references provided at this time


November 27, 2019

** NO LECTURE – Thanksgiving Break **


December 4, 2019

Lecture: To be determined

Location: SAMSI Classroom
Speaker: Jason Poulos, First-Year SAMSI Postdoctoral Fellow

Abstract

To be determined

References

No references provided at this time