# Postdoctoral Fellows Seminars – Fall 2015

## Orientation for New SAMSI Postdocs

August 26, 2015, 10:00am – 1:30pm
Room 150

Important information about life at SAMSI will be shared. The orientation session will be followed by lunch in the SAMSI Commons.

## Binary Classification Using Significant Features of fMRI Data

September 9, 2015, 1:15pm – 2:15pm
Room 150

### Abstract

Binary classification is the task of using information from individuals to place them into one of two distinct groups. The information can be viewed as features of the individual, some of which may be necessary for proper classification while others may not contribute any information to group placement. In many cases we are interested not only in determining which group an individual belongs to, but also in identifying the minimum set of features necessary to place the individual. One approach to performing both classification and feature selection is to combine linear discriminant analysis (LDA) with a threshold function. LDA creates a classifier function to determine group placement, and by applying a threshold function to the classifier, we can determine which features contribute information towards group placement. When using fMRI data this method would, in theory, allow for classification of certain conditions along with identification of the brain areas that show different behavior between affected and unaffected individuals. The practical application of this approach is confounded by multiple issues, such as the large p, small n problem and the nature of the data itself. In this talk I will demonstrate application of this method along with alternate classification methods that work around the large p, small n problem. Additionally, I will suggest possible modifications such as alternate classifier functions, computational alternatives, and changes to fMRI data preprocessing to address the practical challenges of this method.
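The idea of pairing LDA with a threshold can be illustrated with a minimal sketch. This is not the speaker's implementation: the function names, the ridge term, the hard-threshold rule, and the simulated data are all assumptions made for illustration, and the example uses more samples than features to stay well conditioned, unlike the large p, small n setting of fMRI data.

```python
import numpy as np

def lda_threshold(X, y, threshold):
    """Fit a two-class LDA direction, then zero out small coefficients."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    # Pooled within-class covariance estimate.
    S = ((n0 - 1) * np.cov(X0, rowvar=False)
         + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    # Small ridge term (an assumption) to keep the matrix invertible.
    S += 1e-3 * np.eye(X.shape[1])
    w = np.linalg.solve(S, mu1 - mu0)
    # Hard threshold: features with small weights are dropped.
    w_sparse = np.where(np.abs(w) >= threshold, w, 0.0)
    midpoint = 0.5 * (mu0 + mu1)
    return w_sparse, midpoint

def predict(X, w, midpoint):
    """Assign class 1 when the thresholded discriminant is positive."""
    return ((X - midpoint) @ w > 0).astype(int)

# Simulated data: only the first two of ten features separate the groups.
rng = np.random.default_rng(0)
n, p = 400, 10
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :2] += 2.0

w, mid = lda_threshold(X, y, threshold=1.0)
selected = np.flatnonzero(w)          # indices of retained features
accuracy = (predict(X, w, mid) == y).mean()
print("selected features:", selected, "training accuracy:", accuracy)
```

The nonzero entries of the thresholded weight vector play the role of the selected features; with real imaging data the covariance estimate and the choice of threshold would need far more care.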

## Joint Species distribution modeling: dimension reduction using Dirichlet processes

September 16, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Kimberly Kaufeld

### Abstract

A primary goal of ecology is to understand the fundamental processes underlying the geographic distributions of species. To understand this relationship, ecologists use species distribution models (SDMs). These models describe the relationship between the abundance of species and environmental variables, and predict the species’ geographic ranges from occurrence and environmental data at the same sites. Recently, interest has focused on joint species distribution models (JSDMs), which accommodate multiple species rather than modeling each species independently. JSDMs estimate distributions of multiple species simultaneously and allow decomposition of species co-occurrence patterns into components describing shared environmental responses and residual patterns of co-occurrence. However, modeling and fitting joint species distributions when the number of species is on the order of hundreds poses methodological and computational challenges. With a large number of species there are too many parameters to estimate, making model fitting computationally expensive and sometimes infeasible. We develop a dimension reduction technique to model JSDMs more efficiently. We use a latent multivariate Gaussian model with a generalized Dirichlet process to cluster species. Simulation results show the method to be computationally efficient and highly scalable.

## Shrinkage Priors for Sparse High-Dimensional Discrete or Continuous Data

September 23, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Jyotishka Datta

### Abstract

In the first half of the talk, we will introduce the ‘horseshoe+’ prior for ultra-sparse signal detection, which sharpens the ability of the horseshoe prior that has achieved success in the sparse signal recovery problem. The Horseshoe+ prior builds upon these advantages and enjoys superior theoretical properties over its competitors as well as computational feasibility in high dimensions. We shall discuss the attractive features of the Horseshoe+ prior from both theoretical and practical points of view. Theoretical guarantees include super-efficiency for density estimation in a Kullback-Leibler sense, lower mean squared error in estimating signals compared to the horseshoe, and Bayes optimality in multiple testing. We will also illustrate the superior performance of Horseshoe+ against competitors such as the Horseshoe and Dirichlet-Laplace priors, both in a standard design setting and on a prostate cancer data set, and point out some directions for future research. (This is joint work with Nicholas Polson, Anindya Bhadra, and Brandon Willard.)

In the second half of the talk, we talk about shrinkage priors for sparse Poisson means, motivated by the growing interest in analyzing high-dimensional sparse count data in a variety of application areas. For example, in cancer genomic studies it is of interest to model the counts of mutated alleles across the genome. Existing approaches for analyzing multivariate count data via Poisson log-linear hierarchical models cannot flexibly adapt to the level and nature of sparsity in the data. We develop a new class of local-global shrinkage priors tailored for sparse counts. We will assess theoretical properties including posterior concentration, super-efficiency in estimating the sampling density, and robustness in posterior mean. An efficient and scalable sampling algorithm is developed for posterior inference. Simulation studies illustrate excellent small sample properties relative to competitors, and we apply the method to model rare mutation data from the Exome Aggregation Consortium project. (This is joint work with David Dunson.)
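For readers unfamiliar with the priors named in the first half, the two hierarchies can be written side by side. This is a sketch of the commonly cited forms for a normal means model, not text from the talk; the symbols $\theta_i$, $\lambda_i$, $\eta_i$, $\tau$ and the half-Cauchy notation $C^+$ are assumptions introduced here.

$$
\text{Horseshoe:}\quad \theta_i \mid \lambda_i, \tau \sim \mathcal N(0, \lambda_i^2 \tau^2), \qquad \lambda_i \sim C^+(0, 1)
$$

$$
\text{Horseshoe+:}\quad \theta_i \mid \lambda_i \sim \mathcal N(0, \lambda_i^2), \qquad \lambda_i \mid \eta_i, \tau \sim C^+(0, \tau \eta_i), \qquad \eta_i \sim C^+(0, 1)
$$

The extra half-Cauchy layer $\eta_i$ is what distinguishes the horseshoe+ from the horseshoe: $\lambda_i$ and $\eta_i$ are local scales that let individual signals escape shrinkage, while $\tau$ is a global scale that controls the overall level of sparsity.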

## Computational Methods for Inverse Problems in Neuroimaging: Current and Proposed Research Activities

September 30, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Sarah Vallélian

### Abstract

In inverse problems, one is interested in reconstructing some parameters of interest in a given system, from measured data of the system output. Inverse problems in neuroimaging are particularly challenging from a computational perspective as they are characterized by both complex high-dimensional models and large-scale data sets. The goals of the Computational Inverse Problems working group are to develop new computational tools and techniques to address these challenges, which are suitable for a broad range of neuroimaging modalities.

In this talk I will present some research directions which are the subject of the working group at present. I will focus in particular on current work with Arvind Saibaba on reduced-order modeling for multispectral photoacoustic tomography, a relatively new and very promising neuroimaging modality currently used in small animal trials.

## Opening Pandora’s Box In FMRI: A Review of IRAP and Current Progress of the Working Group

October 7, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Daniel Rowe, Marquette University (CCNS visitor)

### Abstract

In fMRI and fcMRI, many reconstruction and processing steps are performed on the data, and a large portion of the data is omitted from analysis. The processing changes the image mean and variance, and may induce correlation. The magnitude image operation also changes the voxel’s statistical distribution. Newer image reconstruction processes can also induce long range correlations. The IPAR working group has been reviewing some image processing and reconstruction methods. Since complex-valued images are always the result, complex-valued analysis methods will also be reviewed.

## Bias and Uncertainty in Forensic Evidence

October 14, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Lucas Mentch

### Abstract

Forensic evidence and the reporting of such evidence by forensic scientists play a crucial role in the criminal justice system in the United States. However, with the (possible) exception of DNA evidence, serious questions exist as to the validity and consistency of other forms of forensic evidence. I will begin this talk with a general overview of some of these ongoing issues and challenges, followed by a more detailed description of some of the problems that we hope to address in the months ahead. I will focus first on the role of contextual bias in both how evidence is evaluated by examiners and how jurors might perceive and interpret such biases. I will then discuss the role of matching algorithms in evaluating suspects and describe a general procedure for quantifying the uncertainty of such algorithms in the context of fingerprint evidence. Finally, I’ll end with some thoughts on paths forward and potential future research.

## Bayesian Feature Selection for Ultra-high Dimensional Imaging Genetics Data

October 21, 2015, 1:15pm – 2:15pm
Room 104
Speaker: Yize Zhao

### Abstract

Our work is motivated by imaging genetics studies whose goal is to perform feature selection among multivariate phenotypes and ultra-high dimensional genotypes. For instance, in the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the ideal imaging genetics association analysis is whole-brain and whole-genome wide. By associating each genetic marker with each individual imaging trait in a joint framework, one could identify disease-related genetic factors and corresponding brain locations based on highly associated genotype-phenotype pairs. Currently, few works tackle this problem efficiently due to the presence of ultra-high dimensional data and multivariate response variables. In this work, we propose a novel multilevel sequential selection procedure under a Bayesian multivariate response regression model (MRRM) to select informative features among multivariate responses and ultra-high dimensional predictors. Specifically, we recast the identification of nonzero elements in the sparse coefficient matrix as a hierarchical feature selection problem by first selecting potential nonzero rows of the matrix (genotype selection) and then localizing the nonzero elements within the marked rows (phenotype selection). The genotype-wise selection is accomplished by constructing multilevel auxiliary selection models under different scales, with the actual scale auxiliary model treated as another level for the ultimate phenotype-wise selection. This procedure allows the posterior inference to be “reweighted” so that it concentrates more efficiently on the potential signals in a sequential fashion, which dramatically reduces the computational cost and improves the mixing of the Markov chain. Extensive simulations are provided to show the superiority of our method compared with several competing approaches. We also apply the method to the ADNI data and obtain biologically meaningful results.

## A functional structural equation model on spatial domains for estimating heritability from twin data

October 28, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Benjamin Risk

### Abstract

Twin studies can be used to disentangle the environmental and genetic contributions to brain structure and function. The Human Connectome Project has generated large amounts of preprocessed imaging data from twin pairs. A structural equation model (SEM) can be used to estimate a trait’s heritability. A massive univariate analysis would estimate an SEM at each location in the brain. An important question is whether the genetic contribution is significant over a region-of-interest. Extending the massive univariate modeling approach to arbitrary spatial domains requires an estimation of the covariance functions. Here we propose a structural equation model for functional data in which we model spatial dependencies on the sphere. Our framework allows for inference over arbitrary domains of the cortex. Additionally, the approach could be used for improving predictions. We present preliminary results and future directions.

## Structural Brain Connectivity Analysis on Human Connectome Project Dataset

November 4, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Zhengwu Zhang

### Abstract

The connection structure in an individual’s brain plays a fundamental role in how the mind responds to everyday tasks and life’s challenges. Modern imaging technology makes it easy to peer into an individual’s brain and collect valuable data. This project focuses on two threads: (1) exploring the state-of-the-art data processing algorithms to reliably measure brain structural networks non-invasively in live humans; (2) developing the mathematical tools necessary for understanding these data, e.g. understanding how the brain connectivity varies among healthy individuals and according to phenotypes (behavioral traits, neurological disorders, etc.), genetic factors, and demographics (age, gender, etc.). In this talk, I will present our recent progress on the Human Connectome Project Dataset and point out the ongoing and potential research projects.

## Nonparametric Bayesian Variable Selection, Clustering and Prediction for High-Dimensional Regression

November 11, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Subharup Guha, University of Missouri-Columbia (CCNS visitor)

### Abstract

The development of parsimonious models for reliable inference and prediction of responses in high-dimensional regression settings is often challenging due to relatively small sample sizes and the presence of complex interaction patterns between a large number of covariates. We propose an efficient, nonparametric framework for simultaneous variable selection, clustering and prediction in high-throughput regression settings with continuous or discrete outcomes, called VariScan.

The VariScan model utilizes the sparsity induced by Poisson-Dirichlet processes (PDPs) to group the covariates into lower-dimensional latent clusters consisting of covariates with similar patterns among the samples. The data are permitted to direct the choice of a suitable cluster allocation scheme, choosing between PDPs and their special case, Dirichlet processes. Subsequently, the latent clusters are used to build a nonlinear prediction model for the responses using an adaptive mixture of linear and nonlinear elements, thus achieving a balance between model parsimony and flexibility.

We investigate theoretical properties of the VariScan procedure that differentiate the allocation patterns of PDPs and Dirichlet processes, both in terms of the number and the relative sizes of their clusters. Contrary to conventional belief, cluster detection is shown to be a posteriori consistent for a general class of models as the number of covariates and subjects grows, guaranteeing the high accuracy of the model-based clustering procedure. Additional theoretical results establish model selection and prediction consistency. Through simulation studies and analyses of benchmark data sets, we demonstrate the reliability of VariScan’s clustering mechanism and show that the technique compares favorably to, and often outperforms, existing methodologies in terms of the prediction accuracies of the responses.
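The contrast between the allocation patterns of Poisson-Dirichlet and Dirichlet processes can be seen in a small simulation. The sketch below is not VariScan; it assumes the standard two-parameter Chinese restaurant process as the seating rule, and the function name `sample_partition` and the parameter values are hypothetical choices for illustration.

```python
import numpy as np

def sample_partition(n, alpha, discount, rng):
    """Seat n items with the two-parameter Chinese restaurant process.

    discount = 0 gives the Dirichlet process; 0 < discount < 1 gives a
    Poisson-Dirichlet (Pitman-Yor) process. Returns the cluster sizes.
    """
    sizes = [1]  # the first item always opens the first cluster
    for _ in range(1, n):
        k = len(sizes)
        # Existing cluster j gets weight (size_j - discount);
        # a new cluster gets weight (alpha + k * discount).
        weights = np.array([s - discount for s in sizes]
                           + [alpha + k * discount])
        choice = rng.choice(k + 1, p=weights / weights.sum())
        if choice == k:
            sizes.append(1)
        else:
            sizes[choice] += 1
    return sizes

rng = np.random.default_rng(0)
dp_sizes = sample_partition(500, alpha=1.0, discount=0.0, rng=rng)
pdp_sizes = sample_partition(500, alpha=1.0, discount=0.5, rng=rng)
print("Dirichlet process clusters:", len(dp_sizes))
print("Poisson-Dirichlet clusters:", len(pdp_sizes))
```

Under these assumptions the positive discount typically yields many more clusters, most of them small, whereas the Dirichlet process yields a number of clusters that grows only logarithmically with the number of items.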

## Multiphase Segmentation For Simultaneously Homogeneous & Textural Images

November 18, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Duy Hoang Thai

### Abstract

In their seminal 1989 paper, Mumford and Shah proposed a model for piecewise smooth approximation (for image restoration) and piecewise constant approximation (for image segmentation) based on minimizing an energy functional; the model is NP-hard due to the Hausdorff 1-dimensional measure $\mathcal H^1$ in $\mathbb R^2$. Later, Chan and Vese proposed the active contour model for two-phase image segmentation, which is solved by a level set method. However, these models do not apply to the larger class of natural images that simultaneously contain texture and piecewise smooth information. By the calculus of variations, we design a bi-level constrained minimization model for simultaneous multiphase homogeneous and textural (on a defined scale) image segmentation by solving a relaxed version of a non-convex (due to a binary setting of a non-convex set) minimization. The cornerstone of this minimization is the introduction of novel norms which are defined in different functional Banach spaces (with the discrete setting), e.g. homogeneous regions, texture and residual are measured by the directional total variation norm, the directional G-norm and a dual of a generalized version of the Besov space, respectively. The key ingredients of this study are: (1) the assumption of the sparsity of a signal in some transform domains; (2) the Banach space G in Meyer’s model to measure the oscillatory components, e.g. texture, which do not have a small norm in $L_1(\mathbb R^2)$ or $L_2(\mathbb R^2)$; (3) a smooth surface and sharp edges in geometric objects in cartoon along with a smooth and sparse texture by the DG3PD model.
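For reference, the Mumford-Shah energy mentioned at the start of the abstract is usually written as follows. The notation is an assumption introduced here, not taken from the talk: $f$ is the observed image on a domain $\Omega \subset \mathbb R^2$, $u$ is its piecewise smooth approximation, $K$ is the edge set, and $\lambda, \mu > 0$ are weights.

$$
E(u, K) = \int_{\Omega \setminus K} |\nabla u|^2 \, dx + \lambda \int_{\Omega} (u - f)^2 \, dx + \mu \, \mathcal H^1(K)
$$

The first term enforces smoothness away from edges, the second keeps $u$ close to the data, and the third penalizes the total edge length through the Hausdorff measure $\mathcal H^1$. In the piecewise constant case $u$ is constant on each region, so the gradient term vanishes and only the fidelity and length terms remain.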

## WHIM: Function Approximation WHere It Matters

December 2, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Michael Lavine, University of Massachusetts Amherst (CCNS visitor)

### Abstract

Statisticians working with a log likelihood function f typically seek f’s maximum and, perhaps, local curvature. However, optimization algorithms can be fooled by local maxima and, even when they’re not fooled, the curvature at the global maximum does not paint a complete picture of f. A new algorithm WHIM, for function approximation WHere It Matters, paints a more complete picture and does not get fooled by local maxima. Specifically, WHIM divides the parameter space into two parts and estimates f to within a pre-specified tolerance everywhere in one part, while guaranteeing that f is far below its maximum on the other part.

This talk introduces WHIM, explains what features make f amenable to WHIM, and shows how several common statistical models contain those features.

## Clustering the Diurnal Patterns in Maize Leaf with RNA-Sequencing Data

December 9, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Daniel Taylor-Rodriguez

### Abstract

To study the diurnal patterns of gene expression profiles along maize leaf development, a set of RNA-sequencing experiments was conducted over 24 hours, with samples taken every two hours from four different sections of maize leaves. There is particular interest in identifying groups of genes whose cyclic expression is similar throughout the day. As a first strategy, we model the mean expression using Fourier series on the mean component for each gene, and develop an algorithm using Dirichlet processes to find genes that cluster in the coefficients of their cyclic components. This first approach was implemented successfully using only a subset of 300 genes. Although the algorithm is relatively efficient on this smaller dataset, its ability to scale up to deal with the 19,000 genes contained in the dataset is limited. This being the case, it is desirable to partition the large clustering problem into smaller, more manageable clustering problems. To accomplish this, we assume an orthogonal basis representation that characterizes the mean cyclic component of each gene. Due to the orthogonal nature of the basis functions, only genes that share the same basis functions can potentially belong to the same cluster. We make use of this feature by incorporating a Bayesian variable selection step, which for each gene chooses the relevant basis functions. Genes whose cyclic components include the same basis functions form a very coarse partition. Within each of these coarse partitions we now make use of the clustering procedure initially proposed to find genes that have similar cyclic behavior. In this talk, we will provide some preliminary details of our research.
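The Fourier representation of a gene's mean cyclic component can be sketched with ordinary least squares. This is an illustration on simulated data, not the speaker's model: the number of harmonics, the coefficient values, the noise level, and the helper name `fourier_design` are assumptions, and the Dirichlet process clustering and Bayesian variable selection steps are omitted.

```python
import numpy as np

# Twelve sampling times, one every two hours over a 24-hour day.
t = np.arange(0, 24, 2.0)
omega = 2 * np.pi / 24  # fundamental frequency of a 24-hour cycle

def fourier_design(t, n_harmonics):
    """Columns: intercept, then cos/sin pairs for each harmonic."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(k * omega * t))
        cols.append(np.sin(k * omega * t))
    return np.column_stack(cols)

B = fourier_design(t, n_harmonics=2)

# Simulated expression profile for one hypothetical gene.
true_coef = np.array([5.0, 2.0, -1.0, 0.0, 0.5])
rng = np.random.default_rng(1)
y = B @ true_coef + rng.normal(scale=0.1, size=t.size)

# Least-squares estimate of the Fourier coefficients.
coef, *_ = np.linalg.lstsq(B, y, rcond=None)

# On this equally spaced grid the basis columns are mutually orthogonal.
gram = B.T @ B
print("estimated coefficients:", np.round(coef, 2))
```

Because the columns are orthogonal on this grid, each coefficient is estimated independently of the others, which is the property the abstract exploits when it partitions genes by the basis functions they use.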