Seminar for Postdoctoral Fellows, Spring 2016

Postdoc Seminar

Dates: January 20, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Jyotishka Datta

Title: Sparse Signal Recovery for Discrete or Continuous Data

Abstract: Sparse signal detection has been one of the most important challenges in the analysis of large-scale datasets arising from many different disciplines, e.g. genomics, finance and astronomy. In this talk, I will focus on two key aspects of inference on a high-dimensional sparse mean vector: (1) how to provide theoretical justifications for existing methods that perform strongly, and (2) how to use this theoretical insight to develop new approaches that can outperform the current methods in the ‘ultra-sparse’ regime. In the first half of the talk, I will discuss multiple testing optimality for continuous data and prove oracle properties of the popular ‘Horseshoe’ prior [1]. I will then develop a novel prior called the ‘Horseshoe+’ prior [2] that sharpens the ‘Horseshoe’ prior’s signal detection abilities. I will illustrate that the Horseshoe+ prior outperforms the existing methods both in theory and in practice, and correctly identifies the ‘differentially expressed’ genes from microarray data. In the second half, I will briefly discuss inference on high-dimensional sparse count data, which is fundamentally different from the high-dimensional Gaussian case. I will present the ‘Gauss-Hypergeometric’ prior for sparse Poisson means [3], motivated by the growing interest in analyzing sparse count data, and end with an application to detecting mutational hotspots in whole exome sequencing data.

References:
[1] Datta, J. and Ghosh, J. K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–131.
[2] Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2015). The Horseshoe+ Estimator of Ultra-Sparse Signals. arXiv preprint arXiv:1502.00560.
[3] Datta, J. and Dunson, D. B. (2015). Priors for High-Dimensional Sparse Poisson Means. arXiv preprint arXiv:1510.04320. (Biometrika, under revision)

Postdoc Seminar

Dates: January 27, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Andrew Brown, Clemson University (CCNS visitor)

Title: Bayesian Correlated Signal Detection

Abstract: Over the last decade, large-scale multiple testing has found itself at the forefront of modern data analysis. In many applications the data are correlated, so that the observed test statistic used for detecting a non-null case, or signal, at each location in a dataset carries some information about the chances of a true signal at other locations. It is known that classical multiplicity corrections become strongly conservative in the presence of a massive number of tests, while other scalable approaches such as false discovery rate (FDR) control can lose precision or power when the assumption of independence in the data does not hold. In this informal talk, I will present some (relatively) recent work in which we introduce a conditional autoregressive (CAR) model to account for spatial dependence in a Bayesian testing model for large-scale inference. I will focus in particular on an application to fMRI where the model leads to improved identification of neural activation patterns known to be associated with eye movement tasks. Time permitting, I will end by discussing my thoughts about future research directions and interests relevant to the SAMSI CCNS program.

Postdoc Seminar

Dates: February 10, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Adam Jaeger

Title: Topological Data Analysis: Looking at the Underlying Geometry of Data

Abstract: Several modern statistical methodologies revolve around what is commonly referred to as the 'large p small n' problem, where the number of variables collected for each observation greatly exceeds the total number of observations. Many methods designed for high-dimensional data either reduce the number of variables or find appropriate functional combinations. One drawback of this variable selection/combination approach is the view that each variable is by itself a measure of interest. In complex data structures, such as fMRI data, each variable is not necessarily a measure of interest unto itself, but rather a part describing the total structure of interest. Topological Data Analysis (TDA) allows us to examine the underlying topology of the data, and thus to analyze observations in terms of overall structure instead of variable by variable. I will discuss the basic ideas and approaches behind TDA, focusing on summarizing the persistent homology of a data set and the potential for utilizing this information in a statistical framework. I will follow with examples of applying TDA methodology to my current projects at SAMSI: classification of the tertiary structure of proteins, and feature identification and comparison with functional neurological data.

Postdoc Seminar

Dates: February 17, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Ben Risk

Title: Large covariance estimation for spatial functional data with an application to MRI twin studies

Abstract: Twin studies can be used to disentangle the environmental and genetic contributions to brain structure and function. We develop a method for decomposing random effects into additive, common environmental, and unique environmental components for large surface or volume data. We propose two approaches for estimating covariance functions that are used to predict random effects. The first is scalable to large covariance matrices (e.g., 32,000 locations) but requires storage of the full covariance matrix. The second utilizes random projections to avoid explicitly calculating the full covariance matrix and can be applied to hundreds of thousands of points (e.g., 328,000). Simulation studies demonstrate improvements over univariate approaches. We present preliminary results from an analysis of cortical thickness data from the Human Connectome Project.

Postdoc Seminar

Dates: February 24, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Sarah Vallélian

Title: Towards Reduced Order Models for Multispectral Photoacoustic Tomography

Abstract: Photoacoustic tomography (PAT) is an emerging biomedical imaging modality which combines the high spatial resolution of ultrasound tomography with the high contrast of optical tomography. The goal is to reconstruct desired optical properties on the interior of a domain of interest, using measured ultrasound data collected outside the domain. Mathematically this is an inverse problem to determine the coefficients of a diffusion equation, using measured pressure values coming from an acoustic wave equation which is coupled to the diffusion equation via the photoacoustic effect. A popular and well-performing reconstruction approach for this inverse problem is via regularized least-squares minimization. However, this approach is computationally intensive for large-scale (i.e. high-resolution and/or multispectral) images, as it requires repeatedly solving the PDEs in large dimensions. We seek to reduce the computational burden by replacing the PDE models with cheaper yet accurate surrogates which have smaller dimension. In this talk, we present some results for a reduced-order model for the wave equation and discuss next steps towards reduced-order models for the diffusion equation, and for the wavelength dependence in the multispectral case. This is current work with Arvind Saibaba in the Computational Inverse Problems working group.

Postdoc Seminar

Dates: March 2, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Zhengwu Zhang

Title: Robust brain structural connectivity extraction and analysis

Abstract: Structural connection in an individual brain plays a fundamental role in how the mind responds to everyday tasks and challenges. Modern imaging technology such as diffusion MRI (dMRI) makes it easy to peer into an individual brain and collect valuable data to infer the structural connectivity. The difficulty for current statistical analysis and inference of the connectivity of the human brain is to extract a robust, high-resolution connectivity network from dMRI. In this talk, I will present a state-of-the-art pipeline to process dMRI data and extract a robust, high-resolution connectome of the brain. The pipeline includes streamline construction, macroscopic-scale connectivity matrix extraction, connectivity matrix compression, and robust inference of connectivity coupling strength. The Human Connectome Project dataset will be used to demonstrate this pipeline.

Postdoc Seminar

Dates: March 9, 2016 - 1:15pm - 2:15pm

Location: Room 203

Speaker: Daniel Taylor-Rodriguez

Title: Selecting an Optimal Predictive Approximation for the Regression Function with Bernstein Polynomials using an OBayes Approach

Abstract: The order of smoothness chosen in nonparametric estimation problems is critical, balancing a tradeoff between model generality and data overfitting. Commonly used strategies in this context are cross-validation (CV) and generalized cross-validation (GCV). However, these alternatives may be computationally costly and preclude quantification of the uncertainty surrounding this choice. As an alternative, we take an objective Bayes approach to select an appropriate order of smoothness while simultaneously assessing its uncertainty. Although our method can be applied to any series-based smoother, our focus is on approximations arising from Bernstein polynomial expansions, which are shape preserving, among their many other desirable features. For the single-predictor case, we prove that choosing as the order of smoothness the degree of the median probability model with our approach is optimal for prediction. Extending the method to multiple predictors is not a trivial task, as now one must choose the polynomial order for the Bernstein basis functions of each predictor and for all their higher-order interactions. As such, these model spaces can become prohibitively large extremely fast. To control the growth of the model space, we develop a semi-greedy algorithm that builds the model space adaptively using the available data. We compare the performance of our method to that of CV and GCV through simulations. Both our method and GCV are orders of magnitude faster than CV, and all methods have comparable predictive accuracy. Two real data examples with a single predictor are also analyzed. Finally, to illustrate the extension to multiple predictors, we provide preliminary results from a small simulation experiment. [This is joint work with Sujit Ghosh, NCSU/SAMSI].

Postdoc Seminar

Dates: March 16, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Yize Zhao

Title: Bayesian multiresolution prior models and their biomedical applications

Abstract: There is a great need to develop efficient statistical models to select important features from large-scale biomedical datasets. Recently, multiscale/multiresolution ideas have been incorporated into newly developed models to impose sparsity, introduce correlation and allow information transition. In this talk, I will first introduce a new Bayesian multiresolution variable selection prior model, which is an extension of my previous work on neuroimaging applications. The new prior model can introduce prior dependence among features and reduce the posterior computational complexity compared with existing point mass mixture priors. In the second part, I will further extend this idea and talk about a new Bayesian multiresolution Gaussian process selection prior with its potential application to an imaging genetics study. We hope to realize genome-wide complex trait analysis with ultra-high-dimensional imaging phenotypes, as well as to select imaging traits whose heritability can be explained by genetic architecture.

Postdoc Seminar

Dates: March 23, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Duy Hoang Thai

Title: Directional Mean Curvature for Textured Image Deconvolution and Decomposition

Abstract: Approximation theory plays an important role in image processing, especially image deconvolution and decomposition. For piecewise smooth images, a plethora of methods have been developed over the past several decades. The goal of this study is to illustrate one challenging problem in texture analysis which has applications in the Forensics program, e.g. fingerprints, ballistic images and shoe prints. In particular, it is known that texture information is almost destroyed by a blur operator, e.g. when a ballistic image is captured with a low-cost microscope. The contribution of this work is twofold. First, we propose a mathematical model for textured image deconvolution and decomposition into multiple meaningful components, using a fourth-order PDE approach based on the directional mean curvature. Second, we uncover a link between functional analysis and multiscale sampling theory, e.g. harmonic analysis and filter banks. This is preliminary work (with David Banks in the Forensics program) for the next project: A Correlation Based Approach to Quality and Noise in Crime Scene Finger and Shoeprints.

Postdoc Seminar

Dates: March 30, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Christopher Strickland

Title: Modeling Parasitoid Wasp Dispersal from Point Release

Abstract: Parasitic Hymenoptera are a group of insects which are critical for biological pest control and are increasingly being used in agriculture to protect crops via direct release. However, due to their small size (often less than 1 mm), movement and long-distance dispersal of these wasps have long been poorly understood and likely underestimated. Recent data collected by Kristensen et al. (2013) on the wind-borne dispersal pattern of Eretmocerus hayati (0.7 mm long) provide a new and significant opportunity to finally develop a detailed, validated, multi-scale model for the initial spread of invasive insects and biological control introductions.

In this talk I will present a new mathematical model for parasitoid wasp dispersal from point release, as in the case of biocontrol. The model is derived from underlying stochastic processes and, as a special case of the Fokker-Planck equation, is fully deterministic yet robust to changing wind conditions and a variety of non-linear take-off responses. The Python implementation of this model is capable of running month-long simulations on the scale of 15 km^2 while maintaining a resolution of 5 m^2, all within two minutes on a common workstation. Speed is an essential component of our model because it makes it possible to use Bayesian methods to fit parameters to data.

Postdoc Seminar

Dates: April 13, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Lucas Mentch

Title: Supervised Learning in Forensic Science

Abstract: Modern learning algorithms are typically seen as prediction-only tools, meaning the interpretability and intuition provided by a more traditional modeling approach are sacrificed in order to achieve superior predictions. In this talk, we argue that this black-box perspective need not always be the case. We demonstrate that predictions from ensemble learners like bagged trees and random forests, when built with subsamples in lieu of full bootstrap samples, can be viewed as incomplete, infinite-order U-statistics and, as such, are asymptotically normal. Furthermore, we show that the limiting variance depends only on the size of the ensemble relative to the size of the training set and that, by enforcing a structure on the subsamples used in the ensemble, we can form a consistent estimate of variance at no additional computational cost. This allows for statistical inference to be carried out in practice. We'll conclude this talk by demonstrating how such an approach can be applied to problems in criminology and forensic science, in areas such as quality metric evaluation and the identification of randomly acquired characteristics on shoe prints and tire tracks.

Postdoc Seminar

Dates: April 20, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Cedric Neumann, South Dakota State University (Forensics Science visitor)

Title: Modeling the Spatial Relationships between Incidental Characteristics on Items of Forensic Interest

Abstract: Observations made on items of forensic interest are difficult to summarize and describe mathematically. In particular, pattern evidence, such as fingerprint or shoeprint evidence, displays many different types of features that can be represented by different types of variables. The high dimension and heterogeneous nature of the variables observed on pattern evidence make it difficult (not to say impossible) to characterize their joint likelihood structure. Since 1892, several models have been proposed to quantify the probative value of fingerprint evidence. These models have relied heavily either on the assumption of independence between fingerprint features, or on a measure of similarity between pairs of observed patterns to reduce the dimension of the problem. During this talk, we will briefly review some of the issues with earlier models and describe a model that aims to address them by modeling the spatial relationship between fingerprint features. Some limitations of the proposed model will also be discussed.

(with the help of many colleagues, and in particular Prof. Michael Lavine from U. Mass., and Jonah Amponsah and Dr. Christopher P. Saunders from SDSU)

Postdoc Seminar

Dates: April 27, 2016 - 1:15pm - 2:15pm

Location: Room 150

Speaker: Kimberly Kaufeld

Title: A Multivariate Dynamic Spatial Factor Model for Speciated Pollutants and Adverse Birth Outcomes

Abstract: Researchers believe that exposure to high concentrations of air pollution during pregnancy may significantly increase the risk of birth defects and other adverse birth outcomes. While current regulations put limits on total PM2.5 concentrations, there are many speciated pollutants within this size class that likely have varying effects on perinatal health. However, due to correlations between these speciated pollutants, it can be difficult to decipher their effects in a model for birth outcomes. To combat this difficulty, we develop a multivariate spatio-temporal Bayesian model for the speciated particulate matter using dynamic spatial factors. These spatial factors can then be coherently interpolated to provide measurements at the pregnant mothers’ homes to be used in a model for birth outcomes. The model for birth outcomes allows the impacts of pollutants to vary across different weeks of the pregnancy in order to identify susceptible periods. The proposed methodology is illustrated using pollutant monitoring data from the Environmental Protection Agency and birth records from the National Birth Defects Prevention Study.