Postdoctoral Fellows Seminars – Spring 2015

Contagions for topological data analysis of networks

January 21, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Dane Taylor

Abstract

The study of contagions on networks is central to the understanding of collective social processes and epidemiology. When a network is constrained by an underlying manifold such as Earth’s surface (as in most social and transportation networks), it is unclear how much spreading processes on the network reflect such underlying structure, especially when long-range edges are also present. We address the question of when contagions spread predominantly via the spatial propagation of wavefronts (e.g., as observed for the Black Death) rather than via the appearance of spatially distant clusters of contagion (as observed for modern epidemics). To provide a concrete scenario, we study the Watts threshold model (WTM) of social contagions on what we call noisy geometric networks, which are spatially embedded networks that consist of both short-range and long-range edges. Our approach uses multiple realizations of the contagion dynamics to map the network nodes to a point cloud, for which we analyze the geometry, topology, and dimensionality. We apply such maps, which we call WTM maps, to both empirical and synthetic noisy geometric networks. For the example of a noisy ring lattice, our approach yields excellent agreement with a bifurcation analysis of the contagion dynamics. Importantly, we find that for certain dynamical regimes we can identify the network’s underlying manifold in the point cloud, highlighting the utility of our method as a tool for inferring low-dimensional (e.g., manifold) structure in networks. Our work thereby finds a deep connection between the fields of dynamical systems and nonlinear dimension reduction.
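As a concrete illustration of the WTM-map idea, the sketch below builds a small noisy ring lattice, runs a WTM contagion seeded at each node, and stacks the activation times into a point cloud with one coordinate per seeding. This is a minimal reconstruction from the abstract, not the authors' code; the threshold, network size, and number of noisy edges are arbitrary illustrative choices.

```python
import numpy as np

def wtm_map(adj, threshold, seeds, T_max=100):
    """Map nodes to a point cloud via Watts-threshold-model contagions.

    Column j of the output holds every node's activation time for a
    contagion seeded at seeds[j] (the seed node plus its neighbors);
    nodes never activated keep the value T_max.
    """
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    cloud = np.full((n, len(seeds)), float(T_max))
    for j, s in enumerate(seeds):
        active = np.zeros(n, dtype=bool)
        active[s] = True
        active[adj[s] > 0] = True            # seed the node and its neighbors
        cloud[active, j] = 0.0
        for t in range(1, T_max):
            frac = adj @ active / np.maximum(deg, 1)   # active-neighbor fraction
            newly = (~active) & (frac >= threshold)
            if not newly.any():
                break
            active |= newly
            cloud[newly, j] = t
    return cloud

# Noisy ring lattice: each node linked to its 2 nearest neighbors per side,
# plus a few random long-range ("noisy") edges.
rng = np.random.default_rng(0)
n = 60
adj = np.zeros((n, n))
for i in range(n):
    for d in (1, 2):
        adj[i, (i + d) % n] = adj[(i + d) % n, i] = 1
for _ in range(10):
    i, j = rng.integers(0, n, 2)
    if i != j:
        adj[i, j] = adj[j, i] = 1

cloud = wtm_map(adj, threshold=0.3, seeds=range(n))
print(cloud.shape)   # one n-dimensional point per node
```

The geometry, topology, and dimensionality of `cloud` can then be probed with standard point-cloud tools (e.g., persistent homology or dimension estimators) to look for the underlying ring.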


Network Spread and Control of Ecological Invasive Species

January 28, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Christopher Strickland

Abstract

I will introduce a method for coupling vector-based transportation networks (e.g., agents traveling by vehicle) to a spatially continuous model of a biological epidemic. Analysis for relatively slowly establishing invasions yields a unique, stable, steady-state solution with an optimal control for the infected vectors, where the network topology and the potential for the invasion to be self-sustaining affect the efficacy of the control. Numerical results are shown for the cheatgrass invasion of Rocky Mountain National Park, based on the presence model of Strickland et al. (2014).


Bayesian Hierarchical Variable Selection for Genome-wide Association Studies

February 4, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Yize Zhao

Abstract

It is becoming increasingly important in genome-wide association studies (GWAS) to select genetic information relevant to a dichotomous variable or a quantitative trait. For instance, in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, the determination of disease-related genetic factors plays a vital role in the prevention of Alzheimer’s Disease (AD) at an early stage. The discovery of biological associations among SNPs motivates the construction of SNP-sets along the genome and, in turn, the incorporation of this grouping information into the selection procedure to increase the selection power and produce more biologically meaningful results. To this end, we propose a unified Bayesian framework that allows hierarchical variable selection at both the SNP-set (group) level and the SNP (within-group) level while simultaneously encouraging a grouping effect among SNPs based on the SNP-sets. To accommodate the ultra-high dimensionality of the data set, we overcome the limitations of existing posterior updating schemes in Bayesian variable selection methods and propose a novel sampling scheme. By constructing an auxiliary variable selection model at the SNP-set level, the new procedure uses the posterior samples of the auxiliary model to guide posterior inference for the target SNP-level selection model. We apply the proposed method to a variety of simulation studies and show that it is computationally efficient and achieves substantially better performance than competing approaches in the selection of both SNP-sets and SNPs. Applying the method to the ADNI data, we identify meaningful genetic information that is highly associated with several different neuroimaging phenotypes. Our method is general and readily applicable to a wide range of biomedical studies.


Statistical Inference on the Three-Dimensional Structure of the Genome by Truncated Poisson Architecture Model

February 11, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Jincheol Park (SAMSI visitor)

Abstract

It has gradually become understood that long-range interactions between a gene and its regulatory elements are among the key phenomena in gene regulation. Hence, the traditional one-dimensional linear view of the genome, which is especially prevalent in mathematical modeling, falls far short of capturing the underlying gene regulation mechanism, and estimating the three-dimensional (3D) structure of the genome is now an essential part of studying genomic function. In recent years, the Hi-C assay, aided by Next Generation Sequencing (NGS) technology, has yielded genome-wide information on physical contacts between distant genomic loci. The availability of genome-wide interaction data has promoted the development of various analytical methods to recover the underlying 3D spatial chromatin structure. In this talk, in order to directly model sequencing counts with many zeros, we introduce a truncated Poisson architecture model. Zeros in the count matrix produced by Hi-C deliver little information for inference on the distances between loci. Therefore, by excluding such imprecise information, the truncated model is expected to enhance the accuracy of the predicted chromosome organization, leading to greater consistency between the resulting chromatin structure and biological function.
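A minimal sketch of the central modeling idea: positive Hi-C counts are treated as zero-truncated Poisson draws whose mean decays with 3D distance. The power-law link `lam = beta * d**(-alpha)` and all parameter values below are illustrative assumptions, not the authors' specification.

```python
import numpy as np
from math import lgamma

def trunc_poisson_loglik(counts, dists, alpha, beta):
    """Log-likelihood of positive contact counts under a zero-truncated
    Poisson with a power-law mean, lam = beta * d**(-alpha).  Zeros are
    excluded, so each term is normalized by P(Y > 0) = 1 - exp(-lam)."""
    counts = np.asarray(counts, dtype=float)
    dists = np.asarray(dists, dtype=float)
    lam = beta * dists ** (-alpha)
    logfact = np.array([lgamma(c + 1.0) for c in counts])
    return float(np.sum(counts * np.log(lam) - lam - logfact
                        - np.log1p(-np.exp(-lam))))

# toy data: distances and counts generated from the assumed model
rng = np.random.default_rng(1)
d = rng.uniform(0.5, 2.0, size=200)
y = rng.poisson(3.0 * d ** (-1.5))
mask = y > 0                       # truncation: keep only observed contacts
alphas = np.linspace(0.5, 2.5, 21)
best = max(alphas, key=lambda a: trunc_poisson_loglik(y[mask], d[mask], a, 3.0))
print(best)   # grid MLE, near the generating alpha = 1.5
```

In the actual inference problem the unknowns are the 3D coordinates of the loci (hence the distances), so the likelihood above would be maximized, or sampled from, over the architecture itself.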


CANCELED DUE TO ADVERSE WEATHER

February 18, 2015, 1:15pm – 2:15pm
Room 150


WASP: Scalable Bayes via barycenters of subset posteriors

February 25, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Sanvesh Srivastava

Abstract

The promise of Bayesian methods for big data sets has not been fully realized, due to the lack of scalable computational algorithms. For massive data, it is necessary to store and process subsets on different machines in a distributed manner. We propose a simple, general, and highly efficient approach, which first runs a posterior sampling algorithm in parallel on different machines for subsets of a large data set. To combine these subset posteriors, we calculate the Wasserstein barycenter via a highly efficient linear program. The resulting estimate of the Wasserstein posterior (WASP) has an atomic form, facilitating straightforward estimation of posterior summaries of functionals of interest. The WASP approach allows posterior sampling algorithms for smaller data sets to be trivially scaled to huge data. We provide theoretical justification in terms of posterior consistency and algorithm efficiency. Examples are provided in complex settings, including Gaussian process regression and nonparametric Bayes mixture models.

The talk is based on joint work with Volkan Cevher (EPFL), David B. Dunson (Duke University), and Quoc Tran-Dinh (EPFL).
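For a one-dimensional functional, the Wasserstein-2 barycenter has a closed form (the average of the subset posteriors' quantile functions), so the linear program is not even needed. The toy sketch below uses that shortcut to combine three subset posteriors; it illustrates the barycenter idea only, not the authors' general multivariate algorithm.

```python
import numpy as np

def wasp_1d(subset_samples, grid_size=1000):
    """Wasserstein-2 barycenter of one-dimensional subset posteriors.
    In 1-d the barycenter's quantile function is the pointwise average
    of the subset quantile functions; the returned atoms are an
    empirical (atomic) representation of the WASP."""
    probs = np.linspace(0.005, 0.995, grid_size)
    return np.mean([np.quantile(s, probs) for s in subset_samples], axis=0)

rng = np.random.default_rng(2)
# toy subset posteriors for a common mean, centered at slightly different values
subsets = [rng.normal(loc=mu, scale=1.0, size=5000) for mu in (0.9, 1.0, 1.1)]
wasp = wasp_1d(subsets)
print(wasp.mean())   # barycenter mean, close to the average of the subset centers
```

Posterior summaries (means, quantiles, credible intervals) then come directly from the atoms in `wasp`, which is the practical payoff of the atomic form mentioned in the abstract.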


A multiscale adaptive learning algorithm for high-dimensional data

March 11, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Wenjing Liao

Abstract

Many data sets in applications lie in a high-dimensional space but exhibit a low-dimensional structure, and it is of great interest to learn a dictionary that provides sparse representations for such data. We will discuss a multiscale geometric method for this purpose. Our method is based on a multiscale partition of the data followed by the construction of piecewise affine approximations. It is adaptive in the sense that our algorithm automatically learns the distribution of the data and chooses the right partition to use. A rate of convergence in terms of the number of samples will be given for a wide class of distributions from which the data are sampled.
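A rough sketch of the flavor of such a method: recursively split the data and stop refining a cell once a k-dimensional affine approximation (local PCA) fits it well, so the partition adapts to the data. The median-coordinate split and tolerance below are simple stand-ins for the multiscale constructions used in practice.

```python
import numpy as np

def local_affine_error(X, k):
    """Squared error of the best rank-k affine approximation of the rows of X."""
    c = X.mean(axis=0)
    _, s, _ = np.linalg.svd(X - c, full_matrices=False)
    return float(np.sum(s[k:] ** 2))

def multiscale_partition(X, k, tol, depth=0, max_depth=8):
    """Adaptively refine the partition until every cell admits a good
    k-dimensional affine (local PCA) approximation."""
    if len(X) <= k + 1 or depth >= max_depth or local_affine_error(X, k) <= tol:
        return [X]
    # split along the coordinate of largest variance (a stand-in for the
    # tree-based partitions used in multiscale geometric methods)
    j = int(np.argmax(X.var(axis=0)))
    m = np.median(X[:, j])
    left, right = X[X[:, j] <= m], X[X[:, j] > m]
    if len(left) == 0 or len(right) == 0:
        return [X]
    return (multiscale_partition(left, k, tol, depth + 1, max_depth)
            + multiscale_partition(right, k, tol, depth + 1, max_depth))

# noisy circle in R^10: intrinsic dimension 1, ambient dimension 10
rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2.0 * np.pi, 2000)
X = np.zeros((2000, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += 0.01 * rng.normal(size=X.shape)
cells = multiscale_partition(X, k=1, tol=0.1)
print(len(cells))   # several cells, each approximately flat
```

The dictionary then consists of the local affine pieces (cell centers plus principal directions); each data point has a sparse representation against the single piece covering its cell.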


A Bayesian Semiparametric area-level model for small area estimation

March 25, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Neung Soo Ha

Abstract

In survey statistics, small area methods and models are used to produce estimates (for instance, average salaries) for geographical areas or population subgroups for which the sample is too small to support direct estimation. One well-known model is the Fay-Herriot model, which can be interpreted as a linear mixed effects model in which normality of the random effects is assumed. Because the random effects are not observed, it is difficult to check the assumption of normality (or any other parametric assumption). In this presentation, we consider extensions of the Fay-Herriot model in which the default normality assumption for the random effects is replaced by a nonparametric specification. We explore the estimation of individual area means as well as the distribution of their ensemble. The viability of the approach and its effects are investigated using the National Survey of Recent College Graduates to estimate average salaries for different demographic subgroups.
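For reference, the baseline Fay-Herriot EBLUP, with normal random effects and the random-effect variance A taken as known, can be sketched as follows; the nonparametric extension discussed in the talk replaces exactly this normality assumption. Parameter values in the demo are invented.

```python
import numpy as np

def fay_herriot_eblup(y, X, D, A):
    """EBLUP of small-area means under the Fay-Herriot model
        y_i = x_i' beta + v_i + e_i,  v_i ~ N(0, A),  e_i ~ N(0, D_i),
    with sampling variances D_i known and A given (in practice A is
    itself estimated, e.g., by REML)."""
    V_inv = 1.0 / (A + D)                             # diagonal of V^{-1}
    XtVX = X.T @ (V_inv[:, None] * X)
    beta = np.linalg.solve(XtVX, X.T @ (V_inv * y))   # GLS estimate of beta
    shrink = D / (A + D)                              # shrink noisy areas harder
    return shrink * (X @ beta) + (1.0 - shrink) * y

rng = np.random.default_rng(4)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=m)])
D = rng.uniform(0.5, 2.0, size=m)                     # known sampling variances
theta = X @ np.array([2.0, 1.0]) + rng.normal(scale=np.sqrt(0.3), size=m)
y = theta + rng.normal(scale=np.sqrt(D))              # direct survey estimates
est = fay_herriot_eblup(y, X, D, A=0.3)
print(np.mean((est - theta) ** 2), np.mean((y - theta) ** 2))
# shrinkage typically reduces risk relative to the direct estimates y
```

The EBLUP is a precision-weighted compromise between the direct estimate and the regression prediction, which is why the normality assumption on the unobserved v_i is both influential and hard to check.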


Multiresolution nonparametric Bayesian cluster detection and association testing for whole genome sequencing studies with applications in primary immune deficiency study

April 1, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Jyotishka Datta

Abstract

Rare variants play a critical role in explaining the genetic contribution to complex diseases by accounting for disease risk and trait variability previously unexplained by large GWAS. Detecting the association of rare variants with disease has proved challenging for existing methods, which merely compare variant frequencies in different datasets. There is a need for powerful new statistical methods that incorporate the spatial locations of variants, allow the incorporation of prior gene ontology information, scale to massive dimensions, and appropriately characterize uncertainty in inferences. We developed a multiresolution cluster detection method that uses a binary tree to recursively partition the chromosome and prune ‘uninteresting’ intervals in a top-down fashion. We then developed novel, scalable Bayesian nonparametric methods to draw inference from the resulting point process model. Compared to widely applied methods that simply assess variant frequency, these methods provide several key advantages, including robustness, adaptability to the underlying disease architecture, interpretability of clusters, and a biologically relevant segmentation of the genome. We applied these methods to 240 patients with primary immune deficiency and over 7,000 controls to identify patterns of genetic variation underlying the disease, and we identified novel mutations in HRNR that may be related to the regulation of BTK, a gene that is critical in signaling and B-cell development. Our approach performed excellently on whole genome and exome sequence data, showing fast, accurate detection of clusters and substantial gains in computing speed relative to existing approaches. Our methods are extensible across a large range of disease models and provide a number of advantages, including scalability, incorporation of important covariates, and adjustment for population stratification.

This is joint work with Anupama Reddy, David Dunson, and Sandeep Dave, all at Duke University.


On minimizing the sum of N convex functions

April 8, 2015, 1:15pm – 2:15pm
Room 150
Speaker: Minh Pham

Abstract

In penalized models, one solves an optimization problem for a function F(x) = f(x) + g(x). Many efficient methods exist for this setting, such as Nesterov's proximal gradient method and ADMM. In some applications, however, the objective function has more than two components, as in fMRI image reconstruction with both a total variation penalty and sparsity in the wavelet domain, or in support vector machines with variable selection. In theory, the two-component methods should still be applicable; in practice, they tend to suffer from implementation difficulties and slow convergence. In this talk, I will discuss an algorithm based on alternating linearization for minimizing the sum of an arbitrary finite number N of convex functions, give a brief convergence analysis, and present a small numerical example. The method is a generalization of Kiwiel, Rosa, and Ruszczynski (1999). It retains the simplicity of the ADMM method while offering fast convergence and descent of the objective function value.
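To convey the splitting flavor, here is a generic consensus proximal-splitting sketch for minimizing a sum of N convex functions, each touched only through its proximal operator. It is a simple relative of the approach, not the alternating linearization algorithm of the talk, and is demonstrated on quadratics whose minimizer is known in closed form.

```python
import numpy as np

def consensus_prox(prox_list, x0, rho=1.0, iters=200):
    """Consensus splitting for min_x sum_i f_i(x).  Each f_i enters only
    through prox(v, rho) = argmin_x f_i(x) + (rho/2) * ||x - v||^2."""
    N = len(prox_list)
    x = np.array(x0, dtype=float)
    u = [np.zeros_like(x) for _ in range(N)]          # scaled dual variables
    for _ in range(iters):
        z = [prox(x - u[i], rho) for i, prox in enumerate(prox_list)]
        x = np.mean([z[i] + u[i] for i in range(N)], axis=0)   # consensus step
        for i in range(N):
            u[i] += z[i] - x                          # dual update
    return x

# three quadratic pieces f_i(x) = 0.5 * ||x - a_i||^2; the minimizer of
# their sum is the mean of the a_i
a = [np.array([0.0]), np.array([3.0]), np.array([6.0])]
prox_list = [lambda v, rho, ai=ai: (rho * v + ai) / (rho + 1.0) for ai in a]
x_star = consensus_prox(prox_list, x0=[0.0])
print(x_star)   # converges to [3.0]
```

Each component is handled in isolation, which is what makes splitting schemes attractive when F has many structurally different penalty terms.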


Modeling Savanna Water Resource Dynamics with Stochastic Daily Rainfall

April 29, 2015, 1:15pm – 2:15pm
Room 203
Speaker: Christopher Strickland

Abstract

Modeling has become an essential part of understanding ecosystem dynamics, and within the savanna ecology community, models are a key tool for advancing theories about the determinants of savanna as an ecological state between forest and grassland. Many of these models adopt mean annual precipitation (MAP) as the primary variable for water resources, based on the observation of Sankaran et al. (2005) that MAP is highly correlated with an upper bound on maximum woody cover in African savannas and their subsequent hypothesis that MAP can be used to determine climatic savanna stability. In this talk, I will introduce a new theoretical water-resource model based on FLAMES, a process-based, demographic model developed by Liedloff and Cook (2007) for Australian savannas. Analysis of this model using stochastic, daily rainfall distributions suggests that the length and severity of the dry season are the true drivers behind climatic savanna equilibria, and the model also provides a method to predict the total basal area and structure of a stand given the local rainfall record. Finally, I will describe the effect of various fire regimes on the model, and the implications of these results for management.
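A standard way to build the stochastic daily rainfall driver is a two-state weather generator: a wet/dry Markov chain for rainfall occurrence with gamma-distributed wet-day amounts. The sketch below is offered only as a generic stand-in for the rainfall input to a FLAMES-style model; all transition probabilities and gamma parameters are invented for illustration.

```python
import numpy as np

def daily_rainfall(days, p_wd, p_ww, shape, scale, rng):
    """Two-state (wet/dry) Markov-chain rainfall generator.
    p_wd: P(wet today | dry yesterday); p_ww: P(wet | wet);
    wet-day amounts are Gamma(shape, scale), in mm."""
    wet = False
    rain = np.zeros(days)
    for t in range(days):
        wet = rng.random() < (p_ww if wet else p_wd)
        if wet:
            rain[t] = rng.gamma(shape, scale)
    return rain

rng = np.random.default_rng(5)
# a 180-day wet season followed by a 185-day dry season: under this scheme,
# dry-season length and severity are explicit knobs, unlike a single MAP value
wet_season = daily_rainfall(180, p_wd=0.4, p_ww=0.7, shape=0.8, scale=12.0, rng=rng)
dry_season = daily_rainfall(185, p_wd=0.05, p_ww=0.3, shape=0.8, scale=6.0, rng=rng)
year = np.concatenate([wet_season, dry_season])
print(year.sum())   # one realization of the annual total
```

Two rainfall regimes with identical MAP can differ sharply in dry-season length and severity under this generator, which is precisely the distinction the talk argues matters for savanna equilibria.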


On the estimation of the order of smoothness of the regression function

May 6, 2015, 1:30pm – 2:30pm
Room 203
Speaker: Daniel Taylor-Rodriguez

Abstract

The choice of the order of smoothness in nonparametric estimation problems is critical: it balances the tradeoff between model generality and overfitting the data. The most common approach in this context is cross-validation. However, cross-validation is computationally costly and precludes valid post-selection inference without further considerations. As an alternative, we take an objective Bayes approach that not only selects the appropriate order of smoothness but also simultaneously assesses the uncertainty in that selection. The proposed methods are automatic, in the sense that noninformative priors are used on the model parameters so that no user input is required; they are computationally inexpensive and can be extended to the case of multiple predictors. We explore this problem in greater generality, presenting comparative analyses using both simulated and real data.
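As a crude stand-in for the objective-Bayes selection described above, BIC, which approximates a Bayes factor under default priors, can already select a polynomial order automatically and cheaply; unlike the talk's approach, however, it gives a point selection with no uncertainty assessment.

```python
import numpy as np

def select_order(x, y, max_order=8):
    """Select a polynomial order by BIC.  BIC approximates -2 log of a
    Bayes factor under default priors, so this is a (much cruder)
    cousin of an objective-Bayes order selection."""
    n = len(y)
    bics = []
    for k in range(max_order + 1):
        coef = np.polyfit(x, y, k)
        rss = float(np.sum((y - np.polyval(coef, x)) ** 2))
        bics.append(n * np.log(rss / n) + (k + 1) * np.log(n))
    return int(np.argmin(bics))

rng = np.random.default_rng(6)
x = np.linspace(-1.0, 1.0, 300)
y = 1.0 + 2.0 * x - 3.0 * x ** 3 + rng.normal(scale=0.2, size=300)
print(select_order(x, y))   # typically recovers the cubic truth
```

Repeating this over the noise draws shows why a full posterior over the order is attractive: near the boundary between orders, the selected value flips from sample to sample, and a point selection hides that instability.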


Assessing Bark Beetle Outbreaks: An Imputation Approach

May 13, 2015, 1:30pm – 2:30pm
Room 203
Speaker: Kimberly Kaufeld

Abstract

The forests of the western region of the United States have changed dramatically over the last ten years due to the increase in bark beetle damage. Predicting the occurrence of bark beetle outbreaks is a large concern for the forest service, as targeting the areas where beetles spread helps direct resources toward mitigating outbreaks. In Colorado, data are collected each year using an aerial detection survey (ADS) to assess the amount of damage from bark beetle outbreaks. We use the ADS data from 2000-2010 to both understand and predict the occurrence of bark beetle outbreaks. A spatio-temporal beta regression model is constructed for the percent of damage in a particular region. The spatial dependence is captured using a sparse conditional autoregressive model, and a dynamic linear model captures the influence of the amount of bark beetle damage that occurred in the previous year. We compare a binary spatio-temporal model and a zero-augmented spatio-temporal model to our imputed spatio-temporal beta model, in which areas originally recorded as undamaged are treated as detection errors and imputed. We find that the imputed model provides better estimates in the north central region of the Rocky Mountains.
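The core likelihood piece of such a model is a beta regression with a logit link for the percent damage; in the full model the spatial CAR and dynamic terms would enter through the linear predictor. The following is a minimal sketch with invented parameter values, not the fitted Colorado model.

```python
import numpy as np
from math import lgamma

def beta_reg_loglik(y, X, coef, phi):
    """Log-likelihood of a beta regression in the mean-precision
    parameterization with logit link:
        mu = logistic(X @ coef),  y ~ Beta(mu * phi, (1 - mu) * phi)."""
    mu = 1.0 / (1.0 + np.exp(-(X @ coef)))
    a, b = mu * phi, (1.0 - mu) * phi
    lg = np.vectorize(lgamma)
    return float(np.sum(lg(a + b) - lg(a) - lg(b)
                        + (a - 1.0) * np.log(y) + (b - 1.0) * np.log1p(-y)))

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
coef_true = np.array([-1.0, 0.5])
mu = 1.0 / (1.0 + np.exp(-(X @ coef_true)))
y = rng.beta(mu * 20.0, (1.0 - mu) * 20.0)              # percent damage in (0, 1)
ll_true = beta_reg_loglik(y, X, coef_true, 20.0)
ll_null = beta_reg_loglik(y, X, np.array([0.0, 0.0]), 20.0)
print(ll_true > ll_null)   # the generating coefficients fit better: True
```

Because the beta likelihood puts no mass at exactly 0, recorded zeros must be handled separately, which motivates both the zero-augmented competitor and the imputation of suspected detection errors described above.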