Multiscale regression and learning
August 27, 2014, 1:15pm – 2:15pm
Speaker: Wenjing Liao
I will present the construction and analysis of a multiscale estimator for the regression problem. The estimator consists of a least-squares fit of piecewise constant functions on a partition generated by a splitting procedure. Moreover, the estimator is adaptive and does not depend on any a priori assumptions about the regression function to be estimated. My presentation is based on the paper “Universal algorithms for learning theory Part I: piecewise constant functions” by Binev, Cohen, Dahmen, DeVore and Temlyakov.
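The splitting idea can be illustrated with a toy one-dimensional sketch (an informal rendering of the approach, not the paper's algorithm: the dyadic midpoint splits, the gain threshold `tau`, and the depth cap are assumed simplifications):

```python
# Adaptive piecewise-constant regression sketch: split a dyadic cell of
# [0, 1] whenever doing so reduces the local least-squares error by more
# than a threshold tau, then fit the sample mean on each leaf.

def local_error(ys):
    """Least-squares error of fitting a constant (the mean) to ys."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def adaptive_partition(xs, ys, lo=0.0, hi=1.0, tau=0.5, depth=0, max_depth=8):
    """Return a list of (lo, hi, mean) leaves covering [0, 1)."""
    pts = [(x, y) for x, y in zip(xs, ys) if lo <= x < hi]
    ys_here = [y for _, y in pts]
    if not ys_here:
        return [(lo, hi, 0.0)]
    mid = (lo + hi) / 2
    left = [y for x, y in pts if x < mid]
    right = [y for x, y in pts if x >= mid]
    gain = local_error(ys_here) - local_error(left) - local_error(right)
    if depth < max_depth and gain > tau:
        return (adaptive_partition(xs, ys, lo, mid, tau, depth + 1, max_depth)
                + adaptive_partition(xs, ys, mid, hi, tau, depth + 1, max_depth))
    return [(lo, hi, sum(ys_here) / len(ys_here))]

def predict(leaves, x):
    """Evaluate the piecewise-constant estimator at x."""
    for lo, hi, m in leaves:
        if lo <= x < hi:
            return m
    return leaves[-1][2]
```

On data from a step function, the procedure splits once at the jump and returns the two cell means, which conveys the adaptive character of the partition.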
Expandable Factor Analysis
September 3, 2014, 1:15pm – 2:15pm
Speaker: Sanvesh Srivastava
Bayesian sparse factor models have proven useful for characterizing dependencies in high-dimensional data. However, lack of computational scalability to high dimensions (P) with unknown numbers of factors (K) remains a vexing issue. We propose a framework for expandable factor analysis (xFA), where expandable refers to the ability to scale to high-dimensional settings by adaptively adding additional factors as needed. Key to this behavior is the use of a novel multiscale generalized double Pareto (mGDP) prior for the loadings matrix. The mGDP prior is carefully structured to induce sparsity in the loadings, allow an unknown number of factors, and produce an objective function for maximum a posteriori estimation that factorizes to yield P separate weighted l1-regularized regressions. Model averaging is used to remove sensitivity to the form of the mGDP prior and the number of factors. We provide theoretical support, develop efficient computational algorithms, and evaluate xFA’s performance using simulated data and genomic applications. Computational efficiency is further improved via one-step estimation. Joint work with David Dunson (Duke University) and Barbara Engelhardt (Princeton University). arXiv link: http://arxiv.org/abs/1407.1158
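Each of the P separate weighted l1-regularized regressions can be solved with a small coordinate-descent routine; the sketch below is a generic weighted lasso (not the xFA code, and the penalty weights `w` are a stand-in for whatever the mGDP prior induces):

```python
# Weighted lasso by coordinate descent with soft-thresholding: minimize
# 0.5 * ||y - X b||^2 + sum_j w[j] * |b[j]|, each coefficient carrying
# its own penalty weight w[j].

def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(X, y, w, n_iter=200):
    """X: list of rows, y: responses, w: per-coefficient penalty weights."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding coordinate j
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm2 = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, w[j]) / norm2 if norm2 > 0 else 0.0
    return b
```

With an orthogonal design the solution reduces to soft-thresholding each coordinate, which makes the sparsity-inducing effect of the weights easy to see.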
Professional Development Lunch
September 17, 2014, 12:00pm – 1:30pm
Speaker: Robert Rodriguez
Professional development lunch with Robert Rodriguez of SAS Institute.
WTM maps for complex contagion on noisy geometric networks
September 24, 2014, 1:15pm – 2:15pm
Speaker: Dane Taylor
Social and biological contagions are often strongly influenced by the spatial embedding of networks. In some cases (e.g., Black Death), contagions spread as a wave through space. In many modern contagions, however, long-range edges (e.g., due to airline transportation or communication media) allow clusters of a contagion to arise in distant locations. We study these competing phenomena for the Watts threshold model (WTM) of complex contagions on empirical and synthetic noisy geometric networks, which are networks that are spatially embedded on a manifold and consist of both short-range and long-range edges. Our approach involves constructing WTM maps that use multiple contagions to map the nodes as a point cloud, which we analyze using tools from topological data analysis and homology. Importantly, for contagions predominantly exhibiting wavefront propagation, we often identify a noisy geometric network’s underlying manifold in the point cloud, highlighting our approach also as a tool for inferring low-dimensional (e.g., manifold) structure in networks. Our work thereby finds a deep connection between the fields of dynamical systems and nonlinear dimension reduction.
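The WTM-map construction can be sketched in a few lines. The code below is an illustration, not the authors' implementation; the threshold `T` and the cap assigned to unreached nodes are assumed choices. A node activates once the fraction of its active neighbors reaches `T`, and node i is mapped to the vector of its activation times across all seeds:

```python
# Watts threshold model (WTM) map sketch: run a deterministic threshold
# contagion from each seed node and use activation times as coordinates.

def wtm_activation_times(adj, seed, T=0.3, max_steps=100):
    """adj: adjacency lists. Returns each node's activation time from `seed`."""
    n = len(adj)
    active = [False] * n
    times = [max_steps] * n          # unreached nodes keep the cap
    active[seed] = True
    times[seed] = 0
    for t in range(1, max_steps):
        newly = []
        for i in range(n):
            if active[i] or not adj[i]:
                continue
            frac = sum(active[j] for j in adj[i]) / len(adj[i])
            if frac >= T:             # threshold rule of the WTM
                newly.append(i)
        if not newly:
            break
        for i in newly:
            active[i] = True
            times[i] = t
    return times

def wtm_map(adj, T=0.3):
    """Row i of the output is node i's activation-time coordinates across seeds."""
    cols = [wtm_activation_times(adj, s, T) for s in range(len(adj))]
    return [[cols[s][i] for s in range(len(adj))] for i in range(len(adj))]
```

On a ring network the activation times equal graph distances, so the point cloud inherits the ring's circular geometry, which is the manifold-recovery phenomenon described above in miniature.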
Large Scale Simultaneous Testing with Continuous or Discrete Data
October 1, 2014, 1:15pm – 2:15pm
Speaker: Jyotishka Datta
Thanks to advances in next-generation sequencing and similar technologies in other fields, large-scale applications involving multiple testing have become commonplace: one simultaneously tests for a small proportion of true signals in the presence of a large number of noise observations. In the first half of the talk, we discuss recent advances in Bayesian theory and methodology for the multiple testing problem when the test statistics are continuous and can be suitably transformed to have a normal distribution (based on ongoing work). In the second half of the talk, motivated by recent collaborative research at SAMSI within the bioinformatics program, we discuss high-dimensional testing problems with discrete-valued observations. Sequencing technologies such as RNA-seq or DNase-seq routinely produce count data as observations, and there is a growing need for new testing procedures that handle discrete data and scale efficiently to a huge number of locations. We will present an update on a few existing approaches to this problem and their relative merits and demerits.
1. Datta, J. and Ghosh, J. K. Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1):111–131, 2013. ISSN 1936-0975.
2. Shim, H. and Stephens, M. (2013). Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. arXiv preprint arXiv:1307.7203.
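The talk contrasts Bayesian procedures with standard frequentist corrections; for reference, a minimal sketch of the Benjamini-Hochberg step-up procedure, the usual baseline for large-scale simultaneous testing:

```python
# Benjamini-Hochberg procedure: sort p-values, find the largest rank k
# with p_(k) <= alpha * k / m, and reject the k smallest p-values.

def benjamini_hochberg(pvals, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank                  # step-up: keep the largest passing rank
    return sorted(order[:k])
```

Unlike the Bayesian methods in the talk, this procedure uses only the p-values themselves and makes no attempt to model the proportion of true signals.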
Hierarchical Feature Selection for Complex Biomedical Data
October 8, 2014, 1:15pm – 2:15pm
Speaker: Yize Zhao
Recent advances in biomedical technologies enable scientists to produce complex data that yield more valuable insights into biomedical research. This brings new challenges in developing efficient statistical methods for extracting important features from such data while accommodating complex data structure and integrating relevant scientific findings. In the first part of the talk, motivated by a prostate cancer study, we discuss hierarchical feature selection of mRNA and miRNA biomarkers in relation to disease risk. In the selection procedure, both known biological information on pathway membership and a novel network capturing miRNA-gene regulation are incorporated to facilitate more biologically meaningful results. Motivated by the hierarchical feature selection idea, in the second part of the talk, we discuss an ongoing SAMSI collaborative project on genome-wide association studies (GWAS). Under a Bayesian approach, we propose a sequential sampling scheme that captures the structural information to hierarchically select genetic features. In current simulation studies, the proposed approach dramatically improves selection accuracy and computational efficiency.
A general nonparametric method for correcting the allele frequency spectrum for misidentified ancestral states
October 15, 2014, 1:15pm – 2:15pm
Speaker: Chris Nasrallah
The allele frequency spectrum is an effective tool for the detection of positive selection, but its power depends on being able to correctly determine which alleles are derived and which are ancestral. Misidentification of the derived state mimics the effect of positive selection, leading to frequent false positives. The problem is compounded when the variants are indels, because misidentification means classifying an insertion as a deletion (and vice versa). The ancestral allele is often determined by estimating the allele carried by the common ancestor of the lineage of interest and an outgroup species, and using this as a proxy for the true ancestral allele. This is problematic, even when a probabilistic model is used, due to multiple mutations on the lineage of interest. Requiring consistency with additional outgroup species does not fully resolve the misidentification and results in much of the data being discarded. Here we present a straightforward nonparametric method for correcting the observed allele frequency spectrum for misidentified ancestral states. The method uses the frequency of polymorphic and fixed site patterns to estimate the ancestral state probabilities and treats the observed allele frequency spectrum as a mixture of correctly identified and misidentified sites. The method makes no assumptions about the presence or absence of natural selection, the model of evolution, or the exact phylogenetic tree relating the species. Additionally, the method can be used for both SNPs and indels and with any group of species.
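The mixture idea can be made concrete in a simplified setting. The sketch below assumes a single known misidentification probability `e` (the talk's method instead estimates the ancestral-state probabilities from polymorphic and fixed site patterns): with n sampled chromosomes, the observed spectrum satisfies O_i = (1 - e) T_i + e T_{n-i}, so each complementary pair (i, n - i) is a 2x2 linear system in the true counts.

```python
# Correct an observed allele frequency spectrum for ancestral-state
# misidentification, given a known flip probability e.

def correct_spectrum(obs, e):
    """obs[k] is the count of sites at derived-allele frequency k + 1
    (frequencies 1 .. n-1); returns the corrected spectrum, same layout."""
    m = len(obs)                      # m = n - 1 frequency classes
    true = [0.0] * m
    det = (1 - e) ** 2 - e ** 2       # determinant of each 2x2 system
    for k in range(m):
        j = m - 1 - k                 # index of the complementary class n - i
        if k == j:                    # i = n/2 is its own complement
            true[k] = obs[k]
        else:
            true[k] = ((1 - e) * obs[k] - e * obs[j]) / det
    return true
```

Misidentification inflates the high-frequency classes at the expense of the low-frequency ones, which is exactly the signature that mimics positive selection; the inversion undoes it.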
A method for site-occupancy model estimation and objective Bayesian selection
October 22, 2014, 1:15pm – 2:15pm
Speaker: Daniel Taylor-Rodriguez
Using presence-absence data, site-occupancy models are used to estimate the proportion of area occupied by a biological species. In surveys, observed zeroes can occur because the species of interest is truly absent from a site, or because it was present but remained undetected. By surveying each site repeatedly, occupancy models resolve the ambiguity in an observed zero (non-detection), separating detection and occupancy probabilities. In spite of the popularity of these models in the ecological literature, variable selection in this context is mostly limited to AIC, which requires enumerating and fitting every model in the model space. Bayesian alternatives are available, but these rely either on parameter priors that require substantial prior knowledge (commonly unavailable when the number of parameters is large) or on priors that, in spite of attempting to be “objective”, are not suitable for model comparison. First, we present a formulation of the occupancy model with probit links and use data augmentation to make the parameter posterior distributions tractable. This specification suggests formulating the Bayesian selection procedure in terms of the data-augmented variables, conveniently helping derive “objective” intrinsic priors for this problem. Additionally, to make the algorithm feasible for large model spaces, we propose a fast stochastic search strategy. Finally, we will discuss preliminary ideas on using Bernstein polynomials to model the detection and occupancy probabilities with greater flexibility, and on extending this approach to a multi-species occupancy model.
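The ambiguity in an observed zero can be made concrete with the basic site-occupancy likelihood (the standard textbook form, not the talk's probit formulation; the crude grid-search MLE is an illustration, not a recommended estimator):

```python
import math

def occupancy_loglik(psi, p, histories):
    """Log-likelihood of repeat-survey data. psi: occupancy probability,
    p: per-visit detection probability, histories: list of per-site
    (detections d, visits J). An all-zero site mixes true absence with
    presence-but-nondetection."""
    ll = 0.0
    for d, J in histories:
        if d > 0:
            ll += math.log(psi * math.comb(J, d) * p**d * (1 - p)**(J - d))
        else:
            ll += math.log((1 - psi) + psi * (1 - p)**J)
    return ll

def grid_mle(histories, grid=50):
    """Crude grid search for the maximum-likelihood (psi, p)."""
    best = (0.5, 0.5, -float("inf"))
    for i in range(1, grid):
        for j in range(1, grid):
            psi, p = i / grid, j / grid
            ll = occupancy_loglik(psi, p, histories)
            if ll > best[2]:
                best = (psi, p, ll)
    return best[0], best[1]
```

The mixture term for all-zero histories is what allows the repeat visits to separate psi from p; with a single visit per site the two parameters would be confounded.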
A spatio-temporal model for mountain pine beetle damage
October 29, 2014, 1:15pm – 2:15pm
Speaker: Kimberly Kaufeld
Species distribution models combine observations of species occurrence (presence/absence) with environmental estimates to gain ecological insight and predict species across landscapes. Modeling the presence/absence of damage (i.e., tree mortality from mountain pine beetles) over a region is analogous to a species distribution model where the probability of damage is equivalent to the probability of species occurrence in a region. However, aggregating a region into a binary response (damaged or not damaged) may result in a loss of information. That is, classifying a region as damaged treats the whole region as damaged when in reality only a proportion of the region is impacted. To better capture the nature of the data, we use proportions of damaged area rather than binary data. In the first part of the talk, mountain pine beetle data are used to formulate a spatio-temporal damage model with a stick-breaking representation that accounts for the proportion of cumulative damage accrued over time. Additionally, we demonstrate the utility of predicting beetle damage by highlighting areas with a higher probability of outbreak in future time periods. In the second part of the talk, motivated by the SAMSI working group, we discuss a joint species distribution model that accounts for the joint occurrence of multiple types of damage.
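The stick-breaking device for cumulative damage can be shown in miniature (an illustration of the representation only, not the full spatio-temporal model): let v[t] be the conditional fraction of still-undamaged area that becomes damaged in period t, so the per-period increments always sum to at most 1.

```python
# Stick-breaking for cumulative damage proportions: each period "breaks
# off" a fraction v[t] of the remaining undamaged area.

def damage_path(v):
    """Return (per-period damage increments, cumulative damage by period)."""
    remaining = 1.0
    inc, cum = [], []
    for vt in v:
        d = vt * remaining            # fraction vt of what is left
        inc.append(d)
        remaining -= d
        cum.append(1.0 - remaining)
    return inc, cum
```

Because the increments are constructed from what remains undamaged, the cumulative proportion is monotone and bounded by 1 by construction, which is the point of the representation.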
Professional Development Lunch
November 5, 2014, 11:45am – 1:00pm
Speaker: Nisha Cavanaugh
Professional development lunch with Dr. Nisha Cavanaugh, Director of the NC State Office of Postdoctoral Affairs.
(Postdoc Seminar will follow lunch).
Modeling Non-Local Invasive Spread on Heterogeneous Domains with a Vector-Based Transportation Network
November 5, 2014, 1:15pm – 2:15pm
Speaker: Christopher Strickland
Biological invasions represent an interesting challenge to model mathematically. Landscape heterogeneity, non-local spreading mechanisms, and the presence of long-distance transportation connections are but a few of the complications that can greatly affect invasive spread. In this talk, I will begin by discussing diffusion from a distributional point of view, which will then motivate the idea of non-local diffusion along with non-local growth/spread equations. I will then introduce a generalization of Mollison’s stochastic contact birth process (J R Stat Soc 39(3):283, 1977) which is robust to heterogeneity in the landscape. By interpreting the quantity of interest as species occurrence probability rather than population size, I will describe how this process may also be studied in a presence-absence context.
Next, I will introduce a method for including the effects of a disease-vector transportation network in the model. Given a strongly connected, directed graph of transportation rates, we assume that carriers (e.g. cars, trucks, hikers) can unwittingly transport a biological invader to distant sites without the possibility of cross infection. Following a possible latent stage, the invader then begins to establish in the new location, spreading outward into the surrounding domain. Analysis of the network portion of the model yields a unique stable steady-state solution, together with an optimal discrete control of the infected vectors.
As the Physical Ecology working group considers the problems of long range spider dispersal, epidemic disease spread via airline networks, and invasive spread in marine environments via shipping lanes, the different models described in this talk will provide mathematical context and possible modeling approaches for answering some of these ecological questions.
High-dimensional joint Bayesian variable and covariance selection: Applications in eQTL analysis and cancer genomics
November 12, 2014, 1:15pm – 2:15pm
Speaker: Anindya Bhadra (Visiting assistant professor from the Department of Statistics, Purdue University)
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike-and-slab priors, offers a computationally feasible and efficient solution. We demonstrate our method on an eQTL data set (SNPs as predictors and mRNA as responses) and on a glioblastoma data set (microRNA and copy number aberration as predictors and mRNA as responses). If time permits, we will also describe ongoing work on generalizations to non-linear, non-Gaussian models.
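The marginalization at the heart of such samplers can be illustrated in the simplest possible case: one predictor, known noise variance, a point-mass spike at zero versus a normal slab. This is a toy version of the collapsed computation (the talk's sampler handles matrices of predictors and correlated responses); the closed-form Bayes factor below follows from the standard conjugate normal marginal likelihood.

```python
import math

def inclusion_prob(x, y, sigma2=1.0, tau2=1.0, prior=0.5):
    """Posterior probability that the single predictor x enters the model,
    comparing the marginal likelihood of y under the spike (beta = 0)
    against the slab (beta ~ N(0, tau2)), with known noise variance sigma2."""
    s = sum(xi * xi for xi in x)                  # x'x
    z = sum(xi * yi for xi, yi in zip(x, y))      # x'y
    # log Bayes factor slab vs spike, from the Sherman-Morrison /
    # determinant-lemma form of the N(0, sigma2 I + tau2 x x') marginal
    log_bf = (-0.5 * math.log(1 + tau2 * s / sigma2)
              + 0.5 * tau2 * z * z / (sigma2 * (sigma2 + tau2 * s)))
    bf = math.exp(log_bf)
    return prior * bf / (prior * bf + 1 - prior)
```

Because beta is integrated out analytically, a Gibbs sampler built on this quantity only ever samples the binary inclusion indicators, which is what makes the collapsed approach feasible in high dimensions.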
Nonparametric Bayesian Method for Complex Surveys with Limited Design Information
November 19, 2014, 1:15pm – 2:15pm
Speaker: Neung Soo Ha
Survey weights are arguably the most important attributes for making design-based inferences from a complex survey. They are derived from the survey's sampling mechanism and post-survey adjustment process. Data users commonly use the weights attached to the survey without complete information on either the sampling design or the weighting adjustment process. Current Bayesian methods for survey analysis typically ignore survey weights, assuming instead that all weighting information can be captured by modeling. In this presentation, we explore a hierarchical Bayesian approach in which we model the distribution of the weights for the nonsampled units in the population and include them as predictors in a Gaussian process regression model. For simplicity, we consider only a univariate survey response. We apply our method to the National Survey of Recent College Graduates to estimate average salaries for different demographic groups.
Subspace clustering with linear transformation
December 3, 2014, 1:15pm – 2:15pm
Speaker: Minh Pham
Subspace clustering deals with large collections of high-dimensional data, such as images and videos. The goal is to segment data points into their intrinsic linear subspaces. The problem attracts interest from the computer vision and data mining communities. In this talk, I will discuss two methods, Sparse Subspace Clustering (SSC) from Elhamifar and Vidal (2012) and Low-Rank Representation from Liu et al. (2012), together with a possible extension using linear transformations from Sapiro and Qiu (2014).
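The core SSC idea can be sketched in pure Python: express each point as a sparse combination of the other points (here via ISTA iterations for the lasso), then cluster the resulting affinity graph. Real SSC uses a proper lasso solver and spectral clustering; thresholded connected components stand in for the latter here, and `lam`, `step`, and `thresh` are assumed tuning choices, not values from the papers.

```python
# Sparse Subspace Clustering sketch: points in the same subspace tend to
# select each other in their sparse self-representations.

def sparse_codes(points, lam=0.1, step=0.01, n_iter=500):
    """C[i][j]: sparse coefficient of point j in the representation of i."""
    n, d = len(points), len(points[0])
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        c = C[i]
        for _ in range(n_iter):
            # residual of the current representation, r = x_i - sum_j c_j x_j
            r = [points[i][k] - sum(c[j] * points[j][k] for j in range(n))
                 for k in range(d)]
            for j in range(n):
                if j == i:            # a point may not represent itself
                    continue
                z = c[j] + step * sum(points[j][k] * r[k] for k in range(d))
                c[j] = (z - step * lam if z > step * lam
                        else z + step * lam if z < -step * lam else 0.0)
    return C

def cluster(C, thresh=1e-3):
    """Label connected components of the symmetrized affinity |C| + |C|^T."""
    n = len(C)
    adj = [[abs(C[i][j]) + abs(C[j][i]) > thresh for j in range(n)]
           for i in range(n)]
    labels, next_label = [-1] * n, 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        while stack:
            u = stack.pop()
            if labels[u] >= 0:
                continue
            labels[u] = next_label
            stack.extend(v for v in range(n) if adj[u][v] and labels[v] < 0)
        next_label += 1
    return labels
```

For points lying on two orthogonal lines, the cross-subspace coefficients stay exactly zero, so the affinity graph splits cleanly into the two subspaces.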
Bayesian Data Editing and Imputation for Continuous Microdata
December 10, 2014, 1:15pm – 2:15pm
Speaker: Hang Kim
Many government agencies and statistical organizations collect data that are expected to satisfy linear constraints; for example, component variables should sum to total variables, and ratios of pairs of variables should be bounded by expert-specified constants. When reported data violate constraints, organizations identify and replace values potentially in error in a process known as edit-imputation. To date, most approaches separate the error localization and imputation steps, typically using optimization methods to identify the variables to change, followed by hot deck imputation. We present an approach that fully integrates editing and imputation for continuous microdata under linear constraints. Our approach relies on a Bayesian hierarchical model that includes a flexible joint probability model for the underlying true values of the data, with support bounded by the editing constraints, and a stochastic error localization model that suggests possible locations of errors in the reported data. The stochastic error localization allows relationships in the data to inform the error localization and imputation processes while fully reflecting uncertainty. The approach also avoids the computationally difficult exercise of identifying feasible regions from a set of linear constraints. We illustrate the potential advantages of the Bayesian editing approach over existing approaches using an application to the 2007 U.S. Census of Manufactures.
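A toy version of the linear edit rules described above (not the paper's Bayesian model; the variable names and bounds are hypothetical examples): components should sum to a total, and a ratio of two variables should stay within expert-specified bounds.

```python
# Deterministic edit checking: flag records violating balance or ratio
# edits, the precursor step to error localization and imputation.

def check_record(rec, components, total, ratio=None, tol=1e-9):
    """rec: dict of variable -> value. components: names summing to `total`.
    ratio: optional (numerator, denominator, lo, hi) bound.
    Returns the list of violated edit rules."""
    violations = []
    if abs(sum(rec[c] for c in components) - rec[total]) > tol:
        violations.append("balance: %s != %s" % (total, " + ".join(components)))
    if ratio is not None:
        num, den, lo, hi = ratio
        if rec[den] == 0 or not (lo <= rec[num] / rec[den] <= hi):
            violations.append("ratio: %s / %s outside [%s, %s]"
                              % (num, den, lo, hi))
    return violations
```

In the integrated Bayesian approach described in the talk, such rules define the bounded support of the model for the true values rather than a separate pre-processing pass.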