## Professional Development Lunch

**September 4, 2013, 12:00pm – 1:30pm**

SAMSI Commons

*K-mappings and regression trees*

**September 18, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Yi Grace Wang

### Abstract

We describe a method for learning a piecewise affine approximation to a mapping f : R^d → R^p, given a labeled training set of examples {x1, …, xn} = X ⊂ R^d and targets {y1 = f(x1), …, yn = f(xn)} = Y ⊂ R^p. The method first trains a binary subdivision tree that splits X across hyperplanes corresponding to high-variance directions in Y. A fixed number K of affine regressors of rank q are then trained via a K-means-like iterative algorithm, in which each leaf votes for its best-fit mapping and each mapping is updated as the best fit for the collection of leaves that chose it.
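The leaf-voting alternation can be sketched as follows. This is a toy reading of the abstract, not the authors' implementation: leaf construction by the subdivision tree is assumed done, the rank-q constraint is dropped in favor of plain least-squares affine fits, and all names (`fit_affine`, `kmappings`) are illustrative.

```python
import numpy as np

def fit_affine(X, Y):
    """Least-squares affine map x -> A x + b from X (n, d) to Y (n, p)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # augment with a constant column
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)  # (d+1, p) stacked [A; b]
    return W

def kmappings(leaves, K, n_iter=20, seed=0):
    """leaves: list of (X_leaf, Y_leaf) pairs produced by the subdivision tree."""
    rng = np.random.default_rng(seed)
    # initialize each of the K mappings on a randomly chosen leaf
    maps = [fit_affine(*leaves[i]) for i in rng.choice(len(leaves), K, replace=False)]
    votes = [0] * len(leaves)
    for _ in range(n_iter):
        # each leaf votes for its best-fit mapping
        votes = []
        for X, Y in leaves:
            Xa = np.hstack([X, np.ones((len(X), 1))])
            errs = [np.sum((Xa @ W - Y) ** 2) for W in maps]
            votes.append(int(np.argmin(errs)))
        # each mapping is refit on the collection of leaves that chose it
        for k in range(K):
            chosen = [leaves[i] for i, v in enumerate(votes) if v == k]
            if chosen:
                Xk = np.vstack([X for X, _ in chosen])
                Yk = np.vstack([Y for _, Y in chosen])
                maps[k] = fit_affine(Xk, Yk)
    return maps, votes
```

On data that is exactly piecewise affine with leaves that respect the pieces, the alternation recovers one regressor per piece.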

*Semi-supervised eigenvectors and high-dimensional regression*

**September 25, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: David Lawlor

### Abstract

In the past decade much attention has been paid to examining the connectivity of data sets as a means to extract low-dimensional structure from nominally high-dimensional data. This is usually accomplished by computing the leading eigenvectors of the Laplacian of the data connectivity graph, which are inherently global quantities taking into account the interactions among all data points. In data sets containing multiple subpopulations this may be disadvantageous, and a more local approach may be appropriate. In this talk we discuss such an approach, the recent semi-supervised eigenvectors of Mahoney et al., and their potential application to regression in high dimensions. A motivating example using astronomical spectra will also be discussed.
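For reference, the global construction the talk contrasts with — the leading nontrivial eigenvectors of a normalized graph Laplacian — can be sketched as below. This is the standard spectral embedding, not the semi-supervised variant; the function name is illustrative.

```python
import numpy as np

def laplacian_embedding(W, k):
    """Leading k nontrivial eigenvectors of the normalized graph Laplacian.

    W: symmetric nonnegative affinity matrix (n, n); k: embedding dimension.
    """
    d = W.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - Dinv_sqrt @ W @ Dinv_sqrt  # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    # skip the trivial eigenvector at eigenvalue ~0; keep the next k
    return vecs[:, 1:k + 1]
```

On a graph with two weakly connected clusters, the first coordinate of this embedding separates the clusters by sign — the global behavior that motivates more local alternatives.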

*Characterizing and predicting trajectories of evolving networks*

**October 2, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Dane Taylor

### Abstract

Networks are ubiquitous in nature and technology, where examples range from social interactions to critical infrastructures to algorithms processing high-dimensional data. A central topic for many applications is the study of evolving networks, which includes both how to observe evolution as well as how to predict it. Although investigations of evolving networks have been growing in recent years, little work has adopted a dynamical systems perspective in which the evolving network can be thought of as a trajectory in a potentially high-dimensional space.

The main hindrance in such a pursuit has been choosing a suitable metric space in which to embed the networks. I will discuss a few candidate metric spaces and present some initial findings regarding a toy model of an evolving network undergoing a limit cycle.

*Small Area Estimation Methods for Binary Variables in the Behavioral Risk Factor Surveillance System*

**October 23, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Neung Soo Ha

### Abstract

Large government-administered surveys are designed to provide reliable estimates of finite population characteristics for large geographical regions such as the entire U.S. or each of the 50 states, but not for subpopulations and small geographical regions. In this talk, we use the 2010 Behavioral Risk Factor Surveillance System (BRFSS) to make inference at the county level for various population characteristics, such as totals, proportions, or quantiles, for health characteristics like the health insurance rate or the obesity rate. For our analysis, we use a Bayesian hierarchical model to combine individual survey data and area-level auxiliary covariates. Our preliminary data analysis shows that the use of parametric hierarchical methods may need to be extended to provide a satisfactory analysis.
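The borrowing-of-strength idea behind such hierarchical models can be illustrated with a toy area-level normal–normal shrinkage estimator. This is an empirical-Bayes caricature, not the Bayesian hierarchical model of the talk; `tau2` stands in for the between-area variance, and all names are illustrative.

```python
import numpy as np

def area_shrinkage(y_area, var_area, tau2):
    """Shrink direct small-area estimates toward a precision-weighted grand mean.

    y_area: direct survey estimates per area; var_area: their sampling variances;
    tau2: assumed between-area variance.
    """
    # grand mean weighted by total precision of each area
    mu = np.average(y_area, weights=1.0 / (var_area + tau2))
    # per-area shrinkage weight: noisy areas (large var_area) shrink more
    B = tau2 / (tau2 + var_area)
    return B * y_area + (1 - B) * mu
```

When `tau2 = 0` every area estimate collapses to the grand mean; when `tau2` is large the direct estimates are left untouched — the two extremes a hierarchical model interpolates between.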

*High-resolution imaging reconstruction from noisy Radon data*

**October 30, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Wenjing Liao

### Abstract

Positron Emission Tomography (PET) and Single Photon Emission Computerized Tomography (SPECT) are among the most widely used diagnostic imaging methods. The forward process is modeled by the Radon transform, and the noise is correctly modeled by a Poisson process. Our goal is to develop an algorithm providing high-resolution reconstructions from Radon data corrupted by Poisson noise. In the talk I will present the state of the art and our proposed approach.
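The Poisson forward model can be illustrated with a toy discrete analogue, in which projections at just two angles (row and column sums) stand in for the full Radon transform. This is only a caricature of the measurement process, not the talk's reconstruction method; the names and the `dose` parameter are illustrative.

```python
import numpy as np

def toy_radon(img):
    """Toy discrete 'Radon data': parallel projections at 0 and 90 degrees."""
    return np.concatenate([img.sum(axis=0), img.sum(axis=1)])

def noisy_projections(img, dose=1.0, seed=0):
    """Poisson-corrupted projections; photon counts scale with the dose."""
    rng = np.random.default_rng(seed)
    clean = toy_radon(img) * dose
    return rng.poisson(clean) / dose
```

At low dose the relative noise is large (Poisson variance equals the mean), which is exactly the regime that makes high-resolution reconstruction hard; at very high dose the noisy projections approach the clean ones.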

*Logistic regression with fused lasso penalty, with an extension to non-convex penalties*

**November 6, 2013, 12:50pm – 2:15pm**

Room 150

Speaker: Minh Pham

### Abstract

Array-based comparative genomic hybridization (arrayCGH) is a popular tool for identifying DNA copy number variations along the genome, which can then be used as markers for prognosis and diagnosis of cancer. These CGH data sets fall into the p >> n scenario, where it is possible to use techniques such as the Lasso. However, Rapaport (2008) suggested that, since arrayCGH profiles have particular structures of correlation between variables, the Fused Lasso penalty can be used to enhance the performance of classification methods. In this seminar, I will discuss a proximal point method for solving this structured penalty problem and extend it to non-convex penalties. Experiments with simulated data sets and arrayCGH data sets have shown that structured non-convex penalties can outperform the Fused Lasso in the context of logistic regression.
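The objective being minimized — logistic loss plus a sparsity term and a fusion term on successive coefficients — can be written down directly. This sketch only evaluates the convex fused-lasso objective; it is not the proximal point solver or the non-convex extension of the talk, and the parameter names are illustrative.

```python
import numpy as np

def fused_logistic_objective(beta, X, y, lam1, lam2):
    """Logistic loss with a fused lasso penalty.

    y in {-1, +1}; lam1 weights sparsity (sum |beta_j|), lam2 weights the
    fusion term (sum |beta_{j+1} - beta_j|) that encourages piecewise-constant
    coefficient profiles, matching the spatial structure of arrayCGH data.
    """
    margins = y * (X @ beta)
    loss = np.sum(np.logaddexp(0.0, -margins))  # sum of log(1 + exp(-m_i))
    penalty = lam1 * np.sum(np.abs(beta)) + lam2 * np.sum(np.abs(np.diff(beta)))
    return loss + penalty
```

The fusion term is what distinguishes this from a plain Lasso: neighboring probes on the genome are pushed to share a coefficient, so estimated profiles segment into constant blocks.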

*Leveraging spatial information in network modeling, with applications to bicycle sharing data*

**November 13, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Bailey Fosdick

### Abstract

Statistical analyses of social networks often focus on characterizing the relationship between the structure of network relations and individual-level and pairwise-level attributes of the nodes. An attribute relatively ignored in the past that has recently received much attention is the physical locations of nodes. One example of a social network that we might expect to exhibit strong spatial dependence is city bicycle sharing systems, where bike stations are the nodes and the number of trips between stations is the relation of interest. These transportation networks are interesting as the relationship between the network and physical space is likely very different than that for other social networks, such as friendships between people. In this talk we describe bicycle sharing data for Washington, D.C., discuss the issues with using traditional social network models to model the data, and propose potential alternative models that can account for the unique aspects of the data.

*An Asynchronous Scalable Distributed Expectation-Maximization Algorithm For Massive Data: The DEM Algorithm*

**November 20, 2013, 1:15pm – 2:15pm**

Room 150

Speaker: Sanvesh Srivastava

### Abstract

Massive data with complex latent structures have become common across disciplines. The computer architectures to store these data are also rapidly evolving. Classical iterative statistical algorithms, such as Expectation-Maximization (EM), for fitting models with latent structures are practically infeasible for these data for two main reasons: the massive size of the data and the large number of parameters required to model the complex dependencies in the data. These two limitations are relaxed by the Distributed Iterative Statistical Computing (DISC) framework presented in this work for implementing iterative statistical algorithms by taking advantage of widely available computing power, such as clusters of computers. Using EM as a concrete example of an iterative algorithm, DISC extends and scales it for massive data as DISC-EM (DEM). We analyze the convergence properties of the sequence of parameter estimates generated by DEM and show that DEM retains the attractive properties of EM: monotone ascent of the log likelihood at each iteration and stability of iterations. DEM can also be easily implemented in cluster and grid computing environments using the R package disc and existing EM implementations. To illustrate its application, we use DEM to estimate the effect of movie genres on their ratings in movie ratings data.
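The structural reason EM distributes well is that the E-step decomposes over data chunks into additive sufficient statistics, with only a small global M-step. The sketch below shows this pattern for a one-dimensional Gaussian mixture; it is a toy illustration of the map-reduce structure, not the DEM algorithm or the `disc` package, and all names (including the quantile-based initialization) are illustrative.

```python
import numpy as np

def e_step_chunk(x, pi, mu, sigma2):
    """Per-chunk E-step: responsibilities reduced to sufficient statistics."""
    # (n, K) unnormalized posterior weights of each component for each point
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) \
              / np.sqrt(2 * np.pi * sigma2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # additive statistics: soft counts, first and second weighted moments
    return r.sum(axis=0), r.T @ x, r.T @ (x ** 2)

def em_distributed(chunks, K, n_iter=50):
    """EM for a 1D Gaussian mixture with the E-step split across chunks."""
    allx = np.concatenate(chunks)
    mu = np.quantile(allx, (np.arange(K) + 0.5) / K)  # spread initial means
    sigma2 = np.full(K, allx.var())
    pi = np.full(K, 1.0 / K)
    n = len(allx)
    for _ in range(n_iter):
        # E-step runs independently per chunk -- the distributable part
        stats = [e_step_chunk(x, pi, mu, sigma2) for x in chunks]
        Nk = sum(s[0] for s in stats)
        Sx = sum(s[1] for s in stats)
        Sxx = sum(s[2] for s in stats)
        # global M-step needs only the pooled sufficient statistics
        pi, mu = Nk / n, Sx / Nk
        sigma2 = Sxx / Nk - mu ** 2
    return pi, mu, sigma2
```

Because each worker returns only K soft counts and 2K moments, communication cost is independent of the chunk size — the property that makes this pattern attractive for massive data.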

*Robust Non-negative Matrix Factorization*

**December 4, 2013, 1:30pm – 2:30pm**

Room 150

Speaker: Kenny Lopiano

### Abstract

Principal Components Analysis (PCA) is a well-known multivariate statistical method that has been used to decompose matrices in many applications. In many cases, however, matrices represent data sets that are characterized by strictly non-negative values. An alternative matrix decomposition method, non-negative matrix factorization (NMF), is a relatively modern method developed to obtain a low-rank decomposition of matrices for the special case when all elements are non-negative. NMF decomposes a non-negative data matrix into non-negative linear combinations of non-negative basis vectors, resulting in a parts-based structure that is more intuitive to interpret. To cope with the problem of outliers and contaminated matrices, we developed several robust non-negative matrix factorization algorithms. The algorithms simultaneously handle outliers and control the sparsity of the decomposition. Both fully automated and interactive versions of the algorithms are considered.

In this talk, I will review existing NMF algorithms and present the new robust NMF algorithms. Their advantages and limitations will be discussed and illustrated using examples related to image compression and image cleaning. The talk is based on joint work with Jiayang Sun (CWRU), Yifan Xu (CWRU), and Stanley Young (NISS).
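As background for the robust variants, the classical NMF fit V ≈ WH with non-negative factors can be sketched with the standard Frobenius-loss multiplicative updates of Lee and Seung. This is the baseline algorithm, not the robust methods introduced in the talk; the function name and the small `eps` guard against division by zero are illustrative choices.

```python
import numpy as np

def nmf(V, r, n_iter=500, seed=0, eps=1e-9):
    """Rank-r NMF of a non-negative matrix V via multiplicative updates.

    Returns W (n, r) and H (r, m) with V approximately equal to W @ H.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + 0.1   # positive random initialization
    H = rng.random((r, m)) + 0.1
    for _ in range(n_iter):
        # multiplicative updates keep every entry non-negative by construction
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis vectors
    return W, H
```

Because the updates multiply by non-negative ratios, non-negativity is preserved automatically — no projection step is needed — which is part of why the parts-based interpretation survives the optimization.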

## Professional Development Lunch

**December 11, 2013, 12:00pm – 1:30pm**

SAMSI Commons