## Spring Semester 2021

*Lecture: TBD*

**March 10, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Wesley Hamilton, UNC CH

### Abstract

TBD

*Lecture: Dense Weighted Networks Featuring Communities with Augmented Degree Correction*

**March 3, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Benjamin Leinwand, UNC CH

### Abstract

Dense networks with weighted connections often exhibit a community-like structure: although most nodes are connected to each other, different patterns of edge weights may emerge depending on each node’s community membership. We propose a new framework for generating and estimating dense, weighted networks with potentially different connectivity patterns across different communities. The proposed model relies on a particular class of functions that map individual node characteristics to the edges connecting those nodes, allowing for flexibility while requiring few parameters relative to the number of edges. By leveraging these estimation techniques, we also develop a bootstrap methodology for generating new networks on the same set of vertices, which may be useful in circumstances where multiple data sets cannot be collected.

*Lecture: Sufficient Dimension Reduction in Time Series*

**February 24, 2021 / 11:00am – 1:00pm**

**(Meet and Greet with Postdocs and Grad Students 10:30am-11:00am)**

Virtual

**Speaker:** Seyed Samadi, Southern Illinois University

### Abstract

Dimensionality reduction has always been one of the most important and challenging problems in high-dimensional data analysis. In the context of time series analysis, we are interested in estimating the conditional mean and variance functions. Using the central and central mean subspaces, which preserve sufficient information about the response, one can estimate the unknown mean and variance functions. There are different approaches in the literature to estimating the time series central mean subspace (TS-CMS); however, those methods are computationally intensive and often infeasible in practice. In this talk, we describe a Fourier transformation technique for estimating the TS-CMS. The proposed estimators are shown to be consistent and asymptotically normal under mild conditions. Simulation results and a real data analysis will be presented to demonstrate the performance of our methodology and compare it with existing methods.

**Working Group Summary in the Data Science in the Social and Behavioral Sciences program at 1:15pm:**

- Adam Lilly (Causality Traditions)
- Austin Ferguson (Psych network and viewpoint modeling)
- Ed Tam (Causal Networks)

*Lecture: Hypothesis Testing in Nonlinear Function on Scalar Regression with Application to Child Growth Study*

**February 17, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Mityl Biswas, North Carolina State University

### Abstract

We propose a kernel machine-based hypothesis testing procedure in nonlinear function-on-scalar regression model. Our research is motivated by the Newborn Epigenetic Study (NEST) where the question of interest is whether a pre-specified group of toxic metals is associated with child growth. We take the child growth trajectory as the functional response, and model the toxic metal measurements jointly using a nonlinear function. We use a kernel machine approach to model the unknown function and transform the hypothesis of no effect to an appropriate variance components test. We demonstrate our proposed methodology using a simulation study and by applying it to analyze the NEST data.

*Lecture: Theory of Deep Convolutional Neural Networks*

**February 10, 2021 / 11:00am – 1:00pm**

**(Meet and Greet with Postdocs and Grad Students 10:30am-11:00am)**

Virtual

**Speaker:** Ding-Xuan Zhou, City University of Hong Kong and SAMSI

### Abstract

Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, natural language processing, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But a theoretical foundation for understanding the modelling, approximation, or generalization ability of deep learning models with network architectures is still lacking. Here we are interested in deep convolutional neural networks (CNNs). The convolutional architecture gives essential differences between deep CNNs and fully-connected neural networks, and the classical approximation theory for fully-connected networks developed around 30 years ago does not apply. This talk describes an approximation theory of deep CNNs associated with the rectified linear unit (ReLU) activation function. In particular, we prove the universality of deep CNNs, meaning that a deep CNN can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. We also show that deep CNNs perform at least as well as fully-connected neural networks for approximating general functions, and much better for approximating radial functions in high dimensions.

**Working Group Summary in the Data Science in the Social and Behavioral Sciences program at 1:15pm:**

- Benjamin Leinwand (Networks of Networks working group)
- Alejandro Martinez (Simulation to Understand working group)
- Wesley Hamilton (Brain Networks working group)

*Lecture: Modeling and Parameter Subset Selection for Fibrin Matrix Polymerization Kinetics*

**February 7, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Katherine Pearce, SAMSI RA

### Abstract

In this talk, we present a model for fibrin matrix polymerization in a biomimetic wound healing application, together with an algorithm for parameter subset selection. We first contextualize the model by briefly summarizing the relevant details of hemostasis, the initial stage of wound healing in which there is rapid accumulation of fibrin matrix (our primary quantity of interest, or QoI). We then discuss the associated chemical reaction network and idealized ODE system derivation, as well as relevant conservation laws and constraints. An optimization problem is subsequently formulated for (reaction rate) parameter estimation, and we illustrate our iterative procedure, which, within each iteration: (i) determines candidates for non-identifiability, (ii) fixes non-identifiable parameters at nominal values, and (iii) thereby decreases the magnitude of the least squares residual objective function. Lastly, we describe how our procedure may be used for model reduction through the elimination of reactions associated with the most non-identifiable rate constants.
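The talk's subset-selection algorithm is not reproduced here, but the general idea behind step (i), flagging candidates for non-identifiability, can be illustrated with a standard sensitivity-based check: build a finite-difference sensitivity matrix of the model outputs with respect to the parameters, and inspect the right singular vector of its smallest singular value. This is a background sketch under our own assumptions (the toy `model` signature and all numbers are hypothetical), not the speaker's procedure.

```python
import numpy as np

def sensitivity_matrix(model, params, t, h=1e-6):
    """Finite-difference sensitivities d(model)/d(param_k) at times t."""
    base = model(params, t)
    S = np.empty((len(t), len(params)))
    for k in range(len(params)):
        step = h * max(abs(params[k]), 1.0)  # relative step size
        p = params.copy()
        p[k] += step
        S[:, k] = (model(p, t) - base) / step
    return S

def least_identifiable(model, params, t):
    """Index of the parameter most aligned with the smallest singular direction,
    i.e. the direction in parameter space the data are least informative about."""
    S = sensitivity_matrix(model, params, t)
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    return int(np.argmax(np.abs(Vt[-1])))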

*Lecture: Finding Significant Communities in Cross-correlation Networks Derived from Multi-view Data*

**January 27, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Miheer Dewaskar, SAMSI RA

### Abstract

Multi-view data, obtained by taking multiple types of measurements of the same samples, is now common in many scientific disciplines like genomics, ecology, and climate science. An important exploratory problem in the analysis of such data is to identify interactions between features from the different measurement types. Assuming linear association, these interactions can be captured by a bipartite cross-correlation network, whose nodes are features from two distinct measurement types and edges represent feature pairs that are truly correlated. We will introduce the Bimodule Search Procedure (BSP), which performs repeated hypothesis tests on the data to find communities in the bipartite cross-correlation network and demonstrate its application to the problem of eQTL analysis in genomics. BSP works directly with the data rather than only the sample correlation matrices, and it is thus able to borrow strength and account for correlations within features of the same measurement type. We will also briefly discuss some theoretical questions motivated by community detection in correlation networks and other correlation network mining procedures.

## Fall Semester 2020

*Lecture: Advancements in Hybrid Iterative Methods for Inverse Problems*

**September 16, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Julianne Chung, Virginia Tech

### Abstract

In many physical systems, measurements can only be obtained on the exterior of an object (e.g., the human body or the earth’s crust), and the goal is to estimate the internal structures. In other systems, signals measured from machines (e.g., cameras) are distorted, and the aim is to recover the original input signal. These are natural examples of inverse problems that arise in fields such as medical imaging, astronomy, geophysics, and molecular biology.

Hybrid iterative methods are increasingly being used to solve large, ill-posed inverse problems, due to their desirable properties of (1) avoiding semi-convergence, so that later reconstructions do not become dominated by noise, and (2) enabling adaptive and automatic regularization parameter selection. In this talk, we describe some recent advancements in hybrid iterative methods for computing solutions to large-scale inverse problems. First, we consider a hybrid approach based on the generalized Golub-Kahan bidiagonalization for computing Tikhonov regularized solutions to problems where explicit computation of the square root and inverse of the covariance kernel for the prior covariance matrix is not feasible. This is useful for large-scale problems where covariance kernels are defined on irregular grids or are only available via matrix-vector multiplication. Second, we describe flexible hybrid methods for solving $\ell_p$ regularized inverse problems, where we approximate the $p$-norm penalization term as a sequence of 2-norm penalization terms using adaptive regularization matrices, and we exploit flexible preconditioning techniques to efficiently incorporate the weight updates. We introduce a flexible Golub-Kahan approach within a Krylov-Tikhonov hybrid framework, such that our approaches extend to general (non-square) $\ell_p$ regularized problems. Numerical examples from dynamic photoacoustic tomography, space-time deblurring, and passive seismic tomography demonstrate the range of applicability and effectiveness of these approaches.
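As a small-scale reference point for the Tikhonov regularization these methods build on, here is a direct SVD-based Tikhonov solve via filter factors. This is standard textbook material, not the speaker's generalized Golub-Kahan or flexible Krylov approach, and would never be used at the problem sizes the talk targets; the variable names are ours.

```python
import numpy as np

def tikhonov_svd(A, b, lam):
    """Solve min ||Ax - b||^2 + lam^2 ||x||^2 directly via the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Filter factors s_i^2 / (s_i^2 + lam^2) damp components belonging
    # to small singular values; dividing by s_i gives f_i below.
    f = s / (s**2 + lam**2)
    return Vt.T @ (f * (U.T @ b))
```

As `lam` grows, small singular values are filtered more aggressively and the solution norm shrinks, which is the trade-off the automatic parameter-selection rules in hybrid methods navigate.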

*Lecture: Sharp 2-norm Error Bounds for the Conjugate Gradient Method and LSQR*

**September 23, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Eric Hallman, North Carolina State University

### Abstract

When running any iterative algorithm it is useful to know when to stop. Here we review the conjugate gradient method, an iterative method for solving $Ax=b$ where $A$ is symmetric positive definite, as well as estimates for the 2-norm error $\|x-x_*\|_2$, where $x_*$ is the solution to the linear system. We introduce a new method for computing an upper bound on the 2-norm error, and show that given certain mild assumptions our bounds are optimal. Experimental results are discussed, as well as the implications of our work for solving the least-squares problem $\min_x \|Ax-b\|_2$ using the iterative algorithm LSQR.
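For readers who want the baseline algorithm in front of them, a minimal textbook conjugate gradient iteration (stopping on the residual norm, which is exactly the naive criterion the talk's error bounds improve on) looks like this; it is background, not the speaker's estimator.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Textbook CG for symmetric positive definite A, stopping on ||r||."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)   # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p  # A-conjugate direction update
        rs_old = rs_new
    return x
```

Note that the stopping test monitors the residual $b - Ax$, not the error $x - x_*$; a small residual does not guarantee a small error when $A$ is ill-conditioned, which is what motivates computable 2-norm error bounds.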

*Lecture: Convergence of the Parameters in Mixture Models with Repeated Measurements*

**September 30, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Yun Wei, SAMSI

### Abstract

Latent structure models with many observed variables are among the most powerful and widely used tools in statistics for learning about heterogeneity within data population(s). An important canonical example of such models is the mixture of product distributions. We consider finite mixtures of product distributions with the special structure that the product distributions in each mixture component are also identically distributed. In this setup, each mixture component consists of samples from repeated measurements, and thus such data are exchangeable sequences. Applications of the model include psychological studies and topic modeling.

We show that with sufficient repeated measurements, a model that is not originally identifiable becomes identifiable. The posterior contraction rate for the parameter estimation is also obtained and it shows that repeated measurements are beneficial for estimating parameters in each mixture component. Such results hold for general probability kernels including all regular exponential families and can be applied to hierarchical models.

Based on joint work with Xuanlong Nguyen.

*Lecture: Randomized Approaches to Accelerate MCMC Algorithms for Bayesian Inverse Problems*

**October 7, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Arvind Saibaba, North Carolina State University

### Abstract

Markov chain Monte Carlo (MCMC) approaches are traditionally used for uncertainty quantification in inverse problems where the physics of the underlying sensor modality is described by a partial differential equation (PDE). However, the use of MCMC algorithms is prohibitively expensive in applications where each log-likelihood evaluation may require hundreds to thousands of PDE solves corresponding to multiple sensors; i.e., spatially distributed sources and receivers perhaps operating at different frequencies or wavelengths depending on the precise application. In this talk, I will show how to mitigate the computational cost of each log-likelihood evaluation by using several randomized techniques and embed these randomized approximations within MCMC algorithms. These MCMC algorithms are computationally efficient methods for quantifying the uncertainty associated with the reconstructed parameters. We demonstrate the accuracy and computational benefits of our proposed algorithms on a model application from diffuse optical tomography where we invert for the spatial distribution of optical absorption.

*Lecture: Individual Level Always Survivor, Direct, Spillover Effects with Applications*

**October 14, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Jaffer Zaidi, SAMSI

### Abstract

We provide investigators with the ability to quantify individual level always survivor, direct, and spillover effects. The survivor average causal effect is commonly identified with more assumptions than those guaranteed by the design of a randomized clinical trial. This paper demonstrates that individual level causal effects in the 'always survivor' principal stratum can be identified with no stronger identification assumptions than randomization. We illustrate the practical utility of our methods using data from a clinical trial on patients with prostate cancer. We also provide another application on the spillover effects of randomized get-out-the-vote campaigns. Our methodology is the first and, as yet, only proposed procedure that enables detecting individual level causal effects in the presence of truncation by death using only the assumptions that are guaranteed by the design of the clinical trial.

*Lecture: Scalable Bayesian Inference for Time Series via Divide-and-conquer*

**October 21, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Deborshee Sen, SAMSI

### Abstract

Bayesian computational algorithms tend to scale poorly as the size of data increases. This has led to the development of divide-and-conquer-based approaches for scalable inference. These divide the data into chunks, perform inference for each chunk in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that lack any theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer algorithm, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach.
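The split-then-combine step can be made concrete in the simplest conjugate setting: for independent Gaussian observations with a Gaussian prior on the mean, chunk posteriors computed under a fractionated prior combine exactly into the full-data posterior by precision weighting. This toy sketch is ours, for intuition only; it does not reflect the dependent-data algorithm of the talk.

```python
import numpy as np

def chunk_posteriors(y_chunks, sigma2, prior_var):
    """Per-chunk conjugate posterior (mean, var) for a Gaussian mean.
    The N(0, prior_var) prior is downweighted by the number of chunks K
    so that recombining the chunks recovers exactly one copy of the prior."""
    K = len(y_chunks)
    posts = []
    for y in y_chunks:
        prec = 1.0 / (K * prior_var) + len(y) / sigma2  # fractionated prior
        mean = (y.sum() / sigma2) / prec
        posts.append((mean, 1.0 / prec))
    return posts

def combine(posts):
    """Precision-weighted combination of the chunk posteriors."""
    precs = np.array([1.0 / v for _, v in posts])
    means = np.array([m for m, _ in posts])
    var = 1.0 / precs.sum()
    return (precs * means).sum() * var, var
```

In this conjugate case the combination is exact; the hard part, which the talk addresses, is obtaining guarantees when the observations are dependent and the chunk posteriors are only available by sampling.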

*Lecture: Probabilistic Learning on Manifolds*

**October 28, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Ruda Zhang, SAMSI

### Abstract

Probabilistic models of data sets often exhibit salient geometric structure. Such a phenomenon is summed up in the manifold distribution hypothesis, and can be exploited in probabilistic learning tasks such as density estimation and generative modeling. In this talk I present a framework for probabilistic learning on manifolds (PLoM), which uses manifold learning to discover low-dimensional structures within high-dimensional data, and exploits topological properties of the learned manifold to efficiently build probabilistic models. A joint distribution is partitioned into a marginal distribution on the manifold and conditional distributions on normal spaces of the manifold. The marginal distribution can be estimated using Riemannian kernels, and the conditional distributions can be estimated discretely by normal-bundle bootstrap or continuously using Gaussian kernels. Combining the marginal and conditional models gives a joint generative model. I will also talk about related algorithms and software development, and potential applications.

*Lecture: Competition and Spreading of Low and High-Quality Information in Online Social Networks*

**November 4, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Diego Fregolente, SAMSI

### Abstract

The advent of online social networks as major communication platforms for the exchange of information and opinions is having a significant impact on our lives by facilitating the sharing of ideas. Through networks such as Twitter and Facebook, users are exposed daily to a large number of transmissible pieces of information that compete to attain success. Such information flows have increasingly consequential implications for politics and policy, making the questions of discrimination and diversity more important in today’s online information networks than ever before. However, while one would expect the best ideas to prevail, empirical evidence suggests that high-quality information has no competitive advantage. We investigate this puzzling lack of discriminative power through an agent-based model that incorporates behavioral limitations in managing a heavy flow of information and measures the relationship between the quality of an idea and its likelihood to become prevalent at the system level. We show that both information overload and limited attention contribute to a degradation in the system’s discriminative power. A good tradeoff between discriminative power and diversity of information is possible according to the model. However, calibration with empirical data characterizing information load and finite attention in real social media reveals a weak correlation between quality and popularity of information. In these realistic conditions, the model provides an interpretation for the high volume of viral misinformation we observe online.
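A stripped-down version of such an agent-based model, with limited attention represented by a finite feed and resharing probability proportional to meme quality, might look like the sketch below. All parameter values are hypothetical and this is not the authors' calibrated model; it only illustrates the mechanism of quality competing under finite attention.

```python
import random

def simulate(n_agents=50, memory=5, steps=2000, seed=0):
    """Toy meme-competition model: each step, one agent either introduces a
    new meme (with a random quality) or reshares one from its limited feed,
    choosing what to reshare with probability proportional to quality."""
    rng = random.Random(seed)
    feeds = [[] for _ in range(n_agents)]   # each agent's finite attention
    quality, shares = {}, {}
    next_id = 0
    for _ in range(steps):
        a = rng.randrange(n_agents)
        if feeds[a] and rng.random() < 0.75:           # reshare from memory
            weights = [quality[m] for m in feeds[a]]
            meme = rng.choices(feeds[a], weights=weights)[0]
        else:                                          # introduce a new meme
            meme, next_id = next_id, next_id + 1
            quality[meme] = rng.uniform(0.01, 1.0)
            shares[meme] = 0
        shares[meme] += 1
        b = rng.randrange(n_agents)                    # push to a random agent
        feeds[b].append(meme)
        del feeds[b][:-memory]            # limited attention: keep latest few
    return quality, shares
```

With this structure one can compute the correlation between `quality` and `shares` across memes and study how it degrades as `memory` shrinks or the meme arrival rate grows, which is the qualitative question the talk's model addresses quantitatively.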

*Lecture: Retrospective Causal Inference via Matrix Completion, with an Evaluation of the Effect of European Integration on Labour Market Outcomes*

**November 11, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Jason Poulos, SAMSI

### Abstract

We propose a method of *retrospective* counterfactual prediction in panel data settings with units exposed to treatment after an initial time period (later-treated), and always-treated units, but no never-treated units. We invert the standard setting by using the observed post-treatment outcomes to predict the counterfactual pre-treatment potential outcomes under treatment for the later-treated units. We impute the missing outcomes via a matrix completion estimator with a propensity- and elapsed-time weighted objective function that corrects for differences in the covariate distributions and elapsed time since treatment between groups. Our methodology is motivated by evaluating the effect of two milestones of European integration on the share of cross-border workers in sending border regions. We provide evidence that opening the border increased the probability of working beyond the border in Eastern European regions.