## Spring Semester 2021

*Lecture: Method of Moments in Mixture Models*

**May 12, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Yun Wei, SAMSI

### Abstract

The method of moments, proposed by Karl Pearson in 1894, is one of the most widely used methods in statistics for parameter estimation: it solves a system of equations that matches population moments to empirical moments. Following the work of [Bruce Lindsay 1989], a denoised method of moments was recently proposed by [Yihong Wu and Pengkun Yang 2018] for Gaussian mixtures and proved to be minimax optimal. The key idea is to project the estimated moments onto the moment space of the corresponding mixing measures. We extend the denoised method of moments to mixtures of natural exponential families with quadratic variance functions, including the Gaussian, Binomial, Gamma, Poisson, and Geometric distributions. The proposed algorithm is also proved to be minimax optimal.

This is based on ongoing work with Sayan Mukherjee.
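As a toy illustration of the classical (un-denoised) moment-matching step, the sketch below fits a Gamma distribution, one of the exponential families mentioned above, by solving the two-moment system in closed form. This is a hypothetical minimal example, not the speaker's denoised algorithm:

```python
import random
import statistics

def gamma_mom(samples):
    # For Gamma(shape k, scale theta): mean = k*theta, variance = k*theta^2,
    # so matching the first two moments gives k = mean^2/var, theta = var/mean.
    m = statistics.fmean(samples)
    v = statistics.variance(samples)
    return m * m / v, v / m

random.seed(0)
data = [random.gammavariate(3.0, 2.0) for _ in range(100_000)]
k_hat, theta_hat = gamma_mom(data)
```

With enough samples the estimates land near the true (3, 2); the denoised variant discussed in the talk additionally projects the empirical moments onto the valid moment space before solving.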

*Lecture: Competition of Information in Online Social Media*

**May 5, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Diego Fregolente, SAMSI

### Abstract

Online social networks are rapidly complementing and even replacing person-to-person social contacts. With hundreds of millions of participants, they have become a perfect breeding ground for the spreading of misinformation and fake news. Through networks such as Twitter and Facebook, users are exposed daily to a large number of transmissible pieces of information that compete to spread widely. Such information flows have increasingly consequential implications for politics and policy, making the questions of discrimination and diversity more important in today’s online information networks than ever before. However, while one would expect the best ideas to prevail, empirical evidence suggests that high-quality information has no competitive advantage. We investigate this puzzling lack of discriminative power through an agent-based model that incorporates behavioral limitations in managing a heavy flow of information and measures the relationship between the quality of an idea and its likelihood to become prevalent at the system level. We show that both information overload and limited attention contribute to a degradation in the system’s discriminative power. A good tradeoff between discriminative power and diversity of information is possible according to the model. However, calibration with empirical data characterizing information load and finite attention in real social media reveals a weak correlation between quality and popularity of information. In these realistic conditions, the model provides an interpretation for the high volume of viral misinformation we observe online.

*Lecture: Informative Sensitivity Analysis for Direct Effects in Clinical Trials*

**April 28, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Jaffer Zaidi, SAMSI

### Abstract

The analysis of direct effects has a long history in statistics. The survivor average causal effect and natural direct effect are commonly identified with more assumptions than those guaranteed by the design of a randomized clinical trial. A practical and causally interpretable sensitivity analysis for individual-level direct effects is developed and illustrated for discrete and continuous outcomes. Our sensitivity analysis enables statisticians to precisely and effectively incorporate clinical and regulatory judgement into sound clinical decisions. Our methodology is the first and, as yet, only proposed procedure that quantifies individual-level causal effects in the presence of truncation by death and/or informative censoring using only the assumptions that are guaranteed by the design of the randomized clinical trial.

*Lecture: Are Deep Learning Models Superior for Missing Data Imputation in Large Surveys? Evidence from an Empirical Comparison*

**April 21, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Jason Poulos, SAMSI

### Abstract

Multiple imputation (MI) is the state-of-the-art approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is the most widely used MI method, but it lacks theoretical foundation and is computationally intensive. Recently, MI methods based on deep learning models have been developed with encouraging results in small studies. However, there has been limited research on systematically evaluating their performance in realistic settings compared to MICE, particularly in large-scale surveys. This paper provides a general framework for using simulations based on real survey data and several performance metrics to compare MI methods. We conduct extensive simulation studies based on the American Community Survey data to compare repeated sampling properties of four machine learning based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation network, and multiple imputation using denoising autoencoders. We find the deep learning based MI methods dominate MICE in terms of computational time; however, MICE with classification trees consistently outperforms the deep learning MI methods in terms of bias, mean squared error, and coverage under a range of realistic settings.

*Lecture: Deep Spectral Q-learning in Infinite Horizon*

**April 14, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Yuhe Gao, SAMSI RA

### Abstract

A dynamic treatment regime assigns personalized treatments to subjects in a multi-stage decision process. In a clinical setting, such regimes are developed from patient information collected over a long period of study. The covariates of patients involved in decision making often include data collected at different frequencies. In this study, we propose a deep spectral Q-learning framework in which a Principal Component Analysis (PCA) based method handles the mixed-frequency data together with a deep Q-learning model. The regime estimated by this method is proved to converge to the optimal regime. Simulation studies show the advantages of the proposed method, and we also demonstrate it on a diabetes dataset.

*Lecture: Comparing Effects Across Linear Regression Models using Structural Equation Modeling*

**April 7, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Adam Lilly, SAMSI RA

### Abstract

It is common for sociologists to ask how the effect of a focal independent variable changes after a variable or group of variables is added to a linear regression. One way to obtain confidence intervals and/or test the statistical significance of this change is to use seemingly unrelated estimation to combine estimates from the two regressions, but this is often not done in practice. Researchers regularly interpret the change in the statistical significance of the focal independent variable, even though this approach is ill-advised. Mediation analysis conducted in the structural equation modeling (SEM) framework is an improvement, as the analyst can estimate the direct effect of the focal independent variable as well as its indirect effect transmitted through a mediator. It is known that in linear models, the indirect effect is equivalent to the change in the effect of the independent variable after including the mediator in a regression. Taking advantage of the statistical equivalence between mediation and confounding, I demonstrate that SEM can also be used to estimate the change in the effect of a focal independent variable even when a covariate is thought to be a confounder rather than a mediator. Confidence intervals for the change are easily obtained in most SEM software using either the delta method or bootstrapping procedures, and the approach can handle situations where a group of covariates is simultaneously introduced. I also discuss some potential extensions to this work in progress.
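The linear-model identity the abstract invokes, that the indirect effect (product of coefficients) equals the change in the focal coefficient (difference of coefficients), holds exactly in sample OLS. A minimal numerical check with simulated data and illustrative coefficient values, not SEM software output:

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

# Simulate a mediation model: x -> m -> y, plus a direct path x -> y.
random.seed(1)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
m = [0.7 * xi + random.gauss(0, 1) for xi in x]              # mediator model
y = [0.5 * xi + 0.4 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

x, m, y = demean(x), demean(m), demean(y)
sxx = sum(a * a for a in x)
sxm = sum(a * b for a, b in zip(x, m))
smm = sum(a * a for a in m)
sxy = sum(a * b for a, b in zip(x, y))
smy = sum(a * b for a, b in zip(m, y))

tau = sxy / sxx                    # total effect: y ~ x
alpha = sxm / sxx                  # mediator regression: m ~ x
# Direct and mediator effects from y ~ x + m (2x2 normal equations):
det = sxx * smm - sxm * sxm
gamma = (smm * sxy - sxm * smy) / det
beta = (sxx * smy - sxm * sxy) / det
# Change in the focal coefficient equals the indirect effect alpha * beta.
```

The equality `tau - gamma == alpha * beta` is algebraic (Cochran's formula), so it holds up to floating-point error in every sample, not just on average.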

*Lecture: Combinatorial Probability Program – Opening Workshop Highlights*

**March 31, 2021 / 11:00am – 1:00pm**

Virtual

**Speakers:** Yun Wei, SAMSI Postdoc and Miheer Dewaskar, SAMSI RA

### Abstract

This seminar will cover two of the Program on Combinatorial Probability’s working groups:

- Yun Wei, SAMSI Postdoc, will speak about Random Simplicial Complex models, led by Sayan Mukherjee of Duke University.
- Miheer Dewaskar, SAMSI RA, will speak about the Phase Transitions and Algorithms working group.

*Lecture: Targeted Stochastic Gradient Markov Chain Monte Carlo for Hidden Markov Models with Rare States*

**March 24, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Deborshee Sen, SAMSI

### Abstract

Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler. This makes them computationally slow as the length of the time series increases, which has motivated recent development of sub-sampling based approaches for posterior inference. These make use of the mixing of the hidden Markov process to approximate the full posterior by using small minibatches of the data, an idea related to stochastic gradient MCMC. In the presence of imbalanced data resulting from rare latent states, minibatches often exclude rare state data, leading to inaccurate inference and prediction/detection of rare events. We propose a targeted sub-sampling approach that over-samples the rare states when calculating the stochastic gradient of parameters associated with them. Our approach uses an initial clustering on the entire observation sequence to construct minibatch weights that minimize the variance in gradient estimation within stochastic gradient MCMC. This leads to improved sampling efficiency, in particular in settings where the rare states correspond to extreme observations. We demonstrate substantial gains in predictive and inferential accuracy on real and synthetic examples.
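The core reweighting idea, over-sample the rare state and correct by inverse inclusion weights so the stochastic gradient stays unbiased, can be sketched on a toy "gradient" (here just a mean). The two-stratum setup and all names are illustrative, not the paper's clustering-based algorithm:

```python
import random

random.seed(7)
# Toy per-observation "gradient" contributions: a common state and a
# rare state whose contributions are large (an extreme-observation regime).
common = [random.gauss(0.0, 1.0) for _ in range(9_900)]
rare = [random.gauss(50.0, 1.0) for _ in range(100)]
data = common + rare
full_mean = sum(data) / len(data)  # target: the full-data gradient

def weighted_minibatch_mean(common, rare, n_c, n_r):
    # Over-sample the rare stratum, then reweight each stratum total by
    # (stratum size / draws from stratum) so the estimate stays unbiased.
    n = len(common) + len(rare)
    s_c = [random.choice(common) for _ in range(n_c)]
    s_r = [random.choice(rare) for _ in range(n_r)]
    total = sum(s_c) * (len(common) / n_c) + sum(s_r) * (len(rare) / n_r)
    return total / n

# Averaging many minibatch estimates recovers the full-data mean.
est = sum(weighted_minibatch_mean(common, rare, 50, 50) for _ in range(2_000)) / 2_000
```

A uniform 100-point minibatch would contain about one rare point on average (often zero), so unweighted estimates are far noisier; the targeted weights trade that variance away.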

*Lecture: Gaussian Process Subspace Regression*

**March 17, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Ruda Zhang, SAMSI

### Abstract

We propose a matrix-valued Gaussian process to estimate subspace-valued functions. This is motivated by parametric reduced-order modeling (PROM), where reduced bases of a function space are computed at a sample of parameter points, and one wants to estimate the reduced basis at new parameter points. Our method is extrinsic and intrinsic at the same time: with a class of multivariate Gaussian distributions on the Euclidean space, we induce a joint probability model on the Grassmann manifold, which has analogous properties to the multivariate Gaussian. The Gaussian process adopts a simple yet general correlation structure, and the prediction admits a compact summary of uncertainty and an efficient sampling strategy. For problems in model order reduction, our method can give a probabilistic prediction at a new parameter point, at about the same cost of computing a reduced basis at an existing parameter point.

*Lecture: Proteins and Persistence*

**March 10, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Wesley Hamilton, UNC CH

### Abstract

Protein fold classification is a classic problem in structural biology and bioinformatics. In this talk, I’ll present on recent work using persistent homology to quantify and understand the space of protein data objects. We compare quantifications based on persistent homology to other knot-theoretic quantifications and show that persistence picks out something quantitatively different using the AJIVE statistical tool. I’ll also present on ongoing work incorporating ideas from multi-parameter persistence to the analysis of protein structure. This is joint work with J.S. Marron, J.E. Borgert, and T. Hamelryck.

*Lecture: Dense Weighted Networks Featuring Communities with Augmented Degree Correction*

**March 3, 2021 / 11:00am – 1:00pm**

Virtual

**Speaker:** Benjamin Leinwand, UNC CH

### Abstract

Dense networks with weighted connections often exhibit a community-like structure, where although most nodes are connected to each other, different patterns of edge weights may emerge depending on each node’s community membership. We propose a new framework for generating and estimating dense, weighted networks with potentially different connectivity patterns across different communities. The proposed model relies on a particular class of functions that map individual node characteristics to the edges connecting those nodes, allowing for flexibility while requiring a relatively small number of parameters relative to the number of edges. By leveraging the estimation techniques, we also develop a bootstrap methodology for generating new networks on the same set of vertices, which may be useful in circumstances where multiple data sets cannot be collected.

*Lecture: Sufficient Dimension Reduction in Time Series*

**February 24, 2021 / 11:00am – 1:00pm**

**(Meet and Greet with Postdocs and Grad Students 10:30am-11:00am)**

Virtual

**Speaker:** Seyed Samadi, Southern Illinois University

### Abstract

Dimensionality reduction has always been one of the most important and challenging problems in high-dimensional data analysis. In the context of time series analysis, we are interested in estimating the conditional mean and variance functions. Using the central and central mean subspaces, which preserve sufficient information about the response, one can estimate the unknown mean and variance functions. There are different approaches in the literature to estimate the time series central mean subspace (TS-CMS). However, those methods are computationally intensive and not feasible in practice. In this talk, we describe the Fourier transformation technique that is used to estimate the TS-CMS. The proposed estimators are shown to be consistent and asymptotically normal under some mild conditions. Simulation results and a real data analysis will be presented to demonstrate the performance of our methodology and compare it with existing methods.

**Working Group Summary in the Data Science in the Social and Behavioral Sciences program at 1:15pm:**

- Adam Lilly (Causality Traditions)
- Austin Ferguson (Psych network and viewpoint modeling)
- Ed Tam (Causal Networks)

*Lecture: Hypothesis Testing in Nonlinear Function on Scalar Regression with Application to Child Growth Study*

**February 17, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Mityl Biswas, North Carolina State University

### Abstract

We propose a kernel machine-based hypothesis testing procedure in a nonlinear function-on-scalar regression model. Our research is motivated by the Newborn Epigenetic Study (NEST), where the question of interest is whether a pre-specified group of toxic metals is associated with child growth. We take the child growth trajectory as the functional response, and model the toxic metal measurements jointly using a nonlinear function. We use a kernel machine approach to model the unknown function and transform the hypothesis of no effect to an appropriate variance components test. We demonstrate our proposed methodology using a simulation study and by applying it to analyze the NEST data.

*Lecture: Theory of Deep Convolutional Neural Networks*

**February 10, 2021 / 11:00am – 1:00pm**

**(Meet and Greet with Postdocs and Grad Students 10:30am-11:00am)**

Virtual

**Speaker:** Ding-Xuan Zhou, City University of Hong Kong and SAMSI

### Abstract

Deep learning has been widely applied and brought breakthroughs in speech recognition, computer vision, natural language processing, and many other domains. The involved deep neural network architectures and computational issues have been well studied in machine learning. But a theoretical foundation for understanding the modelling, approximation, or generalization ability of deep learning models with these network architectures is still lacking. Here we are interested in deep convolutional neural networks (CNNs). The convolutional architecture gives essential differences between deep CNNs and fully-connected neural networks, and the classical approximation theory for fully-connected networks, developed around 30 years ago, does not apply. This talk describes an approximation theory of deep CNNs associated with the rectified linear unit (ReLU) activation function. In particular, we prove the universality of deep CNNs, meaning that a deep CNN can be used to approximate any continuous function to an arbitrary accuracy when the depth of the neural network is large enough. We also show that deep CNNs perform at least as well as fully-connected neural networks for approximating general functions, and much better for approximating radial functions in high dimensions.

**Working Group Summary in the Data Science in the Social and Behavioral Sciences program at 1:15pm:**

- Benjamin Leinwand (Networks of Networks working group)
- Alejandro Martinez (Simulation to Understand working group)
- Wesley Hamilton (Brain Networks working group)

*Lecture: Modeling and Parameter Subset Selection for Fibrin Matrix Polymerization Kinetics*

**February 7, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Katherine Pearce, SAMSI RA

### Abstract

In this talk, we present a model for fibrin matrix polymerization in a biomimetic wound healing application and an algorithm for parameter subset selection. We first contextualize the model by briefly summarizing the relevant details of hemostasis, the initial stage of wound healing in which there is rapid accumulation of fibrin matrix (our primary quantity of interest, or QoI). We then discuss the associated chemical reaction network and the derivation of an idealized ODE system, as well as relevant conservation laws and constraints. An optimization problem is subsequently formulated for estimating the reaction-rate parameters, and we illustrate our iterative procedure that, within each iteration, (i) determines candidates for non-identifiability, (ii) fixes non-identifiable parameters at nominal values, and (iii) decreases the magnitude of the least-squares residual objective function. Lastly, we describe how our procedure may be used for model reduction through the elimination of reactions associated with the most non-identifiable rate constants.

*Lecture: Finding Significant Communities in Cross-correlation Networks Derived from Multi-view Data*

**January 27, 2021 / 1:15pm – 2:15pm**

Virtual

**Speaker:** Miheer Dewaskar, SAMSI RA

### Abstract

Multi-view data, obtained by taking multiple types of measurements of the same samples, is now common in many scientific disciplines like genomics, ecology, and climate science. An important exploratory problem in the analysis of such data is to identify interactions between features from the different measurement types. Assuming linear association, these interactions can be captured by a bipartite cross-correlation network, whose nodes are features from two distinct measurement types and edges represent feature pairs that are truly correlated. We will introduce the Bimodule Search Procedure (BSP), which performs repeated hypothesis tests on the data to find communities in the bipartite cross-correlation network and demonstrate its application to the problem of eQTL analysis in genomics. BSP works directly with the data rather than only the sample correlation matrices, and it is thus able to borrow strength and account for correlations within features of the same measurement type. We will also briefly discuss some theoretical questions motivated by community detection in correlation networks and other correlation network mining procedures.


## Fall Semester 2020

*Lecture: Advancements in Hybrid Iterative Methods for Inverse Problems*

**September 16, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Julianne Chung, Virginia Tech

### Abstract

In many physical systems, measurements can only be obtained on the exterior of an object (e.g., the human body or the earth’s crust), and the goal is to estimate the internal structures. In other systems, signals measured from machines (e.g., cameras) are distorted, and the aim is to recover the original input signal. These are natural examples of inverse problems that arise in fields such as medical imaging, astronomy, geophysics, and molecular biology.

Hybrid iterative methods are increasingly being used to solve large, ill-posed inverse problems, due to their desirable properties of (1) avoiding semi-convergence, the phenomenon whereby later iterates become increasingly dominated by noise, and (2) enabling adaptive and automatic regularization parameter selection. In this talk, we describe some recent advancements in hybrid iterative methods for computing solutions to large-scale inverse problems. First, we consider a hybrid approach based on the generalized Golub-Kahan bidiagonalization for computing Tikhonov regularized solutions to problems where explicit computation of the square root and inverse of the covariance kernel for the prior covariance matrix is not feasible. This is useful for large-scale problems where covariance kernels are defined on irregular grids or are only available via matrix-vector multiplication. Second, we describe flexible hybrid methods for solving $\ell_p$ regularized inverse problems, where we approximate the $p$-norm penalization term as a sequence of 2-norm penalization terms using adaptive regularization matrices, and we exploit flexible preconditioning techniques to efficiently incorporate the weight updates. We introduce a flexible Golub-Kahan approach within a Krylov-Tikhonov hybrid framework, such that our approaches extend to general (non-square) $\ell_p$ regularized problems. Numerical examples from dynamic photoacoustic tomography, space-time deblurring, and passive seismic tomography demonstrate the range of applicability and effectiveness of these approaches.
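The reweighting idea behind the $\ell_p$-to-2-norm approximation can be sketched on a tiny separable denoising problem. This is a generic iteratively-reweighted illustration with $p = 1$ and made-up numbers, not the authors' Krylov-subspace implementation:

```python
# Sketch: min_x ||x - b||_2^2 + lam * ||x||_1, solved by repeatedly
# replacing |x_i| with the quadratic majorizer x_i^2 / (2|x_i^k|) + const,
# which makes each step a closed-form weighted 2-norm (ridge-like) problem.
b = [3.0, 0.1]
lam, eps = 1.0, 1e-8          # eps guards against division by zero
x = b[:]
for _ in range(200):
    x = [bi / (1.0 + lam / (2.0 * (abs(xi) + eps))) for bi, xi in zip(b, x)]
# The iterates approach the soft-threshold solution [2.5, 0.0].
```

Each pass holds the weights $1/(2|x_i^k|)$ fixed, solves the resulting quadratic exactly, and updates the weights; the hybrid methods in the talk do the analogous reweighting inside a flexible Krylov solver instead of coordinate-wise.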

*Lecture: Sharp 2-norm Error Bounds for the Conjugate Gradient Method and LSQR*

**September 23, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Eric Hallman, North Carolina State University

### Abstract

When running any iterative algorithm it is useful to know when to stop. Here we review the conjugate gradient method, an iterative method for solving $Ax = b$ where $A$ is symmetric positive definite, as well as estimates for the 2-norm error $\|x - x_*\|_2$, where $x_*$ is the solution to the linear system. We introduce a new method for computing an upper bound on the 2-norm error, and show that given certain mild assumptions our bounds are optimal. Experimental results are discussed, as well as the implications of our work for solving the least-squares problem $\min_x \|Ax - b\|_2$ using the iterative algorithm LSQR.

*Lecture: Convergence of the Parameters in Mixture Models with Repeated Measurements*

**September 30, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Yun Wei, SAMSI

### Abstract

Latent structure models with many observed variables are among the most powerful and widely used tools in statistics for learning about heterogeneity within data population(s). An important canonical example of such models is the mixture of product distributions. We consider a finite mixture of product distributions with the special structure that the product distributions in each mixture component are also identically distributed. In this setup, each mixture component consists of samples from repeated measurements, and thus such data are exchangeable sequences. Applications of the model include psychological studies and topic modeling.

We show that with sufficient repeated measurements, a model that is not originally identifiable becomes identifiable. The posterior contraction rate for the parameter estimation is also obtained and it shows that repeated measurements are beneficial for estimating parameters in each mixture component. Such results hold for general probability kernels including all regular exponential families and can be applied to hierarchical models.

Based on joint work with Xuanlong Nguyen.

*Lecture: Randomized Approaches to Accelerate MCMC Algorithms for Bayesian Inverse Problems*

**October 7, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Arvind Saibaba, North Carolina State University

### Abstract

Markov chain Monte Carlo (MCMC) approaches are traditionally used for uncertainty quantification in inverse problems where the physics of the underlying sensor modality is described by a partial differential equation (PDE). However, the use of MCMC algorithms is prohibitively expensive in applications where each log-likelihood evaluation may require hundreds to thousands of PDE solves corresponding to multiple sensors; i.e., spatially distributed sources and receivers perhaps operating at different frequencies or wavelengths depending on the precise application. In this talk, I will show how to mitigate the computational cost of each log-likelihood evaluation by using several randomized techniques and embed these randomized approximations within MCMC algorithms. These MCMC algorithms are computationally efficient methods for quantifying the uncertainty associated with the reconstructed parameters. We demonstrate the accuracy and computational benefits of our proposed algorithms on a model application from diffuse optical tomography where we invert for the spatial distribution of optical absorption.

*Lecture: Individual Level Always Survivor, Direct, Spillover Effects with Applications*

**October 14, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Jaffer Zaidi, SAMSI

### Abstract

We provide investigators with the ability to quantify individual-level always-survivor, direct, and spillover effects. The survivor average causal effect is commonly identified with more assumptions than those guaranteed by the design of a randomized clinical trial. This paper demonstrates that individual-level causal effects in the 'always survivor' principal stratum can be identified with no stronger identification assumptions than randomization. We illustrate the practical utility of our methods using data from a clinical trial on patients with prostate cancer. We also provide another application on the spillover effects of randomized get-out-the-vote campaigns. Our methodology is the first and, as yet, only proposed procedure that enables detecting individual-level causal effects in the presence of truncation by death using only the assumptions that are guaranteed by the design of the clinical trial.

*Lecture: Scalable Bayesian Inference for Time Series via Divide-and-conquer*

**October 21, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Deborshee Sen, SAMSI

### Abstract

Bayesian computational algorithms tend to scale poorly as the size of the data increases. This has led to the development of divide-and-conquer approaches for scalable inference. These divide the data into chunks, perform inference for each chunk in parallel, and then combine these inferences. While appealing theoretical properties and practical performance have been demonstrated for independent observations, scalable inference for dependent data remains challenging. In this work, we study the problem of Bayesian inference from very long time series. The literature in this area focuses mainly on approximate approaches that lack theoretical guarantees and may provide arbitrarily poor accuracy in practice. We propose a simple and scalable divide-and-conquer algorithm, and provide accuracy guarantees. Numerical simulations and real data applications demonstrate the effectiveness of our approach.
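The divide-and-conquer recipe is easiest to see in a conjugate toy case. For a normal mean with known variance and a flat prior, each chunk's subposterior is Gaussian, and the product of subposteriors combines exactly by precision weighting. This sketch is illustrative of the general pattern, not the paper's time-series algorithm:

```python
import random

random.seed(3)
sigma2 = 4.0
data = [random.gauss(2.0, sigma2 ** 0.5) for _ in range(10_000)]

def subposterior(chunk):
    # Normal-mean subposterior with flat prior and known variance:
    # N(chunk mean, sigma2 / len(chunk)); return (mean, precision).
    n = len(chunk)
    return sum(chunk) / n, n / sigma2

# Split the data into 10 chunks, fit each in isolation (parallelizable),
# then combine by precision weighting -- the product of the Gaussians.
chunks = [data[i::10] for i in range(10)]
subs = [subposterior(c) for c in chunks]
prec = sum(p for _, p in subs)
mean = sum(m * p for m, p in subs) / prec

full_mean, full_prec = subposterior(data)  # posterior from all the data
```

In this conjugate case the combination is exact; for dependent data such as time series, the chunks are not independent, which is the complication the talk addresses with accuracy guarantees.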

*Lecture: Probabilistic Learning on Manifolds*

**October 28, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Ruda Zhang, SAMSI

### Abstract

Probabilistic models of data sets often exhibit salient geometric structure. Such a phenomenon is summed up in the manifold distribution hypothesis, and can be exploited in probabilistic learning tasks such as density estimation and generative modeling. In this talk I present a framework for probabilistic learning on manifolds (PLoM), which uses manifold learning to discover low-dimensional structures within high-dimensional data, and exploits topological properties of the learned manifold to efficiently build probabilistic models. A joint distribution is partitioned into a marginal distribution on the manifold and conditional distributions on normal spaces of the manifold. The marginal distribution can be estimated using Riemannian kernels, and the conditional distributions can be estimated discretely by normal-bundle bootstrap or continuously using Gaussian kernels. Combining the marginal and conditional models gives a joint generative model. I will also talk about related algorithms and software development, and potential applications.

*Lecture: Competition and Spreading of Low and High-Quality Information in Online Social Networks*

**November 4, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Diego Fregolente, SAMSI

### Abstract

The advent of online social networks as major communication platforms for the exchange of information and opinions is having a significant impact on our lives by facilitating the sharing of ideas. Through networks such as Twitter and Facebook, users are exposed daily to a large number of transmissible pieces of information that compete to attain success. Such information flows have increasingly consequential implications for politics and policy, making the questions of discrimination and diversity more important in today’s online information networks than ever before. However, while one would expect the best ideas to prevail, empirical evidence suggests that high-quality information has no competitive advantage. We investigate this puzzling lack of discriminative power through an agent-based model that incorporates behavioral limitations in managing a heavy flow of information and measures the relationship between the quality of an idea and its likelihood to become prevalent at the system level. We show that both information overload and limited attention contribute to a degradation in the system’s discriminative power. A good tradeoff between discriminative power and diversity of information is possible according to the model. However, calibration with empirical data characterizing information load and finite attention in real social media reveals a weak correlation between quality and popularity of information. In these realistic conditions, the model provides an interpretation for the high volume of viral misinformation we observe online.

*Lecture: Retrospective Causal Inference via Matrix Completion, with an Evaluation of the Effect of European Integration on Labour Market Outcomes*

**November 11, 2020, 1:15pm – 2:15pm**

Virtual

**Speaker:** Jason Poulos, SAMSI

### Abstract

We propose a method of *retrospective* counterfactual prediction in panel data settings with units exposed to treatment after an initial time period (later-treated), and always-treated units, but no never-treated units. We invert the standard setting by using the observed post-treatment outcomes to predict the counterfactual pre-treatment potential outcomes under treatment for the later-treated units. We impute the missing outcomes via a matrix completion estimator with a propensity- and elapsed-time weighted objective function that corrects for differences in the covariate distributions and elapsed time since treatment between groups. Our methodology is motivated by evaluating the effect of two milestones of European integration on the share of cross-border workers in sending border regions. We provide evidence that opening the border increased the probability of working beyond the border in Eastern European regions.