# 2010-11 Program on Analysis of Object Data

The 12-month SAMSI program focused on the analysis of complex data types, an extension of Functional Data Analysis in which one considers methods for analyzing data samples of complex objects. Modern science is generating a need to understand, and statistically analyze, populations of increasingly complex types. The term "Analysis of Object Data" (AOD) is intended to encompass a broad array of such methods. The program sought to bring together a diverse group of researchers (from statistics, other parts of mathematics, and related sciences) to explore the common structure that underlies such methodologies, and to use this knowledge in turn to motivate and synthesize new approaches.

*Organizing Committee:*

- **Program Leaders:** Hans-Georg Müller (University of California-Davis), Jane-Ling Wang (University of California-Davis), Ian Dryden (University of South Carolina), Jim Ramsay (McGill)
- **Local Scientific Coordinator:** Steve Marron (UNC-CH)
- **Directorate Liaison:** Nell Sedransk (NISS and SAMSI)
- **National Advisory Committee Liaison:** Jianqing Fan (Princeton)

### Research Foci

AOD extends the very active research area of Functional Data Analysis and generalizes the fundamental FDA concept of *curves as data points*, to the more general concept of *objects as data points*. Examples included images, shapes of objects in 3D, points on a manifold, tree structured objects, and various types of movies. Specific AOD contexts can be grouped in a number of interesting ways. A grouping of perhaps mathematical interest is considered first. This is in terms of the type of space in which the data objects lie:

- Euclidean, i.e., (constant length) vectors of real numbers.
- Mildly non-Euclidean, i.e., points on a manifold and shapes.
- Strongly non-Euclidean, i.e., tree or graph structured objects.

**Euclidean Objects**

Euclidean data objects are ubiquitous in a variety of AOD contexts. One focus was on *Functional Data Analysis* (FDA), viewing curves as data. These curves are commonly either simply digitized, or else decomposed by a basis expansion, which gives a vector that represents each data curve. Evolutionary biology and longitudinal applications were important drivers of the FDA and shape analysis considered in this program.
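As an illustration of how a basis expansion turns a curve into a vector, the following minimal sketch projects a digitized curve onto a small Fourier basis. The choice of basis, the equispaced grid on [0, 1), and the function names (`fourier_basis`, `curve_to_vector`, `vector_to_curve`) are assumptions made for this example; other bases, such as splines or wavelets, are equally common in FDA.

```python
import math


def fourier_basis(t, n_harmonics):
    """Evaluate 1, sqrt(2)*cos(2*pi*k*t), sqrt(2)*sin(2*pi*k*t) for k = 1..n_harmonics."""
    values = [1.0]
    for k in range(1, n_harmonics + 1):
        values.append(math.sqrt(2) * math.cos(2 * math.pi * k * t))
        values.append(math.sqrt(2) * math.sin(2 * math.pi * k * t))
    return values


def curve_to_vector(samples, n_harmonics):
    """Project a curve sampled at t = i/n (i = 0..n-1) onto the basis.

    On this grid the basis functions are orthonormal under the average
    (1/n) * sum, provided n_harmonics < n / 2, so each coefficient is
    simply an averaged inner product.
    """
    n = len(samples)
    n_basis = 2 * n_harmonics + 1
    coeffs = [0.0] * n_basis
    for i, y in enumerate(samples):
        basis = fourier_basis(i / n, n_harmonics)
        for j in range(n_basis):
            coeffs[j] += y * basis[j] / n
    return coeffs


def vector_to_curve(coeffs, grid):
    """Rebuild curve values on a grid from the coefficient vector."""
    n_harmonics = (len(coeffs) - 1) // 2
    return [
        sum(c * b for c, b in zip(coeffs, fourier_basis(t, n_harmonics)))
        for t in grid
    ]


# A made-up smooth curve, digitized at 64 equispaced points.
n = 64
grid = [i / n for i in range(n)]
curve = [
    3.0 + 2.0 * math.sin(2 * math.pi * t) - 0.5 * math.cos(4 * math.pi * t)
    for t in grid
]

coeffs = curve_to_vector(curve, n_harmonics=2)  # 5 numbers represent the curve
reconstruction = vector_to_curve(coeffs, grid)
max_error = max(abs(a - b) for a, b in zip(curve, reconstruction))
print(len(coeffs), max_error)
```

Each curve in a sample would be reduced to such a coefficient vector, after which standard multivariate methods can be applied to the vectors.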

A second focus was *Time Dynamics Data*, with an emphasis on differential equations and dynamic systems as drivers of fully or incompletely observed samples of stochastic processes. This also included point and marked point processes as data objects. Applications can be found in control, engineering, biological modeling of growth or cell kinetics, and in e-commerce, where the analysis of auction dynamics is of great interest. Further examples are repeated events in the social sciences, such as a woman's childbirths, and, in medical studies, the dynamics of HIV infection and of gene expression and its relation to gene networks.

**Mildly non-Euclidean Objects**

One research focus was *Shape Analysis and Manifold Data*, where for example 2 or 3 dimensional locations of a set of common landmarks are collected into vectors that represent shapes. While these vectors are just standard multivariate data, they frequently violate standard multivariate assumptions, such as the sample size being (usually much) larger than the dimension. Research on High Dimension Low Sample Size (HDLSS) issues was a major emphasis of the SAMSI program. In addition, the shape described by the landmarks should be invariant to certain transformations such as location, rotation and scale, and Kendall's shape analysis of such objects leads to non-Euclidean distances being the most natural. Further recent examples include analysis of shapes of unlabeled points, especially on curves, surfaces and images. The closely related manifold data are also based on non-Euclidean distances.
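To make the invariance to location, rotation and scale concrete, here is a minimal sketch for planar (2-D) landmarks, written with complex numbers. It uses the full Procrustes distance for planar configurations, a standard quantity in landmark shape analysis; the function names and the example triangles are made up for illustration.

```python
import cmath
import math


def preshape(landmarks):
    """Remove location and scale from 2-D landmarks given as complex numbers."""
    centroid = sum(landmarks) / len(landmarks)
    centered = [z - centroid for z in landmarks]
    size = math.sqrt(sum(abs(z) ** 2 for z in centered))
    return [z / size for z in centered]


def full_procrustes_distance(a, b):
    """Distance between two planar configurations modulo location, scale, rotation.

    For centered, unit-size configurations z and w the distance is
    sqrt(1 - |<z, w>|^2), where <z, w> is the complex inner product.
    Taking the modulus of the inner product removes the rotation.
    """
    za, zb = preshape(a), preshape(b)
    inner = sum(x * y.conjugate() for x, y in zip(za, zb))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


triangle = [0 + 0j, 1 + 0j, 0 + 1j]

# Same shape: scaled by 2.5, rotated by 0.7 radians, and translated.
moved = [2.5 * cmath.exp(0.7j) * z + (4 - 3j) for z in triangle]

# Different shape: one side stretched.
stretched = [0 + 0j, 2 + 0j, 0 + 1j]

d_same = full_procrustes_distance(triangle, moved)      # numerically zero
d_diff = full_procrustes_distance(triangle, stretched)  # clearly positive
print(d_same, d_diff)
```

The distance is not a Euclidean distance between the raw landmark vectors, which is the sense in which such shape data are naturally non-Euclidean.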

Data that naturally lie on a manifold have appeared in the statistical literature for some time in the form of directional data (data points that are circular or spherical angles), and they play an increasingly important role in the analysis of landmarks.

A second research focus was modern *Image Analysis*, that is, applications where the data consist of a sample of images. Such data can often be understood as being located on manifolds. These include medial representations for shape objects (involving a mix of real numbers and angles as parameters), diffusion tensor imaging (a branch of magnetic resonance imaging, which represents directionality of fluid flow using tensors), and diffeomorphisms (a powerful mathematical approach to studying warpings of space that addresses non-affine registration challenges). While manifold data present major statistical challenges (because most statistical methods are very Euclidean in nature), they are termed "mildly non-Euclidean" because manifolds admit tangent plane approximation, so that (at least when the data are sufficiently concentrated near the point of tangency) approximate Euclidean methods have been employed to good effect. A wide open research area, and a major focus of the SAMSI program, is the development of "intrinsic" methodologies, where the statistical analysis is carried out entirely inside the manifold, thus avoiding distortion problems for manifold data that are not concentrated in a small area.
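The tangent plane idea can be sketched on the simplest curved manifold, the unit sphere. The example below is an illustrative assumption, not a method taken from the program: it maps points into the tangent plane at a base point (log map), averages there, and maps back (exp map). Repeating this step with the base point updated each time gives a simple iteration for an intrinsic (Fréchet) mean. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def log_map(p, q):
    """Tangent vector at p pointing towards q, with length equal to the
    geodesic (great-circle) distance. Returns zero if q equals p; the
    antipodal case, where the direction is undefined, also returns zero."""
    cos_theta = max(-1.0, min(1.0, dot(p, q)))
    theta = math.acos(cos_theta)
    residual = [b - cos_theta * a for a, b in zip(p, q)]
    r = norm(residual)
    if r < 1e-12:
        return [0.0, 0.0, 0.0]
    return [theta * x / r for x in residual]


def exp_map(p, v):
    """Point reached by walking from p along the geodesic with velocity v."""
    length = norm(v)
    if length < 1e-12:
        return list(p)
    return [
        math.cos(length) * a + math.sin(length) * b / length
        for a, b in zip(p, v)
    ]


def tangent_plane_mean(points, base):
    """One tangent-plane step: average in the tangent plane at `base`."""
    logs = [log_map(base, q) for q in points]
    avg = [sum(c) / len(points) for c in zip(*logs)]
    return exp_map(base, avg)


def frechet_mean(points, n_iter=50):
    """Intrinsic mean by repeated tangent-plane averaging."""
    start = [sum(c) / len(points) for c in zip(*points)]
    scale = norm(start)
    mean = [x / scale for x in start]  # normalized Euclidean mean as a start
    for _ in range(n_iter):
        mean = tangent_plane_mean(points, mean)
    return mean


# Three made-up points on the equator at longitudes 0, 0.2 and 1.0 radians.
longitudes = [0.0, 0.2, 1.0]
points = [[math.cos(a), math.sin(a), 0.0] for a in longitudes]

intrinsic = frechet_mean(points)
intrinsic_longitude = math.atan2(intrinsic[1], intrinsic[0])

# Projecting the Euclidean average back to the sphere gives a different answer.
euclid = [sum(c) / len(points) for c in zip(*points)]
extrinsic_longitude = math.atan2(euclid[1], euclid[0])
print(intrinsic_longitude, extrinsic_longitude)
```

For these points the intrinsic mean sits at the average longitude, while the projected Euclidean average is pulled slightly away from it; the gap grows as the data spread out over the manifold.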

**Strongly non-Euclidean Objects**

Objects such as *Tree and Graph Structured Data* are "strongly non-Euclidean", because the data space admits no tangent plane approximation. Thus, there is no apparent approach to adapting even approximate Euclidean methodologies, and statistical analysis must be invented from the ground up. The first workable methodology of this type appears in Aydin et al. (2008). This field is still in its infancy, with large potential as a context for the development of new ideas, and it became another focus of the SAMSI program.

**Statistical and Mathematical Challenges**

The mathematical areas involved in AOD highlighted the potential synergies that are possible. These included:

- Statistics - this theme is common to all parts of the program. Statistics itself as a discipline stands to benefit through the invention of new ways of understanding statistical methods. A clear example of this is HDLSS asymptotics, which are anticipated to both inform, and be driven by, the methodological component of the program.
- Optimization - in most contexts above (especially manifold and tree structured data) statistical ideas result in optimization problems that can be very challenging to solve. This is anticipated to lead to the development of new ideas for addressing optimization problems. Furthermore, the SAMSI collaboration is intended to lead to a deeper interaction between statisticians and optimizers at all stages of the method development.
- Geometry - there are major geometric challenges, especially in the area of manifold data. The SAMSI program will seek to move beyond the current mode of "statisticians using geometric ideas", to serious collaboration between statisticians and geometers, again at all stages of method development, seeking connections with the emerging fields of computational topology and metric geometry.
- Probability - there were very early strong connections between statistics and probability that have languished somewhat recently. This program will provide an opportunity to renew this link between the areas. In particular, important open questions include the development of appropriate (e.g. "normal") probability distributions for data lying on manifolds, or for tree structured data.
- Differential Equations - as noted in Ramsay and Silverman (2002), there has already been strong application of differential equation ideas in FDA. Another important interface is that a very promising approach to the generation of "normal" distributions on exotic spaces is the heat diffusion equation approach. Finally, dynamical systems have become a very active research area in the modeling of biological and other temporal and spatio-temporal phenomena, and there exists a natural link with functional data analysis methodology that has not yet been explored. Developing this link will lead to better understanding of such systems and new directions for AOD.
- Topology - an emerging new statistical field is topological data analysis, which seeks to understand structure in very high dimensions, via reducing high dimensional density estimates to focus on informative topological aspects.

**Potential Applications**

The application areas emphasized depended upon the program participants themselves. The following list suggests potential areas of interest.

- Image Analysis has provided a number of driving problems for AOD. Modern images are frequently in 3-d, and the current research focus is on populations of images (as opposed to early challenges, such as denoising a single image). A central problem is registration, e.g. handling the problem that organs of interest will be in different locations across images. There are a variety of approaches to this, all of which involve AOD at some level. One approach is registration via diffeomorphisms (which themselves naturally lie in a manifold), and these can also be used to analyze population variation. Another is medial representations, which yield a different type of manifold data. Finally, Diffusion Tensor Imaging is naturally analyzed as yet another type of manifold data. A completely different type of AOD image data is trees as data, as discussed in Aydin et al. (2008), which are strongly non-Euclidean as noted above. One more challenging AOD data type comes from Functional Magnetic Resonance Imaging, where each data object is a movie (over time) of 3-d images.
- Bioinformatics data, including microarrays (for gene expression), SNP arrays, proteomics and metabolomics, provides another rich source of driving problems for AOD. While such data sets are typically Euclidean, severe challenges exist because of their HDLSS nature. Major challenges to be investigated during the SAMSI program include data fusion, where the goal is to extract joint information from several of these modalities at once.
- Evolutionary biology has recently actively engaged in FDA methodologies. Examples include the evolution of character traits that correspond to random functions or biodemographic trajectories of mortality, reproduction and other behaviors that are shaped by evolution. The SAMSI program aims to engage with this community, and extend the range of data types, while at the same time developing new methodologies, which can be used in other contexts.
- The emerging area of e-commerce and more generally econometrics has fairly recently made contact with AOD. The strongest connection has been in terms of full transcripts of online auction (eBay) bids being viewed as FDA data objects, or trajectories of box office receipts of movies after opening day, for example with the goal of predicting the overall receipts to be expected for a movie. The SAMSI program aims to carry this research forward, through increased contact with FDA researchers, and through exploring the application of advanced data structures, such as tree or graph structured objects, in this context.
- Psychiatry, psychology and social sciences also have strong connections with AOD. In particular, both autism and schizophrenia have been associated with sizes and shapes of a variety of brain structures. Longitudinal studies, often with irregular sampling designs, are common in the social sciences. In the presence of nonlinear structures, FDA methodology provides promising alternatives to classical parametric models with random effects. There are also often multivariate time courses, and the modeling of complex interactions between their components is then of interest. AOD provides an ideal framework and way of thinking about populations of objects of this type.

## Description of Activities

**Workshops:** The *Opening Workshop* was held September 12-15, 2010 at the Radisson RTP. This workshop aimed to engage as large a part of the statistics, mathematics, and relevant scientific communities as possible, with representative sessions from all of the main program topics. The *Transition Workshop* at the end of the program disseminated program results and charted a path for future research in the area. There were also mid-program workshops focused on each of the three key research areas mentioned above.

**Working Groups:** Working groups met throughout the program to pursue particular research topics identified in the kickoff workshop (or subsequently chosen by the working group participants). The working groups consisted of SAMSI visitors, postdoctoral fellows, graduate students, and local faculty and scientists.

## Further Information

Additional information about the program: send E-mail to [email protected]