## PhD Dissertation Abstracts: 2014

## Statistics PhD Alumni 2014:

**Gabriel Becker**, **Apratim Ganguly**, **Jinjiang He**, **Erin Melcon**, **Irina Udaltsova**, **Ka Wai Wong**, **Cong Xu**, **Xiaoke Zhang**

### Gabriel Becker (2014)

ADVISER: Duncan Temple Lang

TITLE: **Rethinking dynamic documents for data analytic research**

ABSTRACT: The need for verifiability in computational results has been highlighted by a number of recent failures to reproduce published data analytic findings. Most efforts to ensure reproducibility involve providing guarantees that reported results can be generated from the data via the reported methods, with a popular avenue being dynamic documents. This assurance is necessary but not sufficient for full validation; inappropriately chosen methods will reproduce questionable results.

I will present the concept of comprehensive dynamic documents. These documents represent the full breadth of an analyst's work during computational research, including code and text describing: intermediate and exploratory computations, alternate methods, and even ideas the analyst had which were not fully pursued. Furthermore, additional information can be embedded in the documents such as data provenance, experimental design, or details of the computing system on which the work was originally performed.

These comprehensive documents act as databases, encompassing both the work that the analyst has performed and the relationships among specific pieces of that work. This allows us to investigate research in a number of ways that are difficult or impossible to achieve given only a description of the final strategy. First, we can explore the choice of methods and whether due diligence was performed during an analysis. Second, we can compare alternative strategies either side-by-side or interactively. Finally, we can treat these complex documents as data about the research process and analyze them programmatically.

I have developed a proof-of-concept set of software tools for working with comprehensive dynamic documents. This includes conceptual models for representing and processing the documents, an R package that implements those models, and an extension of the IPython Notebook platform that allows users to author and interactively view them.

### Apratim Ganguly (2014)

ADVISER: Wolfgang Polonik

TITLE: **Applications and Theoretical Properties of Local Geometry Based Structure Learning Methods in Gaussian Graphical Models**

ABSTRACT: Zero entries in the inverse covariance matrix of a multivariate Gaussian random vector are in one-to-one correspondence with conditional independence of the corresponding pairs of components. A challenging aspect of sparse structure learning is the well-known "small n, large p" scenario. Several algorithms have been proposed to solve the problem. Neighborhood selection using the lasso (Meinshausen-Bühlmann), the block-coordinate descent algorithm for estimating the covariance matrix (Banerjee et al.), and the Graphical LASSO (Tibshirani et al.) are some of the most popular.
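The correspondence between zeros in the inverse covariance matrix and conditional independence can be illustrated with a small numerical sketch. The three-variable chain below, the matrix values, and the helper name `partial_correlations` are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np


def partial_correlations(precision):
    """Partial correlations implied by a precision (inverse covariance) matrix.

    Uses the standard identity rho_ij = -theta_ij / sqrt(theta_ii * theta_jj).
    """
    d = np.sqrt(np.diag(precision))
    pc = -precision / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc


# Hypothetical chain X1 - X2 - X3: the (1, 3) entry of the precision matrix is zero.
precision = np.array([
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
])
covariance = np.linalg.inv(precision)
partial = partial_correlations(precision)

# X1 and X3 are marginally correlated (nonzero covariance entry) ...
print("cov(X1, X3)          =", covariance[0, 2])
# ... but conditionally independent given X2 (zero partial correlation).
print("pcorr(X1, X3 | rest) =", partial[0, 2])
```

In this toy example the covariance entry for the pair (X1, X3) is nonzero while the partial correlation vanishes, which is the sparsity pattern that structure learning methods try to recover.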

In the first part of this talk, I will present an alternative approach for Gaussian graphical models on manifolds where spatial information is judiciously incorporated into the estimation procedure. Honorio et al. (2009) proposed an extension of the coordinate descent approach, called the “coordinate direction descent approach”, which incorporates the local constancy property of spatial neighbors. However, its theoretical properties (such as consistency, sign consistency, and selection of regularization parameters) are not addressed. I shall propose an algorithm to deal with local geometry in Gaussian graphical models. It extends Meinshausen-Bühlmann's idea of successive regression by using a fused lasso penalty. Neighborhood information is used in the penalty term, and we call it the neighborhood-fused lasso algorithm. I will show by simulation and prove theoretically the asymptotic model selection consistency of the proposed method, and will establish faster convergence to the ground truth than the standard rates when the assumption of local constancy holds. This modification has numerous practical applications, e.g., analyzing MRI data or 2-dimensional spatial manifold data in order to study the spatial behavior of the human brain or of moving objects.

In the second part of my talk, I will briefly discuss smoothing techniques on Riemannian manifolds. Estimation of a smoothed diffusion tensor from Diffusion Weighted Magnetic Resonance Images (DW-MRI or DWI) of the human brain is usually a two-step procedure, the first step being a regression (linear/non-linear) and the second step being a smoothing (isotropic/anisotropic). In this project, I extended the smoothing ideas on Euclidean space to non-Euclidean space by running a conjugate gradient algorithm on the manifold of positive definite matrices. This method shows empirical evidence of better performance than the two-step method of smoothing.

### Jinjiang He (2014)

ADVISER: Jane-Ling Wang

TITLE: **Functional Correlations to Quantify Functional Connectivity in Brain Imaging**

ABSTRACT: Quantifying time series similarity between pairs of voxels is central to the study of functional connectivity from blood oxygen level dependent (BOLD) functional magnetic resonance imaging (fMRI). Temporal Pearson correlation is frequently used to measure time series similarity, but it ignores statistical dependencies between time points in a time series, and in some situations leads to a biased group-level correlation measure.

We consider BOLD fMRI time series at each voxel of a subject as time-discretized observations of a random function, and employ a functional data approach to evaluate functional connectivity.

Two corrections for Pearson correlation are proposed in this thesis. The first calculates the Pearson correlation across subjects in a group at each fixed time point and then views these correlations over time, resulting in a correlation function over time. A simple summary, obtained by integrating these correlations over the time period and normalizing by the length of the time interval, is termed integrated correlation (IC) and provides a functional correlation measure. Root-n consistency of IC is established and its numerical performance is evaluated through simulations. Analysis based on resting state fMRI scans of 231 elderly individuals suggests that IC provides superior sensitivity to the effects of age on functional connectivity in cognitively normal elders.
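The verbal description of integrated correlation can be sketched in code as follows. This is a minimal illustration that assumes two voxels observed on a common time grid for every subject, a trapezoidal rule for the integral, and division by the interval length as the normalization; the function name `integrated_correlation` and the simulated data are hypothetical and not from the thesis.

```python
import numpy as np


def integrated_correlation(x, y, t):
    """Integrated correlation between two voxels.

    x, y : arrays of shape (n_subjects, n_times), one curve per subject.
    t    : increasing time grid of length n_times.
    """
    # Pearson correlation across subjects at each fixed time point.
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    r = (xc * yc).sum(axis=0) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0))
    # Trapezoidal integral of r(t), divided by the length of the time interval.
    area = np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(t))
    return area / (t[-1] - t[0])


# Simulated example: 30 subjects, 50 time points (illustrative data only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
x = rng.standard_normal((30, t.size))
noise = rng.standard_normal((30, t.size))

ic_linear = integrated_correlation(x, 2.0 * x + 1.0, t)  # perfectly linearly related curves
ic_noisy = integrated_correlation(x, x + noise, t)       # weaker relationship
print(ic_linear, ic_noisy)
```

Because the pointwise correlation always lies in [-1, 1], so does this average; perfectly linearly related curves give an IC of 1 up to rounding.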

The second functional correlation measure, termed angular correlation, identifies significant differences between normal elders and Alzheimer's Disease patients. We conclude that the choice of correlation measure could have important practical implications for BOLD fMRI functional connectivity studies.

### Erin Melcon (2014)

ADVISER: Jiming Jiang

TITLE: **Penalty parameter selection in generalized linear models and linear mixed models**

ABSTRACT: Penalized likelihood methods such as lasso, adaptive lasso, and SCAD have been widely used in linear models. Selection of the penalty parameter is an important step in modeling with penalized techniques. Although methods for selecting it have been evaluated in linear models, generalized linear models and linear mixed models have not been explored as thoroughly.

This presentation will introduce a data-driven bootstrap (Empirical Optimal Selection, or EOS) approach for selecting the penalty parameter with a focus on model selection. We implement EOS on selecting the penalty parameter in the case of lasso and adaptive lasso. In generalized linear models we will introduce the method, show simulations comparing EOS to information criteria and cross validation, and give theoretical justification for this approach. We also consider a practical upper bound for the penalty parameter, with theoretical justification.

In linear mixed models, we use EOS with two different objective functions: the traditional log-likelihood approach (which requires an EM algorithm), and a predictive approach. In both of these cases, we compare selecting the penalty parameter with EOS to selection with information criteria. Theoretical justification for both objective functions, and a practical upper bound for the penalty parameter in the log-likelihood case, are given.

We also applied our technique to two datasets: the South African heart data (logistic regression) and the Yale infant data (a linear mixed model). For the South African data, we compare the final models using EOS and information criteria via the mean squared prediction error (MSPE). For the Yale infant data, we compare our results to those obtained by Ibrahim et al. (2011).

### Irina Udaltsova (2014)

ADVISERS: Jiming Jiang / Ethan Anderes

TITLE: **Bayesian Estimation of log(N>S) − log S**

ABSTRACT: In cosmology, the study of source populations is often conducted using the cumulative distribution of the number of sources detected at a given sensitivity. The resulting "log(N>S) - logS" distribution can be used to compare and evaluate theoretical models for source populations and their evolution. In practice, however, inferring properties of the source populations from cosmological observational data is complicated by the presence of detector-induced uncertainty and bias. This includes background contamination, uncertainty on both intensity and location of the observed sources, and, most challenging, the issue of non-detections or unobserved sources. Since the probability of a non-detection is a function of the unobserved flux, the missing data mechanism is non-ignorable. We present a computationally efficient Bayesian approach for inferring model parameters and the corrected log(N>S) - logS distribution for source populations. Our method extends existing work by allowing for joint estimation of both properties of the non-ignorable missing data process and the unknown number of unobserved sources. By correcting for the non-ignorable missing data mechanism and other detection phenomena, we are able to obtain corrected estimates of the flux distribution of partially observed source populations. We also present a procedure for examining the goodness-of-fit of our hierarchical Bayesian model and propose a novel approach for model selection in Bayesian settings.
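For orientation, the uncorrected empirical version of the cumulative source count N(>S) can be computed as in the sketch below. It only counts observed sources brighter than each threshold; it does not model background contamination, measurement uncertainty, or non-detections, which are the focus of the Bayesian approach described above. The function name `cumulative_counts` and the flux values are hypothetical.

```python
import numpy as np


def cumulative_counts(fluxes, thresholds):
    """Number of sources with flux strictly greater than each threshold S."""
    sorted_fluxes = np.sort(np.asarray(fluxes, dtype=float))
    # searchsorted(..., side="right") counts the fluxes <= S; the remainder are > S.
    at_or_below = np.searchsorted(sorted_fluxes, thresholds, side="right")
    return sorted_fluxes.size - at_or_below


# Illustrative fluxes in arbitrary units.
fluxes = np.array([1.0, 2.0, 3.0, 4.0])
thresholds = np.array([0.5, 2.0, 4.0])
counts = cumulative_counts(fluxes, thresholds)

# Take logarithms only where at least one source exceeds the threshold.
positive = counts > 0
log_s = np.log10(thresholds[positive])
log_n = np.log10(counts[positive])
print(counts, log_s, log_n)
```

Plotting `log_n` against `log_s` gives the naive empirical curve that a corrected estimate would be compared with.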

### Ka Wai (Raymond) Wong (2014)

ADVISER: Thomas Lee

TITLE: **Fiber Direction Estimation in Diffusion MRI**

ABSTRACT: The recent advancement of neuroimaging technology has generated a huge amount of brain imaging data. These images are not only large, but also complex. They carry the mission of understanding the incredibly complicated brain structures. The analysis of such data poses many challenging statistical problems that require both accurate modeling and fast algorithms.

In this talk, I will focus on diffusion magnetic resonance imaging (dMRI), which is an emerging medical imaging technique for probing anatomical architectures of biological samples. It is widely used to reconstruct white matter fiber tracts in brains. I will begin by introducing dMRI data and a technique called diffusion tensor imaging (DTI) that is commonly used for analyzing these data. Then I will explain why most existing DTI methods fail in regions where multiple fibers share the same voxel (a 3D imaging pixel). To overcome this issue, I will introduce a fundamentally different route to reconstructing crossing fibers. An important ingredient is a nonconventional spatial smoothing procedure that can be applied to improve the estimates of diffusion direction. Finally, I will illustrate the proposed technique on a real dMRI data set.

### Cong Xu (2014)

ADVISER: Jane-Ling Wang

TITLE: **Semiparametric analysis of incomplete survival data**

ABSTRACT: Doubly censored data, which are subject to both left- and right-censoring, are not uncommon in lifetime data. The censoring mechanism is different from the case with right-censoring only, and the situation is even more complicated when the data are clustered. In this talk, we consider a class of linear transformation models, which includes the proportional hazards and the proportional odds models as special cases, to model this kind of data. While the transformation models expand the horizon of survival models, they pose considerable computational challenges to the likelihood approach, especially for doubly censored clustered data. An effective EM algorithm is developed to overcome the computational difficulties, and it leads to stable nonparametric maximum likelihood estimates (NPMLEs). The NPMLEs are shown to be consistent and the estimates of the finite-dimensional parameters are semiparametric efficient. A computationally efficient method is proposed to estimate the standard errors of the parameter estimates. Simulation studies demonstrate that the proposed EM algorithm and standard error estimates perform well; both are then applied to a dataset from a Hepatitis B clinical study.

### Xiaoke Zhang (2014)

ADVISER: Jane-Ling Wang

TITLE: **A Unified Theory and A Time-Varying Additive Model for Functional and Longitudinal Data**

ABSTRACT: Functional data analysis (FDA), which deals with a sample of functions or curves, has gained increasing importance in modern data analysis due to the improved capability to record and store a vast amount of data and advances in scientific computing. Plenty of approaches have been proposed that have satisfactory theoretical and numerical performance for traditional functional data. However, many interesting theoretical problems in FDA remain unresolved, and new methods are still needed to capture the time-dynamic features of functional data.

In this talk I will primarily focus on the nonparametric estimation of the mean and covariance functions in FDA. We investigate local polynomial smoothers for both the mean and covariance functions with two weighting schemes: equal weight per observation, and equal weight per subject. We provide a comprehensive analysis of the asymptotic properties on a unified platform for all types of sampling plans, be they dense, sparse, or neither. The asymptotic theories are unified in two respects: (1) the number of observations per subject, Ni, can be fixed or random; (2) the magnitude of Ni is allowed to be of any order of the sample size. Based on the relative order of Ni to the sample size n, functional data are partitioned into three types: non-dense, dense, and ultra-dense functional data. The two weighting schemes are compared both theoretically and numerically. Finally, I will also give a brief introduction to a time-varying additive model which not only reduces the dimensionality but also captures the time-dynamic features of longitudinal data.
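The difference between the two schemes can be made concrete with a small sketch. It assumes that "equal weight per observation" gives every observation the weight 1 / (N1 + ... + Nn) and that "equal weight per subject" gives each observation of subject i the weight 1 / (n Ni); these normalizations, the function name `subject_weights`, and the example counts are assumptions made for illustration, and the smoothing step itself is omitted.

```python
import numpy as np


def subject_weights(counts, scheme):
    """Weight attached to each single observation of subject i.

    counts : number of observations per subject (Ni in the text).
    scheme : "obs" for equal weight per observation,
             "subj" for equal weight per subject.
    """
    counts = np.asarray(counts, dtype=float)
    if scheme == "obs":
        # Every observation counts the same, whichever subject it comes from.
        return np.full(counts.shape, 1.0 / counts.sum())
    if scheme == "subj":
        # Every subject contributes total weight 1/n, split over its Ni observations.
        return 1.0 / (counts.size * counts)
    raise ValueError("scheme must be 'obs' or 'subj'")


# Hypothetical sampling plan: three subjects with very different Ni.
counts = np.array([2, 10, 50])
w_obs = subject_weights(counts, "obs")
w_subj = subject_weights(counts, "subj")

# Total weight contributed by each subject under the two schemes.
print(counts * w_obs)
print(counts * w_subj)
```

Under the per-observation scheme, densely sampled subjects dominate the estimate, whereas the per-subject scheme gives every subject the same total influence.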