PhD Dissertations (Statistics), Title and Abstract
2007 |
Student: |
Wei Liu (2007) |
Adviser: |
Wolfgang Polonik |
Title: |
Statistical Network Comparison |
Abstract: |
The study of dynamical random networks (graphs) has attracted considerable attention in recent years. Statistical inference is challenging in this context because, in general, only a very small number of observed networks is available. An important statistical problem, considered in this research, is to assess topological dissimilarities between networks.
The proposed approach assesses topological dissimilarities between networks indirectly. The structure of the given networks is destroyed by adding noise (this process is called "scrambling"). The amount of noise necessary in order to make the topologies of the scrambled networks statistically indistinguishable is used as a dissimilarity measure.
To follow this approach one has to decide on its basic ingredients, such as the way to introduce noise, the way to measure the amount of noise, and the test statistic for comparing topologies of the scrambled networks. Three scrambling methods are proposed that, to a certain extent, allow control of the level of scrambling imposed on a network. Topologies of networks are compared via the spectral distributions of their (standardized) adjacency matrices. In fact, moments of these spectral distributions are utilized for testing purposes. This is motivated by a recent result of Bai and Yao (1), who derive a functional central limit theorem for an empirical spectral process based on Wigner matrices indexed by analytic functions. We have extended their results slightly to allow for constant diagonal elements in these matrices. This then allows the application of this result to (standardized) adjacency matrices of networks (graphs) without self-edges.
The proposed methodology is evaluated via simulation studies using model-based networks and is further applied to some protein-protein networks. |
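The ingredients named in this abstract (scrambling, standardized adjacency matrices, spectral moments) can be illustrated with a minimal sketch. It is not the dissertation's implementation. The edge-flipping scheme in `scramble`, the density-based scaling in `standardize`, and the helper `ring_graph` are all assumptions made for illustration, since the abstract does not specify the three scrambling methods. The k-th moment of the empirical spectral distribution is computed as tr(M^k)/n, which avoids an eigensolver.

```python
import random


def scramble(adj, p, rng):
    """Flip each potential edge (i < j) independently with probability p.
    (Hypothetical scrambling scheme; the abstract does not name its methods.)"""
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                out[i][j] = out[j][i] = 1 - out[i][j]
    return out


def standardize(adj):
    """Center off-diagonal entries at the edge density and scale them so the
    matrix resembles a Wigner matrix with a constant (zero) diagonal."""
    n = len(adj)
    n_pairs = n * (n - 1) / 2
    density = sum(adj[i][j] for i in range(n) for j in range(i + 1, n)) / n_pairs
    if density in (0.0, 1.0):
        raise ValueError("empty or complete graph cannot be standardized")
    scale = (density * (1 - density) * n) ** 0.5
    return [[0.0 if i == j else (adj[i][j] - density) / scale for j in range(n)]
            for i in range(n)]


def spectral_moments(mat, k_max):
    """Return [tr(M^k) / n for k = 1..k_max], the moments of the empirical
    spectral distribution, computed from matrix powers."""
    n = len(mat)
    power = [row[:] for row in mat]
    moments = []
    for _ in range(k_max):
        moments.append(sum(power[i][i] for i in range(n)) / n)
        power = [[sum(power[i][l] * mat[l][j] for l in range(n)) for j in range(n)]
                 for i in range(n)]
    return moments


def ring_graph(n):
    """Adjacency matrix of a cycle on n nodes (a toy, highly structured network)."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        adj[i][j] = adj[j][i] = 1
    return adj


# Spectral moments of a ring network at increasing (assumed) scrambling levels.
rng = random.Random(0)
ring = ring_graph(20)
for p in (0.0, 0.1, 0.3):
    noisy = scramble(ring, p, rng)
    print(p, [round(m, 3) for m in spectral_moments(standardize(noisy), 4)])
```

With this standardization the first moment is zero and the second equals (n - 1)/n for any graph, so differences between topologies show up only in the third and higher moments. A dissimilarity measure in the spirit of the abstract would then look for the smallest scrambling level at which a test based on these moments no longer separates two networks.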
Reference: |
(1) Bai, Z. and Yao, J. On the convergence of the spectral empirical process of Wigner matrices. Bernoulli 11, 1059-1092 (2005). |
|
|
Student: |
Candace Metoyer (2007) |
Adviser: |
Prabir Burman |
Title: |
Estimation Methods for Linear, Nonlinear, and Multidimensional Time Series: Applications of State-Space Modeling |
Abstract: |
Burman and Shumway (2004) use penalized least-squares to generate estimates for the trend-only linear time series model, Y(t) = T(t) + e(t), where T(t) is called the trend and e(t) is random error. We extend their approach and apply it to the trend plus seasonal linear time series model, Y(t) = T(t) + S(t) + e(t), where S(t) is called the seasonal. We assume that the d-th order trend differences are iid random variables and we assume that the p-th order seasonal sums are iid random variables. Using penalized least-squares, we obtain closed-form expressions for the trend and seasonal estimators. Next, we generalize this method further and consider the class of time series where the distribution of the observation is a member of the exponential family of distributions. We focus on Poisson and Bernoulli time series problems and present an estimation procedure based on the penalized log-likelihood. Last, we consider the class of time series where the observation is a column vector of length M. In this scenario, our first task is dimension reduction. Using a principal components analysis, we reduce the effective dimension from M to m < M, which gives rise to a type of co-integration model. We provide heuristic asymptotic results for all of the estimators and we present applications to real data.
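For the trend-only model Y(t) = T(t) + e(t), one common penalized least-squares formulation minimizes sum_t (Y(t) - T(t))^2 + lam * sum_t (d-th difference of T at t)^2, which has the closed-form solution T_hat = (I + lam * D'D)^{-1} Y, where D is the d-th order differencing matrix. The sketch below implements that formulation. It is an assumption that this matches the estimator used in the dissertation; the smoothing parameter `lam` and the helper names are hypothetical.

```python
def difference_matrix(n, d):
    """(n - d) x n matrix D such that D applied to x gives its d-th order differences."""
    rows = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(d):
        rows = [[b - a for a, b in zip(rows[i], rows[i + 1])]
                for i in range(len(rows) - 1)]
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def penalized_trend(y, d=2, lam=10.0):
    """Closed-form penalized least-squares trend: (I + lam * D'D)^{-1} y.
    lam is an assumed, user-chosen smoothing parameter."""
    n = len(y)
    dmat = difference_matrix(n, d)
    system = [[(1.0 if i == j else 0.0) + lam * sum(row[i] * row[j] for row in dmat)
               for j in range(n)] for i in range(n)]
    return solve(system, list(y))


# Toy series: a linear trend plus a deterministic zig-zag disturbance.
y = [0.5 * t + (1.0 if t % 2 == 0 else -1.0) for t in range(12)]
print([round(v, 2) for v in penalized_trend(y, d=2, lam=10.0)])
```

Larger values of `lam` force the d-th differences of the fitted trend toward zero; with d = 2 an exactly linear series is reproduced unchanged for any `lam`. The trend plus seasonal model would add a second penalty on the p-th order seasonal sums, which is not shown here.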
|
|
|
Student: |
Lu Wang (2007) |
Adviser: |
Rudy Beran |
Title: |
Penalization and Rank Reduction |
Abstract: |
The Penalized Total Least Squares estimator is based on two well-known least squares estimators for an unknown response surface observed with additive noise: the Penalized Least Squares estimator and the Total Least Squares estimator. We begin by formulating the estimation problem as the rank-constrained minimization of a penalized least squares criterion, which yields the Penalized Total Least Squares estimator and leads us to consider further classes of candidate estimators for the unknown means in order to achieve lower risk. Adaptation selects the estimator within a candidate class that minimizes the estimated risk, which is an unbiased estimator of the risk function. Under the model assumptions, such adaptive estimators minimize risk asymptotically over the class of candidate estimators as the number of rows of the matrix tends to infinity. The Penalized Total Least Squares estimator is applied to both simulated and real data; in both cases it outperforms the traditional method in the sense of achieving lower risk.
The $\theta$-separable estimator generalizes the idea of the penalized least squares estimator to a broader class. This section deals with the following approach for estimating the mean $m$ of an $n$-dimensional random vector $x$. First, a family $\{A(\theta): \theta \in \Theta\}$ of $n \times n$ matrices is defined. These so-called $\theta$-separable matrices depend on a $p \times 1$ unknown parameter vector $\theta$ and have special structure in their eigenvalues and eigenvectors. Examples of such estimators include the ANOVA model, ridge regression, and the multiple shrinkage estimator. Then, James-Stein estimation is introduced as minimization of the risk function: an element $A(\tilde{\theta})$, $\tilde{\theta} \in \Theta$, is selected by minimizing the $L_2$ risk function. Because the risk function involves the unknown parameter $m$ and the noise variance, $A(\hat{\theta})$, $\hat{\theta} \in \Theta$, is instead selected by minimizing the estimated risk function, which is a uniformly consistent estimator of the risk function. Selecting estimators by minimizing the estimated risk is also known as the Mallows $C_L$ procedure. Generalized cross-validation methods are also introduced.
The two methods are compared both asymptotically and by numerical experiments. |
|
|
Student: |
Jingjing Ye (2007) |
Adviser: |
David Rocke |
Title: |
Preprocessing and Biomarker Detection Analysis for Biological Mass Spectrometry Data |
Abstract: |
Biomarker detection using mass spectrometry has been billed as having high potential to improve public health. It also presents considerable challenges for statistical analysis: the data are high-dimensional, file sizes are massive, and the spectra are noisy and complex. In this dissertation, I propose methods of preprocessing the spectral data to overcome these difficulties and extract the valuable information contained in mass spectrometry data.
In this talk, we propose a five-step preprocessing algorithm developed for mass spectrometry M/I data. The algorithm consists of imputation of missing intensities, normalization, integration of fractions, transformation, and selection of potential biomarkers. The five-step preprocessing of the M/I spectra is carried out on mass spectrometry glycomics data, an emerging research area for detecting biomarkers. The proposed imputation retains information similar to that of the raw spectrum, and the selection of biomarkers based on statistical models is explored. The algorithm is applied to prostate and ovarian cancer glycomics data, with the selection of biomarkers incorporated in cross-validation for evaluation. With low misclassification error rates, good precision, and visually and clinically confirmed oligosaccharides detected in the process, we conclude that the five-step M/I spectrum algorithm is a good choice for preprocessing and conducting differential expression analysis on mass spectrometry data.
Moreover, methods that linearly combine selected potential biomarkers to achieve better classification are proposed. We investigate a non-parametric approach of maximizing the area under the curve with constrained threshold gradient descent regularization (TGDR-AUC) on the mass spectrometry ovarian glycomics data. Simulations of the method are conducted, and an asymptotic approximation for the parameters is proved. In the ovarian cancer application, TGDR-AUC is shown to have superior classification performance in both the few-biomarker, large-sample and the many-biomarker, small-sample scenarios. The method can detect clinical biomarkers, which are confirmed to be oligosaccharides, and provides the flexibility of a built-in dimension reduction technique.
The talk walks mass spectrometry users step by step through preprocessing procedures and biomarker detection methods based on the data. Our proposed methods serve the purpose of preprocessing M/I spectra and performing differential expression analysis with respect to disease outcome. Thus, the methods are competitive in the analysis of biological mass spectrometry data and, if implemented in software, will be available for mass spectrometry users to conduct their analyses. |
|
|
Student: |
Shuying Zhu (2007) |
Adviser: |
Rudy Beran |
Title: |
Bootstrap Methods with Applications in Multivariate Analysis |
Abstract: |
This study is concerned with certain classical methods in hypothesis testing and construction of simultaneous confidence sets in multivariate linear analysis. Three approaches in hypothesis testing are proposed: asymptotic, bootstrap, and prepivoting methods. The performance of the asymptotic method depends strongly on the availability of the asymptotic expansion. The asymptotic test statistic is first order correct; that is, its coverage differs from the nominal level by. Although certain classical refinements to asymptotic tests have been shown to be second order accurate, they are too cumbersome for an analytical approach. Examples include Yao's (1965) approximate degrees of freedom method for the multivariate Behrens-Fisher problem and the multivariate Bartlett adjustment to the chi-squared asymptotics. The analytical difficulties lie in computing the degrees of freedom and recovering the Bartlett factor. The bootstrap method, however, avoids such difficulties and can be approximated directly by a Monte Carlo algorithm. The principal aim of the present investigation is to compare the bootstrap method to the refinements of the asymptotic method in theory and in simulation. It is shown that the appropriate bootstrap test based on the Behrens-Fisher statistic is equivalent to James's first order asymptotic series and Yao's approximate degrees of freedom test, and that the appropriate bootstrap likelihood ratio test automatically accomplishes Bartlett's adjustment to the chi-squared asymptotics. In addition, prepivoting any test statistic before forming a bootstrap test reduces the order of the error in rejection probability. The prepivoting can be iterated.
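The Monte Carlo approximation of a bootstrap test mentioned above can be sketched as follows. This is a simplified illustration, not the dissertation's procedure: it treats the univariate special case of the Behrens-Fisher problem (two samples with possibly unequal variances), whereas the abstract concerns the multivariate case, and it does not implement prepivoting or the likelihood ratio test. The function names, the number of resamples `n_boot`, and the seed are assumptions.

```python
import random
from statistics import mean, variance


def welch_statistic(x, y):
    """Squared Behrens-Fisher (Welch) statistic for a difference in means."""
    se2 = variance(x) / len(x) + variance(y) / len(y)
    if se2 == 0:
        # Degenerate resample (all values equal); treat as an extreme value.
        return float("inf")
    return (mean(x) - mean(y)) ** 2 / se2


def bootstrap_pvalue(x, y, n_boot=999, seed=0):
    """Monte Carlo bootstrap p-value for H0: equal means, unequal variances allowed."""
    rng = random.Random(seed)
    observed = welch_statistic(x, y)
    # Center each sample at its own mean so that resampling mimics H0.
    mx, my = mean(x), mean(y)
    x0 = [v - mx for v in x]
    y0 = [v - my for v in y]
    exceed = 0
    for _ in range(n_boot):
        xb = rng.choices(x0, k=len(x0))  # resample with replacement
        yb = rng.choices(y0, k=len(y0))
        if welch_statistic(xb, yb) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic as one of the resamples.
    return (exceed + 1) / (n_boot + 1)


x = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]
y = [6.2, 9.8, 3.1, 8.7, 7.5, 10.4, 5.0, 9.1, 6.6, 8.3]
print(bootstrap_pvalue(x, y))
```

Resampling each group separately from its own centered sample preserves the unequal variances while imposing the null hypothesis, which is why no degrees-of-freedom approximation has to be computed analytically.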
The problem of constructing simultaneous confidence sets in multivariate linear analysis is also considered. When the normality assumption is not satisfied, classical methods such as the pivotal method are too difficult for an analytical approach. One way to address this problem is to employ the nonparametric bootstrap method that underlies Beran's (1988) bootstrapped roots method. Under stringent conditions, it is shown that the bootstrapped roots method overcomes distributional difficulties and generates simultaneous confidence sets such that the overall coverage probability is correct and the coverage probabilities of the individual confidence sets are equal, in both multivariate regression and multivariate analysis of variance. For the special case in multivariate analysis of variance where normality holds, the projection method is proposed. It is shown that the projection method is suitable not only for the balanced complete layout but also for the unbalanced complete layout. |
|
|
2006 |
Student: |
Jimin Ding (2006) |
Adviser: |
Jane-Ling Wang |
Title: |
Joint Modelling of Survival and Longitudinal Data |
Abstract: |
In clinical studies, longitudinal covariates are often used to monitor the progression of the disease as well as survival time. The relationship between a failure time process and some longitudinal covariates is of key interest, as is understanding the pattern of the longitudinal process in order to learn more about the health status of patients or to gain insight into the progression of disease. Joint modelling of the longitudinal and survival data has certain advantages and has emerged as an effective way for each component to gain information from the other. Typically, a parametric longitudinal model is assumed to facilitate the likelihood approach. However, the choice of a proper parametric model turns out to be more elusive than in standard longitudinal studies where no survival end-point occurs. Furthermore, the computational burden due to both Monte Carlo numerical integration and the EM (Expectation-Maximization) algorithm is an important concern in the joint modelling setting.
To deal with these challenges, in the first part of the talk, I will propose several nonparametric longitudinal models in the joint modelling setting. The longitudinal process is represented by basis functions, and a proportional hazards model is then used to link it with the event time.
Unknown model parameters are estimated by maximizing the observed joint likelihood, which is iteratively maximized by the Monte Carlo Expectation-Maximization (MCEM) algorithm. The simplicity of the model structure is crucial for good numerical stability, and so the parsimonious nonparametric models have computational advantages and compare well to competing parametric longitudinal approaches. In the second part of the talk, I will introduce the method of sieves for joint modelling to illustrate the high dimensionality problem currently encountered in the joint modelling literature. The asymptotic properties of the proposed sieve estimator will be discussed. |
|
|
Student: |
Joshua Kerr (2006) |
Adviser: |
Robert Shumway / Wolfgang Polonik |
Title: |
Signal Extraction for Seismic Array Data Via Partially Linear Least-Squares |
Abstract: |
Signal extraction has long been an important field, and for good reason. Upon receiving a seismic reading at a site, the goal is to extract the signal of interest from the noise-polluted reading obtained. This is in itself a daunting task, and it is further complicated when multiple signals of interest are embedded within the noisy reading.
Background will be given, followed by techniques to estimate how many signals are present and to estimate the velocity and azimuth of each. Asymptotics are developed to provide consistency and distributional results for the parameters of interest. Finally, a data example will be shown, followed by some summary remarks. |
Reference: |
Pollard, David and Radchenko, Peter. Nonlinear Least-Squares Estimation. Journal of Multivariate Analysis. In Press.
R.H. Shumway. On Detecting a Signal in N Stationarily Correlated Noise Series. Technometrics, 13:499-519, 1971.
C.F. Wu. Asymptotic Theory of Nonlinear Least Squares Estimation. Annals of Statistics. 9:501-513, 1981. |
|
|
Student: |
Shanmei Liao (2006) |
Adviser: |
Rudy Beran |
Title: |
Application of Bootstrap Confidence Region for multivariate analysis |
Abstract: |
Bootstrap confidence regions are applied to two multivariate studies in this article. The first concerns population covariance matrices, where two problems are considered: 1) A set of bootstrap confidence regions generated for each component of a covariance matrix may not induce a confidence region of positive definite covariance matrices. 2) Besides controlling the overall coverage probability of the confidence region, it is desirable to keep equal the coverage probabilities of the individual confidence intervals that define the simultaneous region. Unconstrained parameterizations for covariance matrices are used to ensure the positive definiteness of the covariance matrix estimators. Bootstrap simultaneous confidence regions are generated to balance the coverage probability of each component in the matrix. As an application, these confidence regions are used to test assumptions on the structure of a covariance matrix.
The second study is an application to camera calibration models. A simple and flexible model was given by Zhang (1998) as a new technique to estimate intrinsic and extrinsic camera parameters, but the accuracy of these estimators was not investigated in his paper. The numerical algorithms used in Zhang's procedure are refined, and both parametric and nonparametric bootstrap methods are applied to obtain simultaneous bootstrap confidence regions for the parameters, so that tests on these parameters can be carried out. |
|
|
Student: |
Nan Zhang (2006) |
Adviser: |
Hans-Georg Müller |
Title: |
Functional Data Analysis for Non-Gaussian Longitudinal Data |
Abstract: |
We propose a nonparametric method to perform functional principal components analysis for the case of non-Gaussian longitudinal data, assuming the underlying process is hidden or unobservable.
In this framework, we deal with a sample of curves which give rise to noisy non-Gaussian repeated measurements, such as Poisson counts or Binomial data. The measurements for each subject are assumed to be determined by a subject-specific smooth random trajectory plus measurement errors. A link function relates subject-specific trajectories to an underlying latent Gaussian process and is modelled by an eigenfunction expansion with random coefficients. Basic elements of our approach are the estimation of the covariance structure and mean function of the latent Gaussian process, the estimation of the overdispersion parameter and the estimation of the variance of the measurement errors. The eigenfunction basis is estimated from the data, and functional principal component score estimates are obtained by maximizing the quasi-likelihood. A key step is the derivation of asymptotic consistency and distribution results under mild conditions, using tools from functional analysis. We develop a model selection technique, functional Akaike information criterion, to choose the number of principal components for the eigenfunction expansion.
The proposed framework is compared by simulation studies to other approaches, including Character Process Models, Cubic B-spline Models, and functional principal components analysis through the conditional expectation approach. Finally, the proposed approach is illustrated with French-Canadian fertility data, medfly egg-laying data, and rat learning behavior data.