High-dimensional data are ubiquitous in modern statistics. My work in this area focuses on Bayesian methods for both variable selection and dimension reduction. In particular, we have developed an R package, BayesPen, which performs both Bayesian variable selection and confounder selection. Attractive features of this approach are that it avoids MCMC in special cases, has excellent frequentist properties, and exploits interesting connections between Bayesian decision theory and penalized regression methods. For more information, please see the paper and the package.
Selected papers

Multi-site time series studies have reported evidence of an association between short-term exposure to particulate matter (PM) and adverse health effects, but the effect size varies across the United States. Variability in
the effect may partially be due to differing community level exposure and health characteristics, but also due to the
chemical composition of PM which is known to vary greatly by location and time. The objective of this paper is to
identify particularly harmful components of this chemical mixture. Because of the large number of highly correlated components, we must incorporate some regularization into a statistical model. We assume that, at each spatial location, the regression coefficients come from a mixture model with the flavor of stochastic search variable selection, but utilize a copula to share information about variable inclusion and effect magnitude across locations. The model
differs from current spatial variable selection techniques by accommodating both local and global variable selection.
The model is used to study the association between fine PM components, measured at 115 counties
nationally over the period 2000-2008, and cardiovascular emergency room admissions among Medicare patients.
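The copula idea can be sketched loosely as thresholding correlated latent normals, so nearby locations tend to agree on which components are included. Everything below (the exponential correlation, the number of locations, the 0.5 inclusion probability) is a hypothetical illustration, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch: a Gaussian copula shares variable-inclusion
# information across spatial locations by thresholding correlated
# latent normals.
S = 5                                    # number of spatial locations
coords = np.arange(S, dtype=float)
corr = np.exp(-np.abs(coords[:, None] - coords[None, :]))  # exponential spatial correlation
L = np.linalg.cholesky(corr)
z = L @ rng.normal(size=S)               # correlated latent normals, one per location

# Thresholding at 0 gives each location a marginal prior inclusion
# probability of 0.5 while preserving spatial dependence.
include = z < 0.0
print(include)
```

Because the latent normals are spatially correlated, inclusion decisions cluster in space rather than flipping independently from county to county.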



When estimating the effect of an exposure or treatment on an outcome it is important
to select the proper subset of confounding variables to include in the model.
Including too many covariates increases the mean squared error of the effect estimate of interest, while omitting confounding variables biases the exposure effect estimate. We propose a decision-theoretic approach to confounder selection and effect estimation. We first estimate the full standard Bayesian regression model and then post-process the
posterior distribution with a loss function that penalizes models omitting important
confounders. Our method can be fit easily with existing software and in many situations
without the use of Markov chain Monte Carlo methods, resulting in computation on the order
of the least squares solution. We prove that the proposed estimator has attractive
asymptotic properties. In a simulation study we show that our method outperforms
existing methods. We demonstrate our method by estimating the effect of fine
particulate matter (PM2.5) exposure on birth weight in Mecklenburg County, North Carolina.
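The post-processing idea can be sketched as follows: fit the full conjugate posterior, then sparsify its mean by minimizing a penalized posterior loss. The paper's actual loss weights the penalty to protect important confounders; the version below uses a plain L1 penalty purely to illustrate the mechanics, and all settings (sample size, penalty level, prior variances) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # exposure, confounder, noise
y = X @ beta_true + rng.normal(size=n)

# Step 1: full conjugate Bayesian regression (beta ~ N(0, tau2 I), known sigma2).
tau2, sigma2 = 100.0, 1.0
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)  # posterior covariance
bhat = Sigma @ X.T @ y / sigma2                              # posterior mean

# Step 2: post-process with an L1-penalized posterior loss,
#   L(beta) = 0.5 (beta - bhat)' Sigma^{-1} (beta - bhat) + lam * sum_j |beta_j|,
# solved by proximal gradient descent (ISTA).
A = np.linalg.inv(Sigma)
lam = 0.2 * n
step = 1.0 / np.linalg.eigvalsh(A).max()   # 1 / Lipschitz constant of the gradient
beta = bhat.copy()
for _ in range(1000):
    beta = beta - step * (A @ (beta - bhat))                       # gradient step
    beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)  # soft-threshold

print(np.round(beta, 2))
```

No sampling is needed once the posterior mean and covariance are in hand, which is why the computation can be on the order of the least squares solution.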



High-throughput screening (HTS) of environmental chemicals is used to identify chemicals with high potential for adverse human health and environmental effects from among the thousands of untested chemicals. Predicting physiologically relevant activity with HTS data requires estimating the response of a large number of chemicals across a battery of screening assays based on sparse dose-response data for each chemical-assay combination. Many standard dose-response methods are inadequate because they treat each curve separately and underperform when
there are as few as six to ten observations per curve. We propose a semiparametric Bayesian model that borrows
strength across chemicals and assays. Our method directly parametrizes the efficacy and potency of the chemicals as
well as the probability of response. We use the ToxCast data from the U.S. Environmental Protection Agency (EPA)
as motivation. We demonstrate that our hierarchical method provides more accurate estimates of the probability
of response, efficacy, and potency than separate curve estimation in a simulation study.
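The paper parametrizes efficacy and potency directly; its model is hierarchical and semiparametric, but the basic building block resembles a Hill-type dose-response curve. The sketch below is a generic Hill function, not the paper's exact parametrization:

```python
import numpy as np

def hill_response(dose, efficacy, ec50, slope=1.0):
    """Generic Hill dose-response curve: the response rises from 0 toward
    `efficacy` (the maximal effect) and reaches half-maximum at
    dose = ec50 (the potency parameter)."""
    d = np.asarray(dose, dtype=float)
    return efficacy * d**slope / (ec50**slope + d**slope)

# At the EC50 the response is exactly half the efficacy, by construction.
print(hill_response(1.0, efficacy=2.0, ec50=1.0))  # 1.0
```

A hierarchical version would place shared priors on (efficacy, ec50) across chemicals and assays, which is how strength is borrowed when each individual curve has only six to ten observations.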



For high-dimensional data, particularly when the number of predictors greatly exceeds the sample size, selection of relevant predictors for regression is a challenging problem. Bayesian variable selection methods place prior distributions on the parameters along with a prior over model space, or equivalently, a mixture prior on the parameters having mass at zero. Since exhaustive enumeration is not feasible, posterior model probabilities are often obtained via long MCMC runs. The chosen model can depend heavily on various choices for priors and also posterior thresholds. Alternatively, we propose a conjugate prior only on the full model parameters and use sparse solutions within posterior credible regions to perform selection. These posterior credible regions often have closed-form representations, and it is shown that these sparse solutions can be computed via existing algorithms. The approach is shown to outperform common methods in the high-dimensional setting, particularly under correlation. By searching for a sparse solution within a joint credible region, consistent model selection is established. Furthermore, it is shown that, under certain conditions, the use of
marginal credible intervals can give consistent selection up to the case where the dimension grows exponentially in the sample size. The proposed approach successfully accomplishes
variable selection in the high-dimensional setting, while avoiding pitfalls that plague typical Bayesian variable selection methods.
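The marginal-credible-interval variant of this idea is easy to sketch: fit the full conjugate model, then keep the predictors whose marginal credible intervals exclude zero. The illustration below (sample size, prior variance, signal strengths) is entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(size=n)

# Conjugate Bayesian linear regression: beta ~ N(0, tau2 I), known sigma2.
tau2, sigma2 = 10.0, 1.0
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
mean_post = Sigma_post @ X.T @ y / sigma2

# Marginal 95% credible intervals; select variables whose interval
# excludes zero.
sd = np.sqrt(np.diag(Sigma_post))
lower, upper = mean_post - 1.96 * sd, mean_post + 1.96 * sd
selected = np.where((lower > 0) | (upper < 0))[0]
print(selected)
```

Only one posterior fit is required, with no search over model space, which is the computational appeal of credible-region selection.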


Numerous studies have linked ambient air pollution to adverse health outcomes. Many such studies relate outdoor pollution levels measured at a few monitoring stations to health outcomes. Recently, computational methods have been developed to model the distribution of personal exposures, rather than ambient concentration, and then relate the exposure
distribution to the health outcome. While these methods show great promise, they are limited by the computational demands of the exposure model. In this paper we propose a method to alleviate these computational burdens with the eventual goal of implementing a national study of the health effects of air pollution exposure. Our approach is to develop a statistical
emulator for the exposure model. That is, we use Bayesian density estimation to predict the conditional exposure distribution as a function of several variables, such as temperature, human activity, and physical characteristics of the pollutant. This poses a challenging statistical
problem because there are many predictors of the exposure distribution and density estimation is notoriously difficult in high dimensions. To overcome this challenge, we use stochastic search variable selection to identify a subset of the variables that have more than just additive effects on the mean of the exposure distribution. We apply our method to emulate an ozone exposure model in Philadelphia. 


Dimension reduction is central to an analysis of data with many predictors. Sufficient dimension reduction aims to identify the smallest possible number of linear combinations of the predictors, called the sufficient predictors, that retain all of the information in the predictors about the response distribution. In this paper we propose a Bayesian solution
for sufficient dimension reduction. We directly model the response density in terms of the sufficient predictors using a finite mixture model. This approach is computationally
efficient and offers a unified framework to handle categorical predictors, missing predictors, and Bayesian variable selection. We illustrate the method using both a simulation study
and an analysis of an HIV data set. 


Physical activity has many well-documented health benefits for cardiovascular fitness and weight control. We consider one of the first studies of pregnant women that examines the impact of characteristics of the built environment on physical activity levels. Using a socioecologic framework, we study the associations between physical activity and several factors including personal characteristics, meteorological/air quality variables, and neighborhood characteristics for pregnant women in four counties of North Carolina. We simultaneously analyze six types of physical activity and investigate cross-dependencies between these activity types. Exploratory analysis suggests that the associations differ by region. Therefore we use a multivariate regression model with spatially varying regression coefficients. This model includes a regression parameter for each covariate at each spatial location. For our data with many predictors, some form of dimension reduction is clearly needed. We introduce a Bayesian variable selection procedure to identify subsets of important variables. Our stochastic search algorithm determines the probabilities that each covariate's effect is null, non-null but constant across space, or spatially varying. We found that individual-level covariates had a greater influence on women's activity levels than neighborhood environmental characteristics, and that some individual-level covariates had spatially varying associations with the activity levels of pregnant women.


With many predictors, choosing an appropriate subset of the covariates is a crucial, and difficult, step in nonparametric regression. We propose a Bayesian nonparametric regression model for curve-fitting and variable selection. We use the smoothing spline ANOVA framework to decompose the regression function into interpretable main effect and interaction functions. Stochastic search variable selection via MCMC sampling is used to search for models that fit the data well. Also, we show that variable selection is highly sensitive to hyperparameter choice and develop a technique to select hyperparameters that control the long-run false positive rate. The method is used to build an emulator for a complex computer model for two-phase fluid flow.


In this paper, a new method called the OSCAR (Octagonal Shrinkage and Clustering Algorithm for Regression) is proposed to simultaneously select variables and perform supervised clustering in the context of linear regression. The technique is
based on penalized least squares with a geometrically intuitive penalty function that, like the LASSO penalty, shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form clusters represented by a single coefficient. These resulting clusters can then be investigated further to discover what contributes to the group's similar behavior. OSCAR thus achieves sparsity in terms of the number of unique coefficients in the model. The proposed procedure is shown to compare favorably to existing shrinkage and variable selection techniques in terms of both prediction error and reduced model complexity.
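The OSCAR penalty combines an L1 term, which produces exact zeros, with a pairwise maximum term, which ties coefficient magnitudes together into clusters. A small sketch of the penalty itself follows; the particular weighting (two separate tuning constants) is an illustrative choice:

```python
import numpy as np

def oscar_penalty(beta, lam1, lam2):
    """OSCAR-style penalty: an L1 term (sparsity) plus a sum of pairwise
    maxima of absolute coefficients (encourages equal magnitudes, hence
    clustering of correlated predictors)."""
    a = np.abs(np.asarray(beta, dtype=float))
    l1 = a.sum()
    pairwise = sum(max(a[j], a[k])
                   for j in range(len(a)) for k in range(j + 1, len(a)))
    return lam1 * l1 + lam2 * pairwise

# L1 part: 1 + 1 + 0 = 2; pairwise maxima: 1 + 1 + 1 = 3; total = 5.
print(oscar_penalty([1.0, -1.0, 0.0], 1.0, 1.0))  # 5.0
```

Note the penalty is unchanged when two coefficients of equal magnitude swap signs or values, which is why minimizing it tends to set groups of correlated coefficients exactly equal in absolute value.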

