Brian J. Reich
    Assistant Professor
    Department of Statistics
    North Carolina State University

Research on Variable Selection and Dimension Reduction

High-dimensional data are ubiquitous in modern statistics. My work in this area focuses on Bayesian methods for both variable selection and dimension reduction. In particular, we have developed an R package entitled BayesPen, which performs both Bayesian variable selection and confounder selection. Attractive features of this approach are that it avoids MCMC in special cases, has excellent frequentist properties, and exploits interesting connections between Bayesian decision theory and penalized regression methods. For more information, see the papers and the package.

Selected papers

Boehm Vock, Reich, Fuentes, Dominici (2015). Spatial variable selection methods for investigating acute health effects of fine particulate matter components. Biometrics.

Multi-site time series studies have reported evidence of an association between short-term exposure to particulate matter (PM) and adverse health effects, but the effect size varies across the United States. Variability in the effect may partially be due to differing community-level exposure and health characteristics, but also due to the chemical composition of PM, which is known to vary greatly by location and time. The objective of this paper is to identify particularly harmful components of this chemical mixture. Because of the large number of highly correlated components, we must incorporate some regularization into a statistical model. We assume that, at each spatial location, the regression coefficients come from a mixture model with the flavor of stochastic search variable selection, but utilize a copula to share information about variable inclusion and effect magnitude across locations. The model differs from current spatial variable selection techniques by accommodating both local and global variable selection. The model is used to study the association between fine PM components, measured at 115 counties nationally over the period 2000-2008, and cardiovascular emergency room admissions among Medicare patients.
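As a toy illustration of the copula idea, nearby locations can share a latent Gaussian field whose sign drives variable inclusion, so inclusion decisions are spatially correlated. All names, the exponential correlation function, and the threshold below are illustrative assumptions, not the paper's exact specification:

```python
import numpy as np

# Hypothetical sketch: spatially correlated variable-inclusion indicators
# via a Gaussian copula. Close locations share a latent Gaussian field;
# thresholding it yields correlated binary inclusion decisions.
rng = np.random.default_rng(0)

n_loc = 50                                   # spatial locations (e.g., counties)
coords = rng.uniform(0, 1, size=(n_loc, 2))  # site coordinates on the unit square

# Exponential spatial correlation for the latent copula field
dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
corr = np.exp(-dists / 0.3)

# Draw one latent Gaussian field for a single candidate PM component
L = np.linalg.cholesky(corr + 1e-8 * np.eye(n_loc))
z = L @ rng.standard_normal(n_loc)

# Prior inclusion probability 0.5 -> threshold at 0; nearby locations
# tend to include (or exclude) the component together.
include = z > 0.0
print("inclusion rate:", include.mean())
```

The copula construction keeps the marginal inclusion probability at each site while inducing spatial dependence through the latent correlation.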

Wilson, Reich (2014). Confounder selection via penalized credible regions. Biometrics.

When estimating the effect of an exposure or treatment on an outcome, it is important to select the proper subset of confounding variables to include in the model. Including too many covariates increases the mean squared error of the effect of interest, while omitting confounding variables biases the exposure effect estimate. We propose a decision-theoretic approach to confounder selection and effect estimation. We first estimate the full standard Bayesian regression model and then post-process the posterior distribution with a loss function that penalizes models omitting important confounders. Our method can be fit easily with existing software and in many situations without the use of Markov chain Monte Carlo methods, resulting in computation on the order of the least squares solution. We prove that the proposed estimator has attractive asymptotic properties. In a simulation study we show that our method outperforms existing methods. We demonstrate our method by estimating the effect of fine particulate matter (PM2.5) exposure on birth weight in Mecklenburg County, North Carolina.

Wilson, Reif, Reich (2014). Hierarchical dose-response modeling for high-throughput toxicity screening of environmental chemicals. Biometrics.

High-throughput screening (HTS) of environmental chemicals is used to identify chemicals with high potential for adverse human health and environmental effects from among the thousands of untested chemicals. Predicting physiologically-relevant activity with HTS data requires estimating the response of a large number of chemicals across a battery of screening assays based on sparse dose-response data for each chemical-assay combination. Many standard dose-response methods are inadequate because they treat each curve separately and under-perform when there are as few as six to ten observations per curve. We propose a semiparametric Bayesian model that borrows strength across chemicals and assays. Our method directly parametrizes the efficacy and potency of the chemicals as well as the probability of response. We use the ToxCast data from the U.S. Environmental Protection Agency (EPA) as motivation. We demonstrate that our hierarchical method provides more accurate estimates of the probability of response, efficacy, and potency than separate curve estimation in a simulation study.
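To make the efficacy/potency parametrization concrete, the sketch below uses a standard Hill dose-response curve, a common choice for HTS data, in which efficacy is the top asymptote and potency is the AC50 (the dose giving half-maximal response). This is an illustrative parametrization, not necessarily the paper's exact hierarchical model:

```python
import numpy as np

def hill_response(dose, efficacy, log10_ac50, slope=1.0):
    """Hill curve: response equals efficacy/2 when log10(dose) == log10_ac50."""
    return efficacy / (1.0 + 10.0 ** (slope * (log10_ac50 - np.log10(dose))))

# A sparse design of only a few doses per chemical-assay pair, as in HTS
doses = np.array([0.1, 1.0, 10.0, 100.0])
print(hill_response(doses, efficacy=80.0, log10_ac50=1.0))
```

With only four to six doses per curve, the individual parameters are weakly identified, which is what motivates borrowing strength across chemicals and assays in a hierarchical model.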

Bondell, Reich (2012). Consistent high-dimensional Bayesian variable selection via penalized credible regions. JASA.

For high-dimensional data, particularly when the number of predictors greatly exceeds the sample size, selection of relevant predictors for regression is a challenging problem. Bayesian variable selection methods place prior distributions on the parameters along with a prior over model space, or equivalently, a mixture prior on the parameters having mass at zero. Since exhaustive enumeration is not feasible, posterior model probabilities are often obtained via long MCMC runs. The chosen model can depend heavily on various choices for priors and also posterior thresholds. Alternatively, we propose a conjugate prior only on the full model parameters and use sparse solutions within posterior credible regions to perform selection. These posterior credible regions often have closed-form representations, and it is shown that these sparse solutions can be computed via existing algorithms. The approach is shown to outperform common methods in the high-dimensional setting, particularly under correlation. By searching for a sparse solution within a joint credible region, consistent model selection is established. Furthermore, it is shown that, under certain conditions, the use of marginal credible intervals can give consistent selection up to the case where the dimension grows exponentially in the sample size. The proposed approach successfully accomplishes variable selection in the high-dimensional setting, while avoiding pitfalls that plague typical Bayesian variable selection methods.
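A minimal sketch of the marginal-credible-interval variant of this idea: fit a conjugate ridge-type posterior on the full model, then select the variables whose marginal 95% credible intervals exclude zero. The prior and noise variances below are assumed known for simplicity; this is an illustration of the selection rule, not the paper's full joint-credible-region algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]           # three true signals
y = X @ beta_true + rng.standard_normal(n)

tau2, sigma2 = 10.0, 1.0                   # prior and noise variance (assumed)
prec = X.T @ X / sigma2 + np.eye(p) / tau2
cov = np.linalg.inv(prec)                  # posterior covariance
mean = cov @ X.T @ y / sigma2              # posterior mean (full model, no MCMC)

z = 1.96                                   # 95% marginal credible intervals
lo = mean - z * np.sqrt(np.diag(cov))
hi = mean + z * np.sqrt(np.diag(cov))
selected = (lo > 0) | (hi < 0)             # interval excludes zero
print("selected:", np.flatnonzero(selected))
```

Because the posterior here is available in closed form, selection costs little more than a least squares fit, which is the computational point emphasized in the paper.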

Reich, Kalendra, Storlie, Bondell, Fuentes (2012). Variable selection for high-dimensional Bayesian density estimation: Application to human exposure simulation. JRSS-C.

Numerous studies have linked ambient air pollution and adverse health outcomes. Many studies of this nature relate outdoor pollution levels measured at a few monitoring stations with health outcomes. Recently, computational methods have been developed to model the distribution of personal exposures, rather than ambient concentration, and then relate the exposure distribution to the health outcome. While these methods show great promise, they are limited by the computational demands of the exposure model. In this paper we propose a method to alleviate these computational burdens with the eventual goal of implementing a national study of the health effects of air pollution exposure. Our approach is to develop a statistical emulator for the exposure model. That is, we use Bayesian density estimation to predict the conditional exposure distribution as a function of several variables, such as temperature, human activity, and physical characteristics of the pollutant. This poses a challenging statistical problem because there are many predictors of the exposure distribution and density estimation is notoriously difficult in high dimensions. To overcome this challenge, we use stochastic search variable selection to identify a subset of the variables that have more than just additive effects on the mean of the exposure distribution. We apply our method to emulate an ozone exposure model in Philadelphia.
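The core of stochastic search variable selection can be sketched in a few lines: each coefficient has a two-component normal mixture prior, a narrow "spike" near zero and a wide "slab," and the search updates each inclusion indicator from the relative density of the coefficient under the two components. The standard deviations and prior probability below are illustrative values, not those used in the paper:

```python
import math

def normal_pdf(x, sd):
    """Density of a mean-zero normal with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def inclusion_prob(beta_j, spike_sd=0.01, slab_sd=1.0, prior_p=0.5):
    """P(indicator = 1 | beta_j): slab density relative to spike density."""
    slab = prior_p * normal_pdf(beta_j, slab_sd)
    spike = (1.0 - prior_p) * normal_pdf(beta_j, spike_sd)
    return slab / (slab + spike)

print(inclusion_prob(0.001))  # near zero -> spike dominates, low probability
print(inclusion_prob(0.8))    # large     -> slab dominates, high probability
```

Iterating this update within a Gibbs sampler yields posterior inclusion probabilities for each predictor of the exposure distribution.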

Reich, Bondell, Li (2011). Sufficient Dimension Reduction via Bayesian Mixture Modeling. Biometrics.

Dimension reduction is central to an analysis of data with many predictors. Sufficient dimension reduction aims to identify the smallest possible number of linear combinations of the predictors, called the sufficient predictors, that retain all of the information in the predictors about the response distribution. In this paper we propose a Bayesian solution for sufficient dimension reduction. We directly model the response density in terms of the sufficient predictors using a finite mixture model. This approach is computationally efficient and offers a unified framework to handle categorical predictors, missing predictors, and Bayesian variable selection. We illustrate the method using both a simulation study and an analysis of an HIV data set.
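To illustrate what a "sufficient predictor" is, the toy below uses the classical frequentist analogue, sliced inverse regression (SIR): when the response depends on the predictors only through a single linear combination, slicing on the response and eigen-decomposing the covariance of the slice means recovers that direction. This is a sketch of the target of estimation, not the paper's Bayesian mixture-model approach:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 2000, 5
X = rng.standard_normal((n, p))
b = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)  # true direction
u = X @ b
y = u + 0.25 * u**3 + 0.1 * rng.standard_normal(n)     # single-index response

# Standardize, slice on y, average X within slices, eigen-decompose
Xc = (X - X.mean(0)) / X.std(0)
order = np.argsort(y)
slices = np.array_split(order, 10)
M = np.array([Xc[s].mean(0) for s in slices])
cov = M.T @ M / len(slices)
vals, vecs = np.linalg.eigh(cov)
direction = vecs[:, -1]          # leading eigenvector, aligned with +/- b
print(np.round(direction, 2))
```

The recovered direction matches the true one up to sign; the Bayesian mixture model in the paper estimates the same subspace while also handling categorical predictors, missing data, and variable selection.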

Reich, Fuentes, Herring, Evenson (2011). Bayesian variable selection for multivariate spatially-varying coefficient regression. Biometrics.

Physical activity has many well-documented health benefits for cardiovascular fitness and weight control. We consider one of the first studies of pregnant women that examines the impact of characteristics of the built environment on physical activity levels. Using a socioecologic framework, we study the associations between physical activity and several factors including personal characteristics, meteorological/air quality variables, and neighborhood characteristics for pregnant women in four counties of North Carolina. We simultaneously analyze six types of physical activity and investigate cross-dependencies between these activity types. Exploratory analysis suggests that the associations are different in different regions. Therefore we use a multivariate regression model with spatially-varying regression coefficients. This model includes a regression parameter for each covariate at each spatial location. For our data with many predictors, some form of dimension reduction is clearly needed. We introduce a Bayesian variable selection procedure to identify subsets of important variables. Our stochastic search algorithm determines the probabilities that each covariate's effect is null, non-null but constant across space, and spatially-varying. We found that individual level covariates had a greater influence on women's activity levels than neighborhood environmental characteristics, and some individual level covariates had spatially-varying associations with the activity levels of pregnant women.

Reich, Storlie, Bondell (2009). Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes. Technometrics.

With many predictors, choosing an appropriate subset of the covariates is a crucial, and difficult, step in nonparametric regression. We propose a Bayesian nonparametric regression model for curve-fitting and variable selection. We use the smoothing spline ANOVA framework to decompose the regression function into interpretable main effect and interaction functions. Stochastic search variable selection via MCMC sampling is used to search for models that fit the data well. Also, we show that variable selection is highly-sensitive to hyperparameter choice and develop a technique to select hyperparameters that control the long-run false positive rate. The method is used to build an emulator for a complex computer model for two-phase fluid flow.

Bondell, Reich (2008). Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics.

In this paper, a new method called the OSCAR (Octagonal Shrinkage and Clustering Algorithm for Regression) is proposed to simultaneously select variables and perform supervised clustering in the context of linear regression. The technique is based on penalized least squares with a geometrically intuitive penalty function that, like the LASSO penalty, shrinks some coefficients to exactly zero. Additionally, this penalty yields exact equality of some coefficients, encouraging correlated predictors that have a similar effect on the response to form clusters represented by a single coefficient. These resulting clusters can then be investigated further to discover what contributes to the group having a similar behavior. The OSCAR then enjoys sparseness in terms of the number of unique coefficients in the model. The proposed procedure is shown to compare favorably to the existing shrinkage and variable selection techniques in terms of both prediction error and reduced model complexity.
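The OSCAR penalty itself is easy to write down from the description above: an L1 term plus a pairwise L-infinity term. The snippet below evaluates it directly from that definition to show why equal coefficients are encouraged: for two vectors with the same L1 norm, the one with tied coefficients incurs a smaller pairwise cost. Tuning values are illustrative:

```python
import numpy as np

def oscar_penalty(beta, lam=1.0, c=1.0):
    """OSCAR penalty: lam * (sum |b_j| + c * sum_{j<k} max(|b_j|, |b_k|))."""
    b = np.abs(np.asarray(beta, dtype=float))
    l1 = b.sum()
    pairwise = sum(max(b[j], b[k])
                   for j in range(len(b)) for k in range(j + 1, len(b)))
    return lam * (l1 + c * pairwise)

# Both vectors have L1 norm 2, but the tied (clustered) one is cheaper:
print(oscar_penalty([1.0, 1.0, 0.0]))  # 2 + (1 + 1 + 1)       -> 5.0
print(oscar_penalty([1.5, 0.5, 0.0]))  # 2 + (1.5 + 1.5 + 0.5) -> 5.5
```

Minimizing least squares under this penalty therefore shrinks some coefficients exactly to zero (like the LASSO) and sets others exactly equal, producing the supervised clusters described above.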