Student Project Presentations (ST 740)
Students from ST 740 are required to present a study of their choice
during the final week of the class. Here is the list of titles and abstracts of the projects:
- Fall semester, 2006
- Team members: Suraj Anand
- Presentation time: Dec. 6, 10:15-10:30
- Title: A Bayesian Approach to Assess the Risk of QT Prolongation of an
Investigational New Drug
- Abstract: An undesirable property associated with some non-antiarrhythmic
drugs is their ability to delay cardiac repolarization, more generally
known as QT prolongation. The standard approach to investigating a drug
for its potential for QT prolongation is to construct a 90% two-sided (or
a 95% one-sided) confidence interval (CI) for the difference in mean QTc
(a heart-rate corrected version of QT) between drug and placebo at each
time point, and to conclude non-inferiority if the upper limits for all
these CIs are less than a pre-specified constant. One alternative approach
is to base the non-inferiority inference on the largest difference in
population mean QTc between drug and placebo. In this project paper, we
propose a formal solution to this problem based on a Bayesian framework
using a Monte Carlo simulation method. The proposed method has several
advantages over some of the other existing methods and it is easy to
implement in practice. We use simulated data to assess the appropriateness
of this approach and apply the method to a real data set on QTc to make a
formal and thorough statistical inference.
- Data Source:
Bioequivalence and Statistics in Clinical Pharmacology
- Original Source: Patterson S. and Jones B. (2005). Bioequivalence
and Statistics in Clinical Pharmacology, Chapman and Hall, CRC Press, London.
- Code used: R code
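As a rough illustration of the Monte Carlo idea in the abstract above, the following Python sketch draws from an assumed posterior for the drug-minus-placebo mean QTc difference at each time point and estimates the posterior probability that the largest difference stays below a threshold. The independent normal posteriors, the numerical summaries, and the 10 ms threshold are illustrative assumptions, not taken from the project.

```python
import random

def prob_max_below(post_means, post_sds, threshold, n_draws=20000, seed=0):
    """Estimate P(max_t delta_t < threshold) by Monte Carlo.

    Assumes (for illustration only) independent normal posteriors
    for the mean QTc difference delta_t at each time point.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        # One joint posterior draw of the differences at all time points.
        draws = [rng.gauss(m, s) for m, s in zip(post_means, post_sds)]
        if max(draws) < threshold:
            hits += 1
    return hits / n_draws

# Hypothetical posterior summaries (in ms) at four time points.
means = [2.0, 4.5, 6.0, 3.0]
sds = [1.5, 1.5, 1.5, 1.5]
p = prob_max_below(means, sds, threshold=10.0)
print(f"P(max difference < 10 ms) ~ {p:.3f}")
```

Working with the maximum directly gives a single probability statement about the largest difference, rather than one interval per time point.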
- Team members: Eric Belasco
- Presentation time: Dec. 6, 10:35-10:50
- Title: A Bayesian Approach to Analyzing Cattle Feeding Yield Risks
- Abstract: The development of livestock insurance programs as part of the Agricultural Risk Protection
Act of 2000 has motivated research investigating risks affecting cattle feeding enterprises. Current
research indicates that three cattle yield measures can be jointly modeled through the use of a multivariate
lognormal distribution with multiplicative heteroskedasticity under a Tobit formulation. This is largely
due to the highly correlated nature of the yield factors and the degree of censoring in the dependent variables.
A major concern lies in the modeling of the censored component of the dataset. This research will model
yields using a Bayesian approach and compare it to that of a frequentist approach. Also, Bayes Factors
will be used to compare this model to one that uses a zero-inflated Gamma distribution.
- Data Source: undisclosed
- Original Source: Belasco, Taylor, Goodwin, and Schroeder. Probabilistic Models of Yield,
Price, and Revenue Risk for Fed Cattle Production (unpublished).
- Code used: WinBUGS code
- Team members: Carl DiCasoli
- Presentation time: Dec. 6, 10:55-11:10
- Title: A Bayesian approach to estimation of survival probability using a
posterior distribution based on the Yang-Prentice Model
- Abstract: In this project, we will determine whether a Bayesian method for
estimating the short-term and long-term hazard ratios improves on two
other proposed sets of estimates: those from the semi-parametric
Yang-Prentice model and the Kaplan-Meier estimate. In determining both
the Kaplan-Meier estimate and the semi-parametric Yang-Prentice model
estimates, we have assumed that the sample size is large and that the response
variable (the time to death or hospitalization) is independent between the
control and treatment groups. Additionally, the semi-parametric Yang-Prentice
model takes into account the short-term and long-term hazard ratios, in
addition to assuming a large sample size. However,
the estimates from the Bayesian approach do not depend upon having a large
sample size because the model that we will be using is parametric. That is, the
Bayesian model assumes that the control group's hazard ratio comes from a
Weibull distribution. If the Bayes estimates produce a good fit when graphing
the survival probability against the response as compared to the Kaplan-Meier
estimate, then we can conclude that the Bayes procedure is preferable to
either of the frequentist methods, since it does not require a large sample size.
- Data Source: datasets
- Original Source: Yang, Song and Prentice, Ross. Semiparametric analysis of short-term and
long-term hazard ratios with two-sample survival data, Biometrika (2005), 92,
1, pp. 1-17.
Tsiatis, Anastasios, Zhang, Daowen, and Lu, Wenbin. ST 745, Spring 2006,
Analysis of Survival Data, Lecture notes 2, pp. 8-12.
- Code used: WinBUGS code
- Team members: Judith Canner
- Presentation time: Dec. 6, 4:30-4:45
- Title: Neutral Model of Species Distribution Patterns
- Abstract: Ecologists have long sought to explain species distribution
patterns. In Gelfand et al. (2003), the authors use a hierarchical
Bayesian model to incorporate spatially associated random effects to
explain distribution patterns of plants in the Cape Floristic Region in
South Africa. In a similar framework, I will analyze species distribution
patterns of ants in Southern Appalachia (Smoky Mountains). I will assume
that all ant species are biologically similar and will use a Bayesian
hierarchical model to test whether habitat characteristics are sufficient
to predict species distribution patterns. I will parameterize the model
with species presence/absence and site characteristic data (elevation,
temperature, litter depth, etc.) from Dunn et al. (in press). I will then
compare predicted species presence/absence to observed species
presence/absence to test if habitat characteristics are sufficient to
predict species distribution patterns.
- Data Source: Species Sites and Site
Characteristics
- Original Source: Gelfand et al. (2003). Explaining Species Distribution Patterns through
Hierarchical Modeling. Bayesian Analysis 1(1):1-35.
- Code used: WinBUGS code
- Team members: Liz Nelson and Mathew Krachey
- Presentation time: Dec. 6, 4:50-5:05
- Title: Bayesian modification to Brownie models applied to Cayuga lake trout
- Abstract: Resource biologists concerned with conservation of exploited animal stocks
must assess key population demographics for management plans. One
demographic of particular interest is the harvest mortality rate, which
can be controlled if the target species is at critically low population
levels. The estimation of harvest mortality can be complicated since
overall mortality includes both natural and harvest mortality. In
addition, harvested stocks are often highly migratory, making population
size estimates unreliable. Brownie (1978, 1985) proposed a multinomial
model to differentiate natural mortality and harvest mortality in
harvested populations through tag return studies. We propose a Bayesian
modification of the Brownie model using non-informative priors. Analysis
will be based on the 1960-1964 Cayuga lake trout data set provided by
Hoenig et al. (1998). Model performance will be assessed by comparison with
previous studies.
- Data Source: Table 4 of the original source below
- Original Source: Hoenig, J.M., N.J. Barrowman, K.H. Pollock, E.N. Brooks, W.S. Hearn and T.
Polacheck. 1998. Models for tagging data that allow for incomplete mixing
of newly tagged animals. Can. J. Fish. Aquat. Sci. 55:1477-1483.
- Code used: WinBUGS code
- Team members: Yan Zhang and Jin Huang
- Team members: Arun Krishna, Laine Elliot and Eren Demirhan
- Presentation time: Dec. 8, 10:15-10:30
- Title: Bayesian Variable Selection in Linear Models with Zellner's prior
- Abstract: A common method to solve variable selection problems in linear models is
minimizing a penalized sum of squares, where most of the methods differ in
the penalty function. The earlier methods such as best subset selection,
Mallows' Cp, Bayesian and Akaike Information Criteria use a penalty
function where only the number of parameters is penalized. Recent methods
are mostly based on penalty functions including the norms of parameter
estimates, or combination of them. Ridge Regression, LASSO and Elastic Net
can be considered as popular examples of these shrinkage regression
models. These problems can be considered under a Bayesian framework where
priors of the parameters are represented as proportional to the penalty
functions (Ghosh 2006). In this paper, we will talk about a specific
approach to performing variable selection by using Zellner's prior, which
depends on the design matrix X. We will then compare this method to the
recent attempt made by Casella and Moreno (2006) where the variable
selection problem was solved by using automatic priors. The data we will
use to make this comparison is the "ancient" Hald data (Casella and
Moreno, 2006) which measures the effect of heat on the composition of
cement.
- Data Source: Hald data
- Original Source: Casella, G. and Moreno, E. (2006). Objective Bayesian variable selection, Journal of the American Statistical
Association, 101, 157-167.
- Code used: R code
- Team members: Dhruv Sharma
- Presentation time: Dec. 8, 10:35-10:50
- Title: Variable Selection via Bayesian Optimization
- Abstract: A criterion-based and fully automatic Bayesian variable selection method
is proposed for the canonical linear regression model. We consider a full
hierarchical model with a conjugate prior with multivariate normal for the
regression coefficients and inverse gamma for the error variance. A new
loss function is proposed and the Bayes estimator that minimizes the
posterior expected loss is obtained under the given setup. The proposed
estimator is validated using simulated data and illustrated using the Hald
data set. We also compare our results with those using some of the
currently available methods.
- Data Source: Hald data
- Original Source: Casella, G. and Moreno, E. (2006). Objective Bayesian variable selection, Journal of the American Statistical
Association, 101, 157-167.
- Code used: R code
- Team members: Haojun Ouyang and Yuefeng Wu
- Presentation time: Dec. 8, 10:55-11:10
- Title: Passenger Car Mileage: A Bayesian Approach
- Abstract: Passenger cars of different makes and models have
different gasoline mileage. Gasoline mileage is also influenced by
the weight and horsepower of the vehicle. The data set contains the
make, model, engine horsepower, top speed, vehicle weight, cubic feet of
cab space, and average miles per gallon. The model we plan to use is
y = X*beta + error. We plan to fit a Bayesian regression (g-prior), test
for the significance of each effect, and compare our results to those
obtained from a non-Bayesian regression.
- Data Source: DASL: Passenger Car Mileage
- Original Source: R.M. Heavenrich, J.D. Murrell, and K.H. Hellman, Light
Duty Automotive Technology and Fuel Economy Trends Through 1991, U.S.
Environmental Protection Agency, 1991 (EPA/AA/CTAB/91-02).
- Code used: WinBUGS code
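To make the g-prior regression in the abstract above concrete, here is a minimal Python sketch, assuming the zero-mean form of Zellner's g-prior, beta | sigma^2 ~ N(0, g * sigma^2 * (X'X)^-1). Under that assumption the posterior mean of beta is the least-squares estimate shrunk by the factor g/(1+g). The data values, the single-predictor model, and the choice g = n are made up for illustration and are not from the project; shrinking the intercept along with the slope is also a simplification.

```python
def ols_simple(x, y):
    """OLS intercept and slope for the model y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return [b0, b1]

def g_prior_posterior_mean(beta_ols, g):
    """Posterior mean of beta under a zero-mean Zellner g-prior."""
    shrink = g / (1.0 + g)
    return [shrink * b for b in beta_ols]

# Hypothetical data: vehicle weight (1000 lb) and miles per gallon.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [40.0, 34.0, 30.0, 25.0, 21.0]

beta_ols = ols_simple(weight, mpg)
beta_bayes = g_prior_posterior_mean(beta_ols, g=len(weight))
print("OLS estimate:      ", beta_ols)
print("g-prior post. mean:", beta_bayes)
```

As g grows, the shrinkage factor approaches 1 and the Bayesian estimate approaches the non-Bayesian one, which is one simple way to see how the two analyses relate.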
- Team members: Kaushal Mishra
- Presentation time: Dec. 8, 4:30-4:45
- Title: Estimation of scram rate trends in Nuclear Power Plants using hierarchical
Bayesian Model
- Abstract: Nuclear reactors are equipped with reactor scram (sudden insertion of control
rods) systems to ensure rapid shutdown of the system in the event of leaks,
failure of power conversion systems, or other operational abnormalities. The
U.S. Nuclear Regulatory Commission (NRC) collects scram rate data for various
nuclear power plants to track how well they function over time and
to regulate them if necessary. The source data in this case are the scram rates
of 66 commercial nuclear power plants obtained from the annual observed scram
data from 1984-1993. The data show an increase in zero-scram incidents
over time, from 1.5% in 1986 to 33% in 1993. To analyze this kind of count data with
excess zeros, a zero-inflated Poisson (ZIP) distribution on unplanned scrams with
appropriate link functions and vague normal and inverse Wishart priors and
hyperpriors on the regression parameters is proposed. The results obtained
will be compared with those from the Poisson model. To obtain the posterior estimates of
the parameters along with credible intervals, the Markov chain Monte Carlo (MCMC)
technique will be applied using WinBUGS.
- Data Source: NRC
- Original Source: Martz H. F., Parker R. L. and Rasmuson D. M. (1999), Estimation of
trends in the scram rate at nuclear power plants, Technometrics, 41, 352-364.
Ghosh S. K., Mukhopadhyay P. and Lu J. C. (2006), Bayesian
analysis of zero-inflated regression models, Journal of Statistical Planning
and Inference, 136, 1360-1375.
- Code used: WinBUGS code
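For readers unfamiliar with the zero-inflated Poisson (ZIP) distribution named in the abstract above, the following Python sketch writes out its probability mass function and log-likelihood. It assumes a single rate lam and a single zero-inflation probability pi with no covariates or link functions, which is simpler than the regression model the project describes; the counts are hypothetical.

```python
import math

def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi the count is a structural zero; otherwise it
    is drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson

def zip_loglik(counts, lam, pi):
    """Log-likelihood of independent ZIP observations."""
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)

# Hypothetical yearly scram counts for one plant.
counts = [0, 0, 0, 1, 2]
print("Poisson (pi = 0):", zip_loglik(counts, lam=1.0, pi=0.0))
print("ZIP (pi = 0.4):  ", zip_loglik(counts, lam=1.0, pi=0.4))
```

Setting pi to 0 recovers the ordinary Poisson model, so the two models compared in the abstract are nested.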
- Team members: Alexander Griffing
- Presentation time: Dec. 8, 4:50-5:05
- Title: A Bayesian approach to cryptogram language estimation
- Abstract: Hiding written text by permuting the meanings of the letters is a basic method of
cryptography. I plan to develop a Bayesian method of estimating the language in which the original
text was written given the encrypted text. First, language models for several languages that use
the Roman alphabet will each be built using a non-informative Dirichlet prior for the letter
probabilities which will be updated using a training subset of text from Project Gutenberg,
a free online text repository. The final model will be a non-informative mixture of these
language models. Second, samples will be taken from a test subset and MCMC will be used to
estimate the posterior language distribution for these sample texts. The estimates will be
compared to the true text languages.
- Data Source: Project Gutenberg and
Index of coincidence
- Original Source: Chen and Goodman (1998) and
Hasinoff (2003)
- Code used: C code
- Fall semester, 2005
- Team members: Hugh Crews, Emily Hohmeister, Rebecca Horowitz
- Presentation time: Nov 28th, 2005, 10:15 - 10:35
- Title: The Silver Lining: A Bayesian Approach to Analyzing Cloud Seeding Data
- Abstract: In Simpson (1975), data from both seeded and unseeded clouds were
analyzed in a Bayesian setting. The amount of rainfall was assumed to
follow a skewed Gamma distribution with the same shape parameter for both
seeded and unseeded clouds. The prior used was a vague prior. Simpson
(1972) provided results indicating that the use of the gamma distribution
is appropriate. Another possibility suggested in Simpson (1972) is the
Rayleigh distribution. We plan to use the Rayleigh distribution and test
for a multiplicative effect, comparing our results and assumptions to
those of the gamma distribution. We can compare these distributions
using the Bayes factor as well as plots. We will also compare the results
from a Bayesian setting with a frequentist approach.
- Data Source: DASL Clouds data set
- Original Source: Simpson, Alsen and Eden (1975). A Bayesian analysis of
a multiplicative treatment effect in weather modification, Technometrics,
17, 161-166.
- Additional Source: Simpson, J. (1972). Use of the gamma distribution in
single cloud rainfall analysis, Monthly Weather Review, 100, 309-312.
- Code used: WinBUGS code
- Team members: Chia-Cheng Chen, Tsuei-Long Chen and Mingyan Huang
- Presentation time: Nov 28th, 2005, 10:40 - 11:00
- Title: Prediction of US Temperatures and Bayesian Model Determination.
- Abstract: The purpose of this project is to predict the
average January minimum temperature, in degrees Fahrenheit, of any U.S.
city given its latitude and longitude. We first examine plots of Lat vs. Long,
JanTemp vs. Lat, and JanTemp vs. Long (where Lat stands for latitude, Long for
longitude, and JanTemp for average January minimum temperature) to see what
information they provide. We then consider three competing models that
could describe the relationships among JanTemp, Lat, and Long. Finally, we use
Bayesian model selection to check which model is adequate.
Based on the chosen model, we can obtain a better prediction of
the temperature of any U.S. city.
- Data Source: DASL US
temperatures data set
- Team members: Xiaohua Gong and Shufang Liu
- Presentation time: Nov 30, 2005, 10:15-10:35
- Title: Bayesian Logistic Regression for Spam Email Classification
- Abstract: In this study, we built a model to predict whether an email is normal or
spam according to the most commonly occurring words in the emails. We use
the spam dataset as our training dataset. To estimate the parameters,
namely beta, we adopted a Bayesian framework in which, under the assumption of
a Gaussian prior for the parameters beta, we obtain the MAP (maximum a
posteriori) estimate of beta. To choose an appropriate prior variance of
beta that is needed for MAP estimation, we took the cross validation
approach to select the values of prior variance that minimized the cross
validation error. Cross validation error for the chosen value of prior
variance is reported as an estimate of the testing error of our model.
- Data Source: Spam Data Info
and the spam data set
- Final report: Gong & Liu report and LaTeX file
- Slides: Gong & Liu's presentation slides
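The MAP estimation described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, assuming independent zero-mean Gaussian priors with a common variance on every coefficient (including the intercept) and plain gradient ascent as the optimizer; the toy data, step size, and iteration count are made up, and the project's actual optimizer and cross-validation procedure are not shown.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def map_logistic(X, y, prior_var, lr=0.05, n_iter=5000):
    """MAP estimate of beta under independent N(0, prior_var) priors.

    Maximizes log-likelihood minus sum(beta**2) / (2 * prior_var)
    by plain gradient ascent.
    """
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = [-b / prior_var for b in beta]  # gradient of the log prior
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j, v in enumerate(xi):
                grad[j] += (yi - p) * v  # gradient of the log-likelihood
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data (made up): column 0 is an intercept, column 1 a word frequency.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]
for prior_var in (0.1, 10.0):
    print(prior_var, map_logistic(X, y, prior_var))
```

A smaller prior variance pulls the estimate toward zero, which is why the prior variance acts as a tuning parameter that can be chosen by cross-validation.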
- Team members: Martin Heller
- Presentation time: Nov 30, 2005, 10:40-11:00
- Title: Bayesian functional data analysis when the coefficients are random
- Abstract: Of interest is finding a probability density of functions
determining the daily electrical consumption for buildings in a research park.
Cubic splines were chosen to model the data because of their ability to emulate
a wide variety of functions. Random coefficient models in a Bayesian context
will be used to get an estimate of the posterior distribution.
- Data Source: undisclosed
Last updated December 10, 2005.