Course info


Course objective


Missing data are ubiquitous in almost every area of scientific inquiry, and especially in health sciences research involving human subjects.

The classical definition of missingness is that data that were intended to be collected in a prospective study were not. For example, in a medical study designed to collect data longitudinally on every participant at a prespecified series of follow-up times, some subjects may fail to appear at the clinic for intended measurements at one or more times or, more ominously, drop out of the study and never return after a certain point. If the reasons for failure to appear or dropout are related to the issues under study, e.g., if subjects who are benefiting less from their assigned treatments are more likely to drop out, then, intuitively, failure to account for this could distort conclusions. Missing data of this type are such a great challenge in pharmaceutical and biotechnology research that in 2010 the US Food and Drug Administration asked the National Research Council of the National Academy of Sciences to convene an expert Panel on the Handling of Missing Data in Clinical Trials to develop guidance on how missingness should be handled in the regulatory context.

Missingness also occurs in data that arise in other contexts. For example, in retrospective analysis of observational data already collected, such as those from completed studies or captured in other large databases, it is routinely the case that not all variables are available on all subjects or other units. Contrary to popular belief, "big data" are not somehow exempt from issues of missingness. For example, the use of electronic health records, which are of course observational in nature, is of great current interest in the study of the comparative effectiveness of drugs and other interventions. The fact that some subjects have relatively more observations than others often reflects the fact that less healthy individuals tend to have more encounters with the healthcare system. These same individuals may also be more likely to receive certain interventions and to have health outcomes worse than those of healthier individuals. Thus, there is a relative "missingness" of information among different individuals that is likely associated with the issues under study.

Missing data have important implications for analysis. At the very least, there is a loss of information and reduction in precision of inference on the population of interest relative to that intended. Of much greater concern is the potential for biased and misleading inferences that can result if the reasons for missingness are related to outcomes of interest. Accordingly, principled methods to take this challenge into appropriate account are required.
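The distinction between a loss of precision and outright bias can be illustrated with a small simulation. The sketch below is not taken from the course materials; the outcome distribution, the dropout probabilities, and the function name `simulate` are all assumptions chosen purely for illustration. It compares the mean of the observed values when missingness is unrelated to the outcome with the mean when subjects with lower outcomes are more likely to be missing.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    """Compare observed-data means under two hypothetical missingness mechanisms."""
    rng = random.Random(seed)

    # Hypothetical outcome for every subject; larger values mean greater benefit.
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Mechanism 1 (assumed): every subject is missing with probability 0.4,
    # regardless of the outcome.
    unrelated = [v for v in y if rng.random() > 0.4]

    # Mechanism 2 (assumed): subjects with below-average outcomes are missing
    # with probability 0.7, all others with probability 0.1.
    related = [v for v in y if rng.random() > (0.7 if v < 0 else 0.1)]

    return {
        "full": statistics.fmean(y),
        "unrelated": statistics.fmean(unrelated),
        "related": statistics.fmean(related),
        "n_unrelated": len(unrelated),
        "n_related": len(related),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions both mechanisms discard roughly the same fraction of subjects, so the loss of information is comparable. Only the second mechanism, in which missingness depends on the outcome, shifts the observed mean away from the mean of the full data.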

This course provides an overview of modern statistical frameworks and methods for analysis in the presence of missing data. Both methodological developments and applications are emphasized. The course provides a foundation in the fundamentals of this area that will prepare students to read the current literature and to have a broad appreciation of the implications of missing data for valid inference.

Course prerequisites


ST 522, Statistical Theory II, and ST 552, Linear Models and Variance Components, or equivalents. Students should also have been exposed to SAS and R and have reasonable programming skills. Please see the instructor if you have questions about the suitability of your background.

Course topics


See the class notes for more detailed information.