NEW: SIMreg is now available as an R package. See the download section to download it. The package contains a user manual and help text for the SIMreg functions.
This page provides R code, shell scripts, and instructions for performing weighted haplotype similarity regression (SIMreg) for multi-marker association analysis.
SIMreg is a tool for marker-set association analysis. Association analysis at the gene, pathway, or exon level (hereafter, marker-set analysis) holds great promise for evaluating modest etiological effects of genes with GWAS or sequence data. However, currently available methods target either rare or common variants but not both, assume additive, same-direction effects for loci within a marker set, use test-based frameworks that cannot accommodate covariates such as population structure, and cannot assess interaction effects. SIMreg provides a flexible, powerful and computationally efficient alternative for conducting marker-set analysis. It has the following features that distinguish it from other methods.
The methods implemented in this software are described in the following papers.
The code is written in R and has, at present, only been tested under Linux. Notes on using the code are provided on this page. The source code is provided in the download section below.
For a large number of SNPs (thousands), it is not feasible to run the calculations for all of them sequentially on a single processor. The code is therefore designed to process one portion of a chromosome at a time and to run on multiple processors in a compute cluster. Each job submitted to the cluster analyses one portion, or chunk, of a chromosome. Within a chunk, the calculations are performed on successive "windows" of consecutive SNPs genotyped on the chromosome. The size of these windows is controlled by a parameter to the code, as is the amount of overlap between successive windows. The results for each window are written to tab-delimited text files for later aggregation into a file of results for the whole chromosome.
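The windowing scheme described above can be illustrated with a minimal R sketch. The function name `window_starts` and its arguments are hypothetical and are not part of the SIMreg package. The handling of the last window in a chunk (shifted back so that it still ends on the final SNP) is an assumption made for the example, not a description of what SIMreg does.

```r
# Hypothetical helper (not part of SIMreg): list the SNP index ranges of
# overlapping windows inside one chunk of a chromosome.
window_starts <- function(first_snp, last_snp, window_size, overlap) {
  stopifnot(window_size >= 1, overlap >= 0, overlap < window_size,
            last_snp >= first_snp)
  step <- window_size - overlap            # distance between window starts
  last_start <- max(first_snp, last_snp - window_size + 1)
  starts <- seq(first_snp, last_start, by = step)
  # Assumption: add one final window so the end of the chunk is covered.
  if (tail(starts, 1) != last_start) starts <- c(starts, last_start)
  data.frame(start = starts,
             end   = pmin(starts + window_size - 1, last_snp))
}

# A chunk of 20 SNPs, windows of 10 SNPs, successive windows sharing 5 SNPs.
w <- window_starts(first_snp = 1, last_snp = 20, window_size = 10, overlap = 5)
print(w)
```

With these settings the windows cover SNPs 1 to 10, 6 to 15 and 11 to 20, so each SNP away from the chunk edges falls in two windows.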
The SIMreg code requires as input:
The code currently supports genotype data only in the format used by IMPUTE v2. For a complete description of the IMPUTE v2 format, see the Genotype File Format section of Jonathan Marchini's File Formats web page.
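As a rough illustration of the genotype layout, the sketch below parses two made-up lines in the IMPUTE v2 genotype format: one line per SNP, with five leading fields (SNP ID, rsID, base-pair position, allele A, allele B) followed by three genotype probabilities (AA, AB, BB) per individual. The data and variable names are invented for the example. The conversion to an expected allele count is shown only to make the layout concrete and is not necessarily how SIMreg uses the probabilities.

```r
# Made-up example: two SNPs, two individuals, IMPUTE v2 genotype layout.
gen_lines <- c(
  "snp1 rs1 1000 A G 1 0 0 0 1 0",
  "snp2 rs2 2000 C T 0 0 1 0.1 0.8 0.1"
)

# Split on whitespace and keep everything as text at first, so that allele
# codes such as "T" are not misread as logical values.
fields <- do.call(rbind, strsplit(gen_lines, "[[:space:]]+"))

snp_info <- data.frame(snp_id   = fields[, 1],
                       rs_id    = fields[, 2],
                       position = as.integer(fields[, 3]),
                       allele_A = fields[, 4],
                       allele_B = fields[, 5],
                       stringsAsFactors = FALSE)

# Remaining columns: P(AA), P(AB), P(BB) for each individual in turn.
probs <- matrix(as.numeric(fields[, -(1:5)]), nrow = nrow(fields))
n_ind <- ncol(probs) / 3
stopifnot(n_ind == round(n_ind))

# Expected number of copies of allele B: P(AB) + 2 * P(BB).
idx <- seq(1, by = 3, length.out = n_ind)
dosage <- probs[, idx + 1, drop = FALSE] + 2 * probs[, idx + 2, drop = FALSE]
dimnames(dosage) <- list(snp_info$rs_id, paste0("ind", seq_len(n_ind)))
print(dosage)
```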
The code reads the values of the trait of interest and any explanatory variables from a single text file. This file must have a column for individual identifiers, a column of trait values and (optionally) columns for covariate values. It has a header line naming the columns, and then one line per individual listing the values of the variables for that individual.
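A minimal sketch of reading such a file in R follows. The column names (`id`, `trait`, `age`, `sex`), the whitespace delimiter and the values are assumptions made for illustration; the names SIMreg expects are given in the package manual. Dropping individuals with missing values is likewise just one possible choice.

```r
# Made-up phenotype file contents: header line, then one line per individual.
pheno_lines <- c(
  "id trait age sex",
  "ind1 1.2 34 1",
  "ind2 0.7 51 2",
  "ind3 NA 45 1"
)

# For a real file, replace text = pheno_lines with the file path.
pheno <- read.table(text = pheno_lines, header = TRUE,
                    stringsAsFactors = FALSE)

# Basic sanity checks before running an analysis.
stopifnot(!anyDuplicated(pheno$id))      # one line per individual
stopifnot(is.numeric(pheno$trait))       # trait values parsed as numbers

# Assumption for this example: keep only individuals with complete data.
pheno_complete <- pheno[complete.cases(pheno), ]
print(pheno_complete)
```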
For the SIMreg tests, the code produces a p-value for a main effect test, a gene-by-environment effect test, and a joint test. These are written to a tab-delimited text file along with the number of the first SNP in the window to which these values apply. Splitting an analysis across many jobs on a cluster results in a large number of small results files. These can be combined to get a single file per chromosome.
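The aggregation step could be sketched in R as follows. The file names, the column names (`first_snp`, `p_main`, `p_gxe`, `p_joint`) and the presence of a header line are all assumptions made for this example, and the p-values are invented. The sketch first writes two small per-chunk files to a temporary directory so that it is self-contained, then combines them.

```r
# Create two made-up per-chunk results files in a temporary directory.
out_dir <- file.path(tempdir(), "simreg_results_demo")
dir.create(out_dir, showWarnings = FALSE)
for (chunk in 1:2) {
  res <- data.frame(first_snp = c(1, 6) + (chunk - 1) * 10,
                    p_main    = c(0.20, 0.03),
                    p_gxe     = c(0.50, 0.40),
                    p_joint   = c(0.30, 0.05))
  write.table(res, file.path(out_dir, sprintf("chr22_chunk%d.txt", chunk)),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# Read every per-chunk file for the chromosome and stack the rows.
files <- list.files(out_dir, pattern = "^chr22_chunk[0-9]+\\.txt$",
                    full.names = TRUE)
parts <- lapply(files, read.delim, stringsAsFactors = FALSE)
combined <- do.call(rbind, parts)

# Drop windows reported by more than one chunk, then sort by position.
combined <- combined[!duplicated(combined$first_snp), ]
combined <- combined[order(combined$first_snp), ]

# Write a single tab-delimited file for the whole chromosome.
write.table(combined, file.path(out_dir, "chr22_all.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
print(combined)
```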
SIMreg 1.31: R Package for Unix-based systems.