Jung-Ying Tzeng

SIMreg: Similarity Regression

NEW: SIMreg is now available as an R package; see the download section below. The package contains a user manual and help text for the SIMreg functions.

This page provides R code, shell scripts, and instructions for performing weighted haplotype similarity regression (SIMreg) for multi-marker association analysis.

SIMreg is a tool for marker-set association analysis. Association analysis at the gene, pathway, or exon level (hereafter, marker-set analysis) holds great promise for evaluating modest etiological effects of genes with GWAS or sequence data. However, currently available methods target detection of either rare or common variants but not both, assume additive and same-direction effects for loci within a marker set, use test-based frameworks that cannot accommodate covariates such as population structure, and lack the capacity to assess interaction effects. SIMreg provides a flexible, powerful, and computationally efficient alternative for conducting marker-set analysis. The following features distinguish it from other methods.

  1. The method uses genetic similarity to aggregate information across markers, and incorporates adaptive weights depending on allele frequencies to accommodate rare and common variants.
  2. Collapsing information at the similarity level instead of the genotype level avoids the cancellation of signals from opposite etiological effects, and is applicable to any class of genetic variant without having to dichotomize the allele types.
  3. It is regression-based, naturally incorporates covariates, and is applicable to both observed and imputed (dosage) genotypes.
  4. We use a rigorous analytical derivation to demonstrate that collapsing information through similarity status explicitly captures the locus-locus interactions among all markers in a set.
  5. It provides a series of test statistics that can be used to assess (a) marginal genetic main effect (G test), (b) gene-environment interaction effects (GxE test), or (c) the joint effects of both types simultaneously. These tests do not require permutations to assess significance, and are fast to compute.
SIMreg extends HSreg and incorporates all of its features and functions.
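To make the idea of frequency-weighted genetic similarity in feature 1 concrete, here is a minimal R sketch. It is not taken from the SIMreg source: the function name weighted_ibs, the identity-by-state (IBS) similarity measure, and the weight 1/sqrt(q(1-q)) are all illustrative assumptions, chosen only to show how rarer variants can be up-weighted when similarity is aggregated across markers.

```r
# Hypothetical helper (not part of SIMreg): pairwise weighted IBS similarity.
# geno:    individuals x markers matrix of allele counts or dosages in [0, 2]
# weights: one non-negative weight per marker
weighted_ibs <- function(geno, weights) {
  n <- nrow(geno)
  S <- matrix(0, n, n)
  for (m in seq_len(ncol(geno))) {
    # outer() gives all pairwise genotype differences at marker m;
    # IBS is 2 when genotypes match and 0 for opposite homozygotes.
    ibs <- 2 - abs(outer(geno[, m], geno[, m], "-"))
    S <- S + weights[m] * ibs
  }
  # Scale so that each entry lies in [0, 1].
  S / (2 * sum(weights))
}

# Toy data: 3 individuals (rows) x 3 markers (columns).
geno <- matrix(c(0, 1, 2,
                 0, 0, 1,
                 2, 1, 0), nrow = 3, byrow = TRUE)
q <- colMeans(geno) / 2        # sample allele frequency per marker
w <- 1 / sqrt(q * (1 - q))     # assumed weight: larger for rarer alleles
S <- weighted_ibs(geno, w)
```

Because the aggregation happens on pairwise similarity rather than on summed genotypes, markers with opposite effect directions do not cancel each other in S.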

The methods implemented in this software are described in the following papers.

Haplotype-Based Association Analysis via Variance-Components Score Test  Tzeng and Zhang 2007
Gene-Trait Similarity Regression for Multimarker-Based Association Analysis  Tzeng et al. 2009
Detecting gene and gene-environment effects of common and rare variants on quantitative traits: A marker-set approach using gene-trait similarity regression  Tzeng et al. 2010 (submitted)

General Information

The code is written in R and has, at present, only been tested under Linux. Notes on using the code are provided on this page. The source code is provided in the download section below.

For a large number of SNPs (thousands), it is not feasible to run the calculations for all of them sequentially on a single processor. The code is therefore designed to run on one portion of a chromosome at a time, across multiple processors in a compute cluster. Each job submitted to the cluster analyses one portion, or chunk, of a chromosome. Within a chunk, the calculations are performed on successive "windows" of consecutive SNPs genotyped on the chromosome. The size of these windows is controlled by a parameter to the code, as is the amount of overlap between successive windows. The results for each window are written to tab-delimited text files for later aggregation into a single file of results for the whole chromosome.
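The windowing scheme can be sketched in a few lines of R. The function name window_starts and its arguments are hypothetical and are not the names of the actual SIMreg parameters; the sketch only shows how a window size and an overlap determine which SNPs fall into each window.

```r
# Hypothetical helper: start indices of sliding windows over the SNPs in a chunk.
window_starts <- function(n_snps, window_size, overlap) {
  stopifnot(window_size >= 1, overlap >= 0, overlap < window_size)
  step <- window_size - overlap
  # Last start that still leaves room for a full window (1 for very short chunks).
  last_start <- max(1, n_snps - window_size + 1)
  seq(1, last_start, by = step)
}

n_snps <- 20
window_size <- 5
overlap <- 2

starts <- window_starts(n_snps, window_size, overlap)
# SNP indices covered by each window.
windows <- lapply(starts, function(s) s:min(s + window_size - 1, n_snps))
```

With 20 SNPs, a window size of 5, and an overlap of 2, successive windows start 3 SNPs apart. In this sketch, trailing SNPs that do not fill a complete window are left out when the chunk length does not divide evenly; a real implementation would need to decide how to treat them.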

Input Data

The SIMreg code requires as input:

  • A file of genotype calls (per chromosome) for the individuals in the sample.
  • A list of trait values (one value per individual).
  • The values of explanatory variables (covariates), with each variable having one value per individual. (These are optional.)

The code currently supports genotype data only in the format used by IMPUTE v2. For a complete description of the IMPUTE v2 format, see the Genotype File Format section of Jonathan Marchini's File Formats web page.
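As an illustration of the genotype format, the R sketch below parses two made-up lines in IMPUTE v2 layout: one line per SNP, five leading columns (SNP ID, rs ID, position, allele A, allele B), then three genotype probabilities per individual for AA, AB, and BB. Converting the probabilities to an expected count of allele B is a common way to obtain a dosage; whether SIMreg computes dosages in exactly this way is an assumption, and this is not its own reader.

```r
# Two SNPs, two individuals, in IMPUTE v2 genotype format (made-up values).
gen_lines <- c(
  "snp1 rs1 1000 A G 1 0 0 0 1 0",
  "snp2 rs2 2000 C T 0 0 1 0.1 0.8 0.1"
)
gen <- read.table(text = gen_lines, stringsAsFactors = FALSE)

# Everything after the first five columns is a probability triple per individual.
n_ind <- (ncol(gen) - 5) / 3
probs <- as.matrix(gen[, -(1:5)])

# For individual k, columns 3k-1 and 3k of probs hold P(AB) and P(BB).
p_ab <- probs[, seq(2, 3 * n_ind, by = 3), drop = FALSE]
p_bb <- probs[, seq(3, 3 * n_ind, by = 3), drop = FALSE]

# Expected number of copies of allele B: SNPs in rows, individuals in columns.
dosage <- p_ab + 2 * p_bb
```

For real data, the same code would read from a file path instead of the text argument; files with thousands of SNPs are better read chunk by chunk.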

The code reads the values of the trait of interest and any explanatory variables from a single text file. This file must have a column for individual identifiers, a column of trait values and (optionally) columns for covariate values. It has a header line naming the columns, and then one line per individual listing the values of the variables for that individual.
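A minimal R sketch of such a file and how it might be read follows. The column names (id, trait, age, sex), the tab delimiter, and the values are assumptions made for illustration; the layout required by SIMreg is the one described in the package manual.

```r
# A made-up trait/covariate file: header line, then one line per individual.
pheno_lines <- c(
  "id\ttrait\tage\tsex",
  "ind1\t1.2\t34\t1",
  "ind2\t0.7\t51\t2",
  "ind3\t2.3\t45\t1"
)
pheno <- read.table(text = pheno_lines, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)

y <- pheno$trait                                # one trait value per individual
covars <- as.matrix(pheno[, c("age", "sex")])   # optional covariates
```

The order of individuals in this file has to be reconciled with the order of individuals in the genotype file before any analysis; the identifier column exists for that purpose.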

Results

For the SIMreg tests, the code produces a p-value for each of three tests: a main effect test, a gene-by-environment interaction test, and a joint test. These are written to a tab-delimited text file along with the number of the first SNP in the window to which the values apply. Splitting an analysis across many jobs on a cluster produces a large number of small results files, which can be combined into a single file per chromosome.
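Combining the per-chunk files can be sketched in R as follows. The file names, the column names (first_snp, p_main, p_gxe, p_joint), the presence of a header line, and the p-values are all made up for this example; the actual output layout may differ.

```r
# Write two small made-up per-chunk results files to a temporary directory.
out_dir <- tempfile("simreg_results_")
dir.create(out_dir)
chunk1 <- data.frame(first_snp = c(1, 4), p_main = c(0.20, 0.03),
                     p_gxe = c(0.50, 0.40), p_joint = c(0.30, 0.05))
chunk2 <- data.frame(first_snp = c(7, 10), p_main = c(0.60, 0.01),
                     p_gxe = c(0.70, 0.20), p_joint = c(0.65, 0.02))
write.table(chunk2, file.path(out_dir, "chr1_chunk2.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(chunk1, file.path(out_dir, "chr1_chunk1.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read every chunk file for the chromosome and stack the rows.
files <- list.files(out_dir, pattern = "^chr1_chunk.*\\.txt$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.table, header = TRUE, sep = "\t"))

# Order by the first SNP of each window and write one file for the chromosome.
combined <- combined[order(combined$first_snp), ]
write.table(combined, file.path(out_dir, "chr1_all.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```

Sorting by the first SNP number makes the combined file independent of the order in which cluster jobs finished or files were listed.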

R Package Download

SIMreg 1.31: R Package for Unix-based systems.

SIMreg 1.31: R Package for Windows.

SIMreg 1.31: R Package User Manual.