This page provides R code, shell scripts, and instructions, for performing haplotype similarity regression (HSreg) for multi-marker association analysis.
HSreg implements a similarity-based regression method to detect associations between traits and multimarker genotypes. The method uses genetic/haplotype similarity to aggregate information from multiple polymorphic sites (e.g., SNPs or a mixture of different polymorphisms), and regresses trait similarities for pairs of "unrelated" individuals on their genetic similarities to access the gene-trait association. The similarity regression allows for covariates, uses phase-independent similarity measures to bypass the needs to impute phase information, and is applicable to traits of general types (e.g., quantitative and qualitative traits). It can be shown that the similarity model is equivalent to the random effects haplotype analysis and explicitly models the non-additive effects among markers. These features makes it an ideal tool for evaluating association between phenotype and marker sets defined by haplotypes, genes or pathway.
The methods implemented in this software are described in the following papers.
The code is written in R and has been tested under Linux, Windows, and Solaris 10. Notes on using the code are provided on this page. The source code and some example data are provided in the download section below.
For a large number of SNPs (thousands) it is not feasible to run the calculation for all of them at once. The technique used is additive in the sense that the calculation can be performed on successive windows of SNPs and then the results combined. The size of this window can be varied, and a shell script (batch file under Windows) is provided for the examples to perform this windowed version of the calculation. These scripts can be modifed to match the requirements of your own analysis.
This additivity property also allows the analysis to be split into chunks suitable for running on multiple nodes of a compute cluster. A script is provided for combining the results from multiple subset analyses to get a final p-value.
The HSreg code requires as input: a file of genotype calls for the individuals in the sample, a list of trait values (one per individual, e.g. for a binary trait, 1 = unaffected, 2 = affected, and 0 = missing data), and (optionally) the values of explanatory variables (covariates), with each variable having one value per individual.
The code supports genotype data in the format used by IMPUTEv2, and in TPED format as used by PLINK. Some tools are provided for reformatting data in the haplo.stats format, but these are suitable only for small datasets. For a complete description of the IMPUTE v2 format see the Genotype File Format section of Jonathan Marchini's File Formats web page. For a complete description of TPED format, see the PLINK data formats web page. For a complete description of the haplo.stats format, please see the help in the R package.
The example code reads the values of the trait of interest and any explanatory variables from a text file. This file has a column for individual identifiers, a column of trait values and (optionally) columns for covariate values. It has a header line naming the columns, and then one line per individual listing the values of these variables for that individual.
The code produces a test statistic for the association between the trait and genetic variants of the region under consideration, and a p-value for this test statistic based on the weighted chi-squared distribution.
This example demonstrates running the analysis for all SNPs in a set at once. The R code can be run a line at a time at the R command line, or through a script that runs R in batch mode. The script is controlled by an configuration file which tells it where to find the genotype and trait input files, and provides some other parameters for the analysis. See below for a description of the configuration.
The code for example 1 can be found in example_1.readme.r. This file contains some comments and some R code. The data for example_1 are a file of genotype information, chr16.study.gtypes, and a file of trait and covariate values, trait_1.txt.
These example data are based on an example provided with the IMPUTE v2 software (IMPUTE v2), which in turn was extracted from the Wellcome Trust Case Control Consortium data. The data set has 20 subjects and 233 SNPs, this is a therefore a very small example and should run in a few seconds or less (on any reasonable system).
chr16.study.gtypes contains genotype data for the 20 subjects in IMPUTE v2 ".gtypes" format. For a complete description of this format see the Genotype File Format section of Jonathan Marchini's File Formats web page
trait_1.txt contains trait values and values for two explanatory variables for 20 individuals. This file contains 4 columns (individual identifier, trait value, first covariate, second covariate). The trait value is binary, and is coded (in this example as) 1=control, 2=case, 0=missing. In the example file one trait value is missing. The HSreg code will remove all data (trait, covariate and genotype) for the subject with missing data before running the analysis.
To run this example you can start R and, at the prompt, enter:
The result of the calculation will be displayed on the screen.
Alternatively, to run the example in batch mode, at a shell command prompt, type:
R CMD BATCH example_1.readme.r
In this case output from the program will be written to a text file named example_1.readme.Rout.
In order to see the steps involved in using the software, you can open example_1.readme.r in a text editor and run each line separately by cutting and pasting to the R command line.
Finally, there is also a shell script (batch file under Windows) that will run the analysis from the command line. To run the shell script in a Unix-like OS use the following command.
Under Windows use the following command.
The file example_1/config_1.txt contains lines of the form:
name = value
These lines set the parameters for the analysis, including the names of the input files, the trait type, and the approximation method to be used to get a p-value. The file also contains comments describing the values that can be set to control the analysis.
This example demonstrates the technique of performing the analysis on windows of SNPs.
We use the "Example 1" data, a configuration file, example_1/config_1_win.txt, and a shell script (batch file under Windows). The configuration file contains lines of the form:
name = valueThis file is used to specify the input data files, and the parameters of the analysis. Look at the example supplied for details.
To run this example under Windows, open a command prompt, change to the directory where you saved the example files, and enter:
run_analysis.bat example_1\config_1_win.txt > example_1_subset.out
To run this example under Linux (or other Unix-like system), in a terminal window, change to the directory where you saved the example files and enter:
chmod +x run_analysis.csh ./run_analysis.csh example_1/config_1_win.txt > example_1_subset.out
When the command finishes, look at the output file, example_1_subset.out, using a text editor. The final result of the calculation will be at the end of the file. You should get the same result from this subset analysis as from the all-in-one analysis described above.
This example is similar to example 1, but the input data is in haplo.stats format.
The code for example 2 can be found in example_2.readme.r. This file contains some comments and some R code. The data for example_2 are a file of genotype information, mygeno.txt, and a file of trait and covariate values, for this example there are two different versions of the traits file, one has binary trait values and the other has quantitative trait values. These files are binary_trait.txt and quantitative_trait.txt.
The example data set has 400 subjects and 3 SNPs, this is a therefore a very small example and should run in a few seconds or less (on any reasonable system).
binary_trait.txt contains binary phenotype values and values for two covariates for 400 individuals. This file contains 4 columns (individual identifier, phenotype value, first covariate, second covariate).
quantitative_trait.txt contains quantitative phenotype values and values for two covariates for 400 individuals. This file contains 4 columns (individual identifier, phenotype value, first covariate, second covariate).
To run this example you can read the file example_2.readme.r. At the top of the file are some lines of code that load the data sets for the example. Copy and paste these to an R command prompt. Below these lines are a number of different cases illustrating how to use the code. Each of these lines calls the main function and returns a result. In each case you should run the line of code, print out the value of "result" and verify that the value you got is the same as that recorded in example_2.readme.r.
Example 3 is larger than examples 1 or 2. It has 400 individuals and 2000 SNPs, with one of the SNPs artificially associated with the disease. In the example_3 subdirectory you will find genotype data in IMPUTEv2 format, geno_3.txt, and in TPED format geno_3_tped.txt.
To run example 3 under Windows, type one of the following commands at a command prompt.run_analysis example_3\config_3_win.txt run_analysis example_3\config_3_tped_win.txt
The first of these commands uses the IMPUTEv2 format data, the second uses the TPED format data. In this example the two files contain identical genotype information, and you should ge identical results from these two analyses. In general, the IMPUTEv2 format can be used to provide "dosage" data rather than explicit SNP calls.
To run Example 3 under a Unix-like OS, type one of the following commands at a shell prompt.
./run_analysis.csh example_3/config_3_win.txt ./run_analysis.csh example_3/config_3_tped_win.txt
The analysis is run using windows of 50 SNPs. This gives 40 windows. The analysis of each window should take only a second or so. The results of the analysis are written to files in the example_3 subdirectory. These files are as follows.
This is a large example representing the size of data used in real experiments. The data for the example is not delivered with the HSreg package, but is generated by an R script, sim.data.r, that does come with the package. The script will generate data for 1000 cases, 1000 controls and 5000 SNPs. An effect SNP is included in the data as SNP number 77.
In subdirectory sim_1, there is a config file, and an R random number generator seed. If these are used when running the simulation, the results described in comments at the top of sim.data.r should be obtained. To generate the simulation data, start R, set the working directory to the main HSreg directory, and type:
Generating the simulated data takes some time (up to 5 minutes).
To run the analysis on the simulated data under Windows type the following at a command prompt.
run_analysis.bat sim_1\config_sim.txt > sim_1.out
To run the analysis under a Unix-like OS type the following at a shell prompt.
./run_analysis.csh sim_1/config_sim.txt > sim_1.out