Software

SIMreg: gene-trait similarity regression for marker-set association analysis (SIMreg includes HSreg as part of the package)
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "SIMreg codes" before download.
SIMreg is a tool for marker-set association analysis. Association analysis at the gene, pathway, or exon level (hereafter, marker-set analysis) holds great promise for evaluating modest etiological effects of genes with GWAS or sequence data. However, currently available methods target detection of either rare or common variants but not both, assume additive and same-direction effects for loci within a marker set, use test-based frameworks that cannot accommodate covariates such as population structure, and do not have the capacity to assess interaction effects. SIMreg provides a flexible, powerful and computationally efficient alternative for conducting marker-set analysis. It has the following features that distinguish it from other methods.
  1. The method uses genetic similarity to aggregate information across markers, and incorporates adaptive weights depending on allele frequencies to accommodate rare and common variants.
  2. Collapsing information at the similarity level instead of the genotype level avoids the cancellation of signals from opposite etiological effects, and is applicable to any class of genetic variant without having to dichotomize the allele types.
  3. It is regression-based, naturally incorporates covariates, and is applicable to both observed and imputed (dosage) genotypes.
  4. We use a rigorous analytical derivation to demonstrate that collapsing information through similarity status explicitly captures the locus-locus interactions among all markers in a set.
  5. It provides a series of test statistics that can be used to assess (a) marginal genetic main effect (G test), (b) gene-environment interaction effects (GxE test), or (c) the joint effects of both types simultaneously. These tests do not require permutations to assess significance, and are fast to compute.
SIMreg is an extension of (incorporates all features and functions of) HSreg.
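As an illustration of feature 1, the following Python sketch computes a frequency-weighted identity-by-state (IBS) similarity between individuals. It is not the SIMreg R code: the function names, the 0/1/2 genotype coding, and the weight 1/sqrt(q(1-q)) are assumptions chosen only to show how rarer variants can be up-weighted when similarity is aggregated across markers.

```python
from math import sqrt

def allele_weights(genotypes):
    """Assumed weighting: 1/sqrt(q(1-q)), with q the sample allele frequency."""
    n = len(genotypes)
    n_markers = len(genotypes[0])
    weights = []
    for m in range(n_markers):
        q = sum(row[m] for row in genotypes) / (2 * n)
        # Monomorphic markers carry no information; give them zero weight.
        weights.append(0.0 if q in (0.0, 1.0) else 1 / sqrt(q * (1 - q)))
    return weights

def weighted_ibs(g_i, g_j, weights):
    """Weighted IBS similarity between two genotype vectors, scaled to [0, 1]."""
    # 2 - |a - b| is the number of alleles shared at a marker (also valid for dosages).
    shared = sum(w * (2 - abs(a - b)) for a, b, w in zip(g_i, g_j, weights))
    return shared / (2 * sum(weights))

# Hypothetical genotypes: rows are individuals, columns are markers (0/1/2 coding).
genotypes = [
    [0, 1, 2],
    [0, 1, 2],
    [2, 1, 0],
    [0, 0, 1],
]
weights = allele_weights(genotypes)
similarity = [[weighted_ibs(a, b, weights) for b in genotypes] for a in genotypes]
```

Because the aggregation happens on pairwise similarity rather than on genotype counts, the sign of each marker's effect never enters the calculation, which is the point made in feature 2.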
R code and test datasets for CCRET, a New Method For Detecting Associations With Rare Copy-Number Variants (2015) by Tzeng J.Y., Magnusson, P.K.E., Sullivan, P.F., The Swedish Schizophrenia Consortium, Szatkiewicz, J.
R code and test datasets for "Module-based Association Analysis for Omics Data with Network Structure" (2013) by Zhi Wang, Arnab Maity, Chuhsing K. Hsiao, Deepak Voora, Rima Kaddurah-Daouk, Jung-Ying Tzeng; PLoS One, 10(3):e0122309
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "network KMR codes" before download.
R code for "Complete Effect-Profile Assessment in Association Studies with Multiple Genetic and Multiple Environmental Factors" (2013) by Zhi Wang, Arnab Maity, Megan Neely, Jung-Ying Tzeng; submitted.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "multi-G multi-E KMR codes" before download.
The R code implements a kernel machine regression to study the effect profile of a marker set and an environmental factor set, including the joint effect of G and E, the GxE effect, the conditional G effect and the conditional E effect.
R code and test datasets for "Analysis of Gene-Gene Interactions Using Gene-Trait Similarity Regression" (2014) by Xin Wang, Michael P. Epstein, Jung-Ying Tzeng; Human Heredity, In press.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "SimReg GxG QT codes" before download.
The R code implements a gene-trait similarity regression (SimReg) to perform gene-based tests for detecting gene-gene interactions.
R code and test datasets for "Integrative Gene Set Analysis of Multi-platform Data with Sample Heterogeneity" (2014) by Jun Hu and Jung-Ying Tzeng; Bioinformatics, In press.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "Multiplatform gene-set codes" before download.
In this manuscript, we investigate the performance of different multi-platform methods for gene set analysis using extensive simulation studies. When there is no sample heterogeneity, we found that INT (Tyekucheva S et al. 2011) and Hotelling's T2 method perform best among the methods compared. However, when sample heterogeneity is present, the existing methods suffer significant power loss. Sample heterogeneity is commonly encountered in complex diseases, and its impact on power is more substantial for multi-platform analysis than for single-platform analysis because the level of heterogeneity and the heterogeneous samples often vary across platforms. To address this issue, we proposed three strategies for multi-platform gene set analysis: MPMWS (Multi-Platform Mann-Whitney Statistics), MPORT (Multi-Platform Outlier Robust T-statistics) and MPLRS (Multi-Platform Likelihood Ratio Statistics). Simulation results suggested that the non-parametric MPMWS had superior and robust performance regardless of whether the degree of heterogeneity was low or high. Based on the results of the simulations and real data applications, we recommend using both MPMWS and INT, which appear to identify orthogonal yet relevant biological gene sets.
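The building block behind a Mann-Whitney-based approach can be sketched in a few lines of Python. The example below computes the Mann-Whitney U statistic for one gene on one platform; how MPMWS combines such statistics across genes and platforms is not described here and is not shown. The function name and the data values are hypothetical.

```python
def mann_whitney_u(cases, controls):
    """Number of (case, control) pairs with case > control; ties count as 1/2."""
    u = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical expression values for one gene on one platform.
controls = [1.0, 1.2, 0.9, 1.1, 1.0]
cases = [1.1, 0.95, 3.5, 3.9, 1.05]  # only two of the five cases are shifted

u = mann_whitney_u(cases, controls)
# Rescaled to [0, 1]: the proportion of case-control pairs where the case is larger.
auc = u / (len(cases) * len(controls))
```

Because U depends only on the ordering of the values, the two shifted cases contribute the same amount whether they are moderately or extremely elevated, which is one reason a rank-based statistic can stay stable when only a subset of samples carries the signal.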
WT-RNAseq: Weighted Test (WT) for Assessing RNA-Seq Differential Expression Levels with Low-confidence Aligned Reads.
Pongpanich*, Tzeng*, Nielsen 2012. [* Equal contribution.] Submitted.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "WT-RNAseq" before download.
WT-RNAseq is a test that incorporates the unmapped reads from short read tailored aligners (e.g., TopHat) into differential gene expression analysis. Specifically, we use BLAST to align the unmapped reads, and then assign each mapped read (by either BLAST or TopHat) a weight that reflects its mapping confidence. Then the expression level of a gene is quantified using the weighted count of the reads that mapped to that gene, and a statistical test based on those weighted counts is performed to detect differentially expressed genes.
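The weighted-count step can be sketched as follows in Python. This is an illustration only, not the WT-RNAseq code: the function name, the (gene, weight) input layout, and the restriction of weights to the interval [0, 1] are assumptions, and the derivation of the weights from alignment confidence is not shown.

```python
from collections import defaultdict

def weighted_gene_counts(reads):
    """Sum mapping-confidence weights per gene.

    `reads` is an iterable of (gene_id, weight) pairs; weights are assumed
    to lie in [0, 1], with 1 meaning a fully confident alignment.
    """
    counts = defaultdict(float)
    for gene, weight in reads:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight out of range: {weight}")
        counts[gene] += weight
    return dict(counts)

# Hypothetical reads: weight 1.0 stands in for confident aligner hits,
# lower weights for reads rescued by a second, more permissive alignment.
reads = [
    ("geneA", 1.0),
    ("geneA", 1.0),
    ("geneA", 0.5),
    ("geneB", 1.0),
    ("geneB", 0.25),
]
counts = weighted_gene_counts(reads)
```

With all weights equal to 1 the result reduces to an ordinary read count, so the weighted count can be seen as a generalization of the usual per-gene tally.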
HSreg: Haplotype Similarity Regression for Multi-marker Association Analysis.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "HSreg codes" before download.
HSreg implements a similarity-based regression method to detect associations between traits and multimarker genotypes. The method uses genetic/haplotype similarity to aggregate information from multiple polymorphic sites (e.g., SNPs or a mixture of different polymorphisms), and regresses trait similarities for pairs of "unrelated" individuals on their genetic similarities to assess the gene-trait association. The similarity regression allows for covariates, uses phase-independent similarity measures to bypass the need to impute phase information, and is applicable to traits of general types (e.g., quantitative and qualitative traits). It can be shown that the similarity model is equivalent to the random effects haplotype analysis and explicitly models the non-additive effects among markers. These features make it an ideal tool for evaluating association between phenotype and marker sets defined by haplotypes, genes, or pathways.
CasANOVA, haplo.CasGLM (original PLhap): Penalized-Likelihood Regression for Haplotype Specific Analysis.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "PLhap codes" before download.
PLhap implements a penalized regression approach to systematically evaluate the pattern and structure of haplotype effects. The method takes unphased genotype data and outputs the haplotype group structure based on their effect size. PLhap differs from the typical way of haplotype analysis, where haplotype inference focuses on relative effects compared with an arbitrarily chosen baseline haplotype. The typical analysis does not depict the effect structure unless an additional inference procedure is used in a secondary post hoc analysis, and such analysis tends to lack power. By putting an L1 penalty on the pairwise difference of the haplotype effects, PLhap avoids the need to choose a baseline haplotype, and simultaneously carries out effect estimation and effect comparison of all haplotypes. It can serve as a tool to comprehend candidate regions identified from a genome or chromosomal scan.
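The penalty term described above can be written down directly. The Python sketch below evaluates an L1 penalty on all pairwise differences of haplotype effects; the function name, the tuning parameter `lam`, and the effect values are hypothetical, and the likelihood term and the optimization itself are not shown.

```python
from itertools import combinations

def pairwise_l1_penalty(effects, lam):
    """lam * sum over all haplotype pairs (j, k) of |beta_j - beta_k|."""
    return lam * sum(abs(a - b) for a, b in combinations(effects, 2))

# Hypothetical effect estimates for four haplotypes.
grouped = [0.5, 0.5, 0.0, 0.0]    # two groups of equal effects
spread = [0.6, 0.4, 0.1, -0.1]    # same group means, but no exact ties
shifted = [1.5, 1.5, 1.0, 1.0]    # `grouped` plus a constant

penalty_grouped = pairwise_l1_penalty(grouped, lam=1.0)
penalty_spread = pairwise_l1_penalty(spread, lam=1.0)
penalty_shifted = pairwise_l1_penalty(shifted, lam=1.0)
```

Two properties are visible here. Effects that are exactly tied contribute nothing, so the penalty favors solutions in which haplotypes fall into groups of equal effect. Adding a constant to every effect leaves the penalty unchanged, which is why no baseline haplotype has to be chosen.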
MarkerQC: A Quality Control Algorithm for Filtering SNPs in Genomewide Association Studies.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "Marker QC filter" before download.
Marker QC implements an algorithm based on principal component analysis and clustering analysis to identify genotyping outliers. The method minimizes reliance on arbitrary cutoff values, allows a collective consideration of all QC features, and provides conditional thresholds contingent on the values of other QC variables (such as different missing-proportion thresholds for different minor allele frequencies).
Hap-clustering: R code for Evolutionary-based Haplotype Clustering.
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "Hap clustering codes" before download.
Hap-Clustering implements a regression-based approach using clustered haplotypes to assess haplotype-phenotype association. It generalizes the probabilistic clustering methods of Tzeng to the generalized linear model (GLM) framework established by Schaid et al. (2002). Hap-clustering uses unphased genotypes and accounts for both phase uncertainty and clustering uncertainty when performing association tests. Its GLM framework allows adjustment for covariates and can model qualitative and quantitative traits. It is best used to evaluate the overall haplotype association.
QSHS: R Code for Quadratic Statistics of Haplotype Similarity (QSHS).
To facilitate updating, please send tzeng@stat.ncsu.edu a blank email with subject "QSHS codes" before download.
QSHS implements a class of association tests based on haplotype similarity. Specifically, many measures of haplotype similarity can be expressed in the same quadratic form, and we give the general form of the variance. These methods can be applied to either phase-known or phase-unknown data.
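To make the quadratic form concrete, the Python sketch below evaluates p'Ap, where p is a vector of haplotype frequencies and A is a haplotype-by-haplotype similarity matrix, and contrasts cases with controls. This is an illustration under stated assumptions, not the QSHS code: the allele-matching similarity measure, the case-control contrast, the function names, and the frequencies are all chosen for the example, and the variance of the statistic is not computed.

```python
def quadratic_form(freqs, sim):
    """p' A p: expected similarity of two haplotypes drawn with frequencies p."""
    k = len(freqs)
    return sum(freqs[a] * sim[a][b] * freqs[b] for a in range(k) for b in range(k))

def match_similarity(h1, h2):
    """Proportion of loci at which two haplotypes carry the same allele."""
    return sum(x == y for x, y in zip(h1, h2)) / len(h1)

# Hypothetical three-locus haplotypes and their similarity matrix A.
haplotypes = ["000", "011", "110", "111"]
A = [[match_similarity(h, g) for g in haplotypes] for h in haplotypes]

# Hypothetical haplotype frequencies in cases and controls.
p_case = [0.7, 0.1, 0.1, 0.1]
p_ctrl = [0.25, 0.25, 0.25, 0.25]

# Excess haplotype sharing among cases relative to controls.
d = quadratic_form(p_case, A) - quadratic_form(p_ctrl, A)
```

Swapping in a different similarity measure only changes the matrix A; the quadratic form itself stays the same, which is what allows a single general treatment of many similarity-based statistics.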